
Databricks Certified Data Engineer - Professional
To optimize performance in Apache Spark when joining a DataFrame df with another DataFrame lookupDf on a common key, which method should you use to leverage broadcast variables effectively?
Explanation:
The correct approach is df.join(broadcast(lookupDf), 'key') (in PySpark, broadcast is imported from pyspark.sql.functions). The broadcast hint tells Spark to send a read-only copy of lookupDf to every executor, so each partition of df can be joined locally and df itself does not have to be shuffled across the network. This works best when lookupDf is small enough to fit in memory on each executor while df is large. It also helps when join keys are unevenly distributed, because rows for a frequently occurring key stay in their original partitions instead of being collected into one. The other options either fail to use broadcasting effectively or risk unnecessary memory usage and performance problems.
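
To make the mechanism concrete, here is a minimal plain-Python model of a broadcast hash join. It is not Spark code and does not reflect Spark's internal implementation. The function name broadcast_hash_join, the partition layout, and the sample rows are all hypothetical, chosen only to show why the large side never has to move.

```python
# Plain-Python model of a broadcast hash join (illustration only, not Spark code).
# Function name, partition layout, and sample rows are hypothetical.

def broadcast_hash_join(partitions, lookup_rows):
    """Inner-join each partition of the large side against a copy of the small side."""
    # "Broadcast" step: build one hash map from the small table, keyed by join key.
    # In Spark, every executor would receive a read-only copy of this data.
    lookup = {}
    for key, label in lookup_rows:
        lookup.setdefault(key, []).append(label)

    joined = []
    for partition in partitions:
        # Each partition is probed in place. Rows of the large side never move
        # between partitions, which is why no shuffle is needed.
        out = []
        for key, value in partition:
            for label in lookup.get(key, []):  # inner join: unmatched keys drop out
                out.append((key, value, label))
        joined.append(out)
    return joined


# Large side, already split across three "executors"; key 1 is heavily skewed.
df_partitions = [
    [(1, "a"), (1, "b"), (2, "c")],
    [(1, "d"), (3, "e")],
    [(1, "f"), (4, "g")],
]
# Small side, assumed to fit in memory on every executor.
lookup_rows = [(1, "one"), (2, "two"), (3, "three")]

result = broadcast_hash_join(df_partitions, lookup_rows)
for i, part in enumerate(result):
    print(i, part)
```

In this model the rows for the skewed key 1 stay spread over all three partitions. A shuffle-based join would instead route every row with key 1 to a single partition, which is the imbalance that broadcasting avoids.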