
Answer-first summary for fast verification
Answer: Using `df.join(broadcast(lookupDf), 'key')`
The correct approach is `df.join(broadcast(lookupDf), 'key')`, because broadcasting `lookupDf` instructs Spark to ship a read-only copy of `lookupDf` to every executor. Since `lookupDf` is then available locally on each node, Spark can join each partition of `df` in place instead of shuffling both sides across the network. This is most effective when `lookupDf` is small enough to fit comfortably in executor memory while `df` is large or has skewed join keys. The other options either fail to use broadcasting at all, or (as with broadcasting both DataFrames) waste memory without avoiding the shuffle of the large side.
Author: LeetQuiz Editorial Team
To optimize performance in Apache Spark when joining a DataFrame df with another DataFrame lookupDf on a common key, which method should you use to leverage broadcast variables effectively?
A
Partitioning both DataFrames by 'key' before joining
B
Broadcasting both DataFrames before joining
C
Using df.join(broadcast(lookupDf), 'key')
D
Applying lookupDf.join(df, 'key') without broadcasting