Databricks Certified Data Engineer - Professional


To optimize performance in Apache Spark when joining a DataFrame df with another DataFrame lookupDf on a common key, which method should you use to leverage broadcast variables effectively?




Explanation:

The correct approach is to use df.join(broadcast(lookupDf), 'key'), because the broadcast hint instructs Spark to send a read-only copy of lookupDf to every executor. Each partition of df is then joined against that local copy, so the rows of df do not have to be shuffled across the network by join key. The benefit is greatest when lookupDf is small enough to fit comfortably in memory on each executor while df is large or has unevenly distributed join keys, since skipping the shuffle also avoids oversized partitions for hot keys. The other options either do not trigger a broadcast join or risk unnecessary memory usage and performance problems.
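
To make the mechanism concrete, here is a minimal pure-Python sketch of what a broadcast hash join does conceptually. It is an illustration under stated assumptions, not Spark's implementation: the function name broadcast_hash_join, the list-of-lists stand-in for partitions, and the sample rows are all hypothetical. In PySpark itself the hint used above is imported with "from pyspark.sql.functions import broadcast".

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows, key):
    """Inner-join each partition of the large side against one shared copy of the small side."""
    # Build the hash table once from the small side. In Spark this plays the
    # role of the read-only copy that is shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in big_partitions:
        # Each partition is processed where it already lives: its rows are
        # never regrouped by key, which is the shuffle the broadcast avoids.
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined.append(out)
    return joined


# Hypothetical data: two partitions of a large table and a small lookup table.
df_partitions = [
    [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}],
    [{"key": 1, "amount": 30}, {"key": 3, "amount": 40}],
]
lookup_rows = [{"key": 1, "label": "a"}, {"key": 2, "label": "b"}]

result = broadcast_hash_join(df_partitions, lookup_rows, "key")
```

The output keeps the same number of partitions as the large input, and rows whose key has no match in the lookup table (key 3 here) are dropped, as in an inner join. The cost of the approach is that the whole lookup table must be held in memory wherever a partition is processed, which is why it suits a small lookupDf.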