
Answer-first summary for fast verification
Answer: A copy of df2 will be sent to all worker nodes to facilitate the join.
The `broadcast` function in PySpark marks a DataFrame as small enough to use in a broadcast join. It is a hint, not a join type, so the join remains an inner join on `id`. When the query executes, a full copy of the broadcast DataFrame (df2 in this case) is sent to every worker node, so each node can join its own partitions of df1 locally instead of shuffling df1 across the cluster. This speeds up joins when one of the DataFrames is small enough to fit in memory on each node.
Author: LeetQuiz Editorial Team
A data engineer is optimizing a join operation between two DataFrames, df1 and df2, using the following query: `joined_df = df1.join(broadcast(df2), 'id', 'inner')`. Which statement accurately describes how this join operation works?
A. The join operation will fail because 'inner' should be replaced with 'broadcast'.
B. A copy of df2 will be sent to all worker nodes to facilitate the join.
C. The join operation will fail because 'broadcast_df' should be used instead of 'broadcast'.
D. Only the first 10 MB of data from df2 will be used in the join.
E. The result of the join, joined_df, will be broadcasted to all worker nodes due to the use of the broadcast function.