
Answer-first summary for fast verification
Answer: The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
Option A is correct: the code broadcasts the larger DataFrame `employeesDF` instead of the smaller `storesDF`. A broadcast join sends a full copy of one DataFrame to every executor so that the other side can be joined in place without a shuffle, which only pays off when the broadcast side is small. Broadcasting the larger DataFrame wastes network bandwidth and puts memory pressure on every executor that receives the copy. The hint belongs on the smaller side, for example `employeesDF.join(broadcast(storesDF), "storeId")`. The other options are incorrect: Spark 3 still accepts explicit `broadcast()` hints (B); wrapping the entire join in `broadcast()` would mark the join result, not either input, for broadcasting (C); the `broadcast()` hint does not depend on `spark.sql.autoBroadcastJoinThreshold` being set manually (D); and a broadcast join needs only one side to be broadcast (E).
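To make the reasoning concrete, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is a toy model, not Spark's implementation: the function name `broadcast_hash_join`, the sample rows, and the list-of-lists stand-in for partitions are all assumptions made for illustration. The small side is turned into a lookup table once, and each partition of the large side probes that table locally.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Toy inner join: hash the small side, probe it with the large side."""
    # Build the lookup table once from the small side. In Spark, this is
    # the data that would be copied to every executor, so it must be small.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    # Each partition of the large side is probed where it sits; rows of the
    # large side are never moved between partitions (no shuffle).
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small stores table and a larger,
# partitioned employees table.
stores = [{"storeId": 1, "city": "Oslo"}, {"storeId": 2, "city": "Lima"}]
employee_partitions = [
    [{"storeId": 1, "name": "Ada"}, {"storeId": 2, "name": "Bo"}],
    [{"storeId": 1, "name": "Cy"}, {"storeId": 3, "name": "Di"}],
]

result = broadcast_hash_join(stores, employee_partitions, "storeId")
for row in result:
    print(row)
```

In this model, the cost of the copy grows with the size of the side passed as `small_rows`, which is why the hint should go on `storesDF` rather than `employeesDF`.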
Author: LeetQuiz Editorial Team
Identify the logical error in the following code block intended to efficiently perform a broadcast join between DataFrame storesDF and the much larger DataFrame employeesDF using the key column storeId. The current implementation may contain inefficiencies.
Code block:
storesDF.join(broadcast(employeesDF), "storeId")
A
The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
B
There is never a need to call the broadcast() operation in Apache Spark 3.
C
The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.
D
The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.
E
Only one of the DataFrames is being broadcasted rather than both of the DataFrames.