
Explanation:
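The logical error is that the larger DataFrame, employeesDF, is being broadcast instead of the smaller storesDF (A). A broadcast join sends a full copy of one DataFrame to every executor so that the other side can be joined in place without being shuffled. That only pays off when the broadcast side is small: broadcasting the larger DataFrame increases network traffic and memory pressure on the driver and on each executor. The corrected call is employeesDF.join(broadcast(storesDF), "storeId"). The other options are incorrect: Spark 3 still supports explicit broadcast() hints (B); wrapping the entire join in broadcast() would mark the join's result for broadcasting rather than either input, so it does not fix this join (C); an explicit broadcast() hint does not depend on spark.sql.autoBroadcastJoinThreshold, which governs automatic broadcasting (D); and a broadcast join broadcasts only one side (E).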
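The reasoning above can be illustrated with a small pure-Python sketch. It is not Spark's implementation, and the names broadcast_hash_join and rows_sent_over_network are hypothetical helpers invented for this example. It assumes the broadcast side is copied in full to every executor and uses a deliberately rough cost model to show why the smaller side should be the one that is broadcast.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once from the side being broadcast.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:
        # In a cluster, each executor would receive its own copy of `lookup`;
        # the large side's rows stay in place and are not shuffled.
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


def rows_sent_over_network(broadcast_side_rows, num_executors):
    # Rough cost model (an assumption): the broadcast side is copied to every executor.
    return broadcast_side_rows * num_executors


# Toy data standing in for storesDF (small) and employeesDF (large, partitioned).
stores = [{"storeId": 1, "city": "Oslo"}, {"storeId": 2, "city": "Lima"}]
employee_partitions = [
    [{"storeId": 1, "name": "Ana"}, {"storeId": 2, "name": "Bo"}],
    [{"storeId": 1, "name": "Cy"}],
]

result = broadcast_hash_join(employee_partitions, stores, "storeId")
```

Under this model, broadcasting a 2-row table to 4 executors moves 8 rows, while broadcasting a table with a million rows would move 4 million, which is why the hint belongs on the smaller DataFrame.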
Identify the logical error in the following code block intended to efficiently perform a broadcast join between DataFrame storesDF and the much larger DataFrame employeesDF using the key column storeId. The current implementation may contain inefficiencies.
Code block:
storesDF.join(broadcast(employeesDF), "storeId")
A
The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
B
There is never a need to call the broadcast() operation in Apache Spark 3.
C
The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.
D
The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.
E
Only one of the DataFrames is being broadcasted rather than both of the DataFrames.