
Answer-first summary for fast verification
Answer: B and D. (B) DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself. (D) DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
In a broadcast join, Spark sends a full copy of the smaller DataFrame (B, 1 GB) to every executor, so the larger DataFrame (A, 128 GB) never has to be shuffled: each partition of A joins locally against its copy of B. B is not shuffled either, because it is distributed by broadcast rather than repartitioned by join key. Options B and D each describe one half of this reasoning: B names the avoided shuffle of B, while D names the avoided shuffle of A, which is the far larger saving because A is 128 times the size of B. Thus, both B and D are correct, with D stating the main benefit.
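The difference can be made concrete with a small pure-Python model. This is only a sketch under simplifying assumptions, not Spark's implementation: partitions are plain lists of `(key, value)` rows, "moved" simply counts rows that leave their original partition, and B is copied once per partition of A as a stand-in for once per executor. The function names `shuffle_hash_join` and `broadcast_hash_join` and the toy data are hypothetical.

```python
from collections import defaultdict


def shuffle_hash_join(a_parts, b_parts, num_partitions):
    """Repartition BOTH sides by join key, then join bucket by bucket."""
    a_buckets = defaultdict(list)
    b_buckets = defaultdict(list)
    stats = {"a_moved": 0, "b_moved": 0}
    for part in a_parts:
        for key, val in part:
            a_buckets[hash(key) % num_partitions].append((key, val))
            stats["a_moved"] += 1  # every row of A goes through the shuffle
    for part in b_parts:
        for key, val in part:
            b_buckets[hash(key) % num_partitions].append((key, val))
            stats["b_moved"] += 1  # every row of B goes through the shuffle
    out = []
    for p in range(num_partitions):
        lookup = defaultdict(list)
        for key, val in b_buckets[p]:
            lookup[key].append(val)
        for key, a_val in a_buckets[p]:
            for b_val in lookup.get(key, []):
                out.append((key, a_val, b_val))
    return out, stats


def broadcast_hash_join(a_parts, b_parts):
    """Copy all of B next to every partition of A; rows of A never move."""
    lookup = defaultdict(list)
    b_rows = 0
    for part in b_parts:
        for key, val in part:
            lookup[key].append(val)
            b_rows += 1
    # B is copied once per partition of A (modelling one copy per executor).
    stats = {"a_moved": 0, "b_moved": b_rows * len(a_parts)}
    out = []
    for part in a_parts:  # each partition of A joins locally
        for key, a_val in part:
            for b_val in lookup.get(key, []):
                out.append((key, a_val, b_val))
    return out, stats


# Toy data: A has 12 rows in 3 partitions (keys 0..3), B has 2 rows in 1 partition.
a_parts = [[(i % 4, f"a{i}") for i in range(p * 4, p * 4 + 4)] for p in range(3)]
b_parts = [[(0, "b0"), (1, "b1")]]

shuffle_rows, shuffle_stats = shuffle_hash_join(a_parts, b_parts, num_partitions=3)
broadcast_rows, broadcast_stats = broadcast_hash_join(a_parts, b_parts)

print(shuffle_stats)
print(broadcast_stats)
print(sorted(shuffle_rows) == sorted(broadcast_rows))
```

Both strategies return the same joined rows, but only the shuffle version moves any rows of A. In PySpark the corresponding hint is `pyspark.sql.functions.broadcast`, for example `df_a.join(broadcast(df_b), "key")`, where `df_a`, `df_b`, and `key` are placeholder names.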
Author: LeetQuiz Editorial Team
A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcast and why?
A
Either DataFrame can be broadcasted. Their results will be identical in result and efficiency.
B
DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
C
DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
D
DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
E
DataFrame A should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.