A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcast and why?
Explanation:
In a broadcast join, the smaller DataFrame (B, 1 GB) should be broadcast: a full copy of B is sent to every executor, so each partition of the larger DataFrame (A, 128 GB) can be joined locally against that copy. Neither side is shuffled. Without the broadcast, a shuffle-based join such as a sort-merge join would repartition both A and B by the join key, and moving A's 128 GB across the network would dominate the cost. Options B and D both name B as the DataFrame to broadcast: B focuses on avoiding B's shuffle, while D emphasizes avoiding A's shuffle. Both statements are consistent with how a broadcast join works, but the saving comes almost entirely from not shuffling the far larger A. If only one answer may be selected, D gives the stronger justification.
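The mechanics described above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: the function name `broadcast_hash_join`, the partition layout, and the sample rows are all hypothetical stand-ins chosen for illustration. It shows the small side being turned into a lookup table once, and each partition of the large side being joined in place.

```python
# Pure-Python sketch of a broadcast hash join (inner join); no Spark required.
# The function name and all sample data are hypothetical.

def broadcast_hash_join(large_partitions, small_rows):
    """Join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once from the small side (the "broadcast" copy).
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already lives: rows of the large
        # side are never moved between partitions.
        joined = [
            (key, left, right)
            for key, left in partition
            for right in lookup.get(key, [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Stand-in for DataFrame A: three partitions, keys placed arbitrarily.
a_partitions = [
    [(1, "a1"), (2, "a2")],
    [(3, "a3"), (1, "a4")],
    [(2, "a5"), (4, "a6")],
]
# Stand-in for DataFrame B: small enough to copy to every partition.
b_rows = [(1, "b1"), (2, "b2"), (3, "b3")]

result = broadcast_hash_join(a_partitions, b_rows)
```

The output keeps A's original three partitions, which is the point: only B was copied. In PySpark the corresponding request is the `broadcast` function from `pyspark.sql.functions`, for example `A.join(broadcast(B), "key")`, where the column name `key` is an assumption. An explicit hint matters here because 1 GB is far above the default `spark.sql.autoBroadcastJoinThreshold` of 10 MB, so Spark would not pick a broadcast join for B on its own.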