
### Answer-First Summary

**Answer:** Use a broadcast join to force Spark to broadcast the smaller DataFrame.
#### Option A: Convert DataFrames to RDDs and apply a manual join operation to control data distribution.

- **Incorrect**. This loses the Catalyst optimizer, predicate pushdown, and other built-in optimizations. Manual RDD joins are low-level, error-prone, and rarely outperform the DataFrame API for skew handling.

#### Option B: Salt both DataFrames on the join key to distribute the skewed data more evenly across partitions.

- **Incorrect (not the most effective in this context)**. Salting is a strong manual technique for large-to-large skewed joins: append a random "salt" to the skewed key on one side, replicate the other side once per salt value, join on the composite key, then strip the salt. It balances the load but adds overhead (data duplication, extra steps). It is the right tool when broadcasting isn't possible and AQE isn't sufficient.

#### Option C: Use broadcast join to force Spark to broadcast the smaller DataFrame.

- **Correct**. Broadcast joins **avoid shuffling entirely**: the small DataFrame is copied to every executor, and the join runs locally against each partition of the larger DataFrame. That makes them inherently immune to shuffle-related skew. The question describes a scenario where one DataFrame **is** small enough to broadcast (typical of dimension-fact joins, where skew usually appears on the fact side). The reference explanation explicitly states that broadcast is the "most efficient" option because it minimizes shuffle, and in Databricks exams it is the standard recommendation for small-large joins with potential skew.

#### Option D: Increase the number of partitions specifically for the skewed key using custom partitioning.

- **Incorrect**. Simply increasing the partition count (`repartition` or `spark.sql.shuffle.partitions`) doesn't fix key-based skew: all records with the same key still hash to the same partition. Custom partitioners are complex to write and rarely resolve extreme skew effectively.

### Why C Over B, Even If Both Sides Could Be Large?
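The mechanics behind Option C can be illustrated without a Spark cluster: the small dimension table is materialized as a hash map that every worker receives in full, so each partition of the large fact table joins locally and the skewed side is never shuffled. A minimal plain-Python sketch (the table data here is hypothetical):

```python
# Broadcast (map-side) join sketch: the small dimension table becomes a
# hash map that every "executor" gets a full copy of, so each partition
# of the large fact table is joined locally -- no shuffle of the fact side.

def broadcast_join(fact_partitions, dim_rows, key_idx=0):
    # "Broadcast" step: build one lookup table from the small side.
    dim_lookup = {row[key_idx]: row for row in dim_rows}
    joined = []
    for partition in fact_partitions:      # each partition joins independently
        for row in partition:
            dim_row = dim_lookup.get(row[key_idx])
            if dim_row is not None:        # inner-join semantics
                joined.append(row + dim_row[1:])
    return joined

# Hypothetical data: the fact side is skewed on key "A".
fact = [[("A", 1), ("A", 2), ("A", 3)],   # skewed partition
        [("B", 4)]]
dim = [("A", "alpha"), ("B", "beta")]

print(broadcast_join(fact, dim))
# [('A', 1, 'alpha'), ('A', 2, 'alpha'), ('A', 3, 'alpha'), ('B', 4, 'beta')]
```

In Spark itself this is expressed with the `pyspark.sql.functions.broadcast` hint, e.g. `fact_df.join(broadcast(dim_df), "key")`; Spark also broadcasts automatically when the small side is below `spark.sql.autoBroadcastJoinThreshold`.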
- The question says "one join key has significantly more data," but it does not state that both DataFrames are large or that broadcasting is impossible. In many real-world skew scenarios (e.g., a fact table skewed on a key joining to a dimension table), one side **is** small enough to broadcast.
- Databricks (with AQE enabled by default) automatically handles many skew cases in large joins, but among the given options, broadcast takes priority when applicable because it eliminates the shuffle entirely.
- Exam questions usually test for the **optimal** choice: broadcasting is simpler and faster than salting whenever it is feasible. If the scenario strictly excluded broadcast (both sides huge), salting (B) would be the better answer, but given the wording and the reference, C is the intended answer.

### Reference Answer

The correct answer is **C**.

For Databricks Certified Data Engineer - Professional preparation:

- Prioritize broadcast joins for any small-large join to avoid shuffle and skew issues altogether.
- Use salting or AQE skew handling for confirmed large-large skewed joins.
- Check task metrics in the Spark UI to diagnose skew.
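For the large-large case where Option B applies, the salting steps described above (salt the skewed side, replicate the other side once per salt value, join on the composite key, then drop the salt) can be sketched in plain Python. The data and `NUM_SALTS` value are hypothetical:

```python
import random

NUM_SALTS = 3  # number of buckets to spread each hot key across

def salt_skewed_side(rows, key_idx=0):
    # Append a random salt to each key so records for the same hot key
    # land in different composite-key buckets (i.e., different partitions).
    return [((row[key_idx], random.randrange(NUM_SALTS)),) + row[1:] for row in rows]

def explode_other_side(rows, key_idx=0):
    # Replicate every row once per salt value so it can match any bucket.
    return [((row[key_idx], s),) + row[1:] for row in rows for s in range(NUM_SALTS)]

def join_on_composite_key(left, right):
    index = {}
    for row in right:
        index.setdefault(row[0], []).append(row)
    # Join on (key, salt), then strip the salt to recover the original key.
    return [(l[0][0],) + l[1:] + r[1:] for l in left for r in index.get(l[0], [])]

fact = [("A", 1), ("A", 2), ("A", 3), ("B", 4)]   # skewed on key "A"
dim = [("A", "alpha"), ("B", "beta")]

result = sorted(join_on_composite_key(salt_skewed_side(fact), explode_other_side(dim)))
print(result)
# [('A', 1, 'alpha'), ('A', 2, 'alpha'), ('A', 3, 'alpha'), ('B', 4, 'beta')]
```

In Spark the same pattern typically uses `rand()` to generate the salt column on the skewed side and an exploded literal array of salt values on the other side. With AQE's skew-join handling enabled (`spark.sql.adaptive.skewJoin.enabled`), Spark can instead split oversized partitions automatically, which is why salting is reserved for cases AQE doesn't cover.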
Author: LeetQuiz Editorial Team
When dealing with data skew in Spark joins between a large fact table (where one join key has significantly more records than others) and a smaller dimension table, which method is most effective to reduce performance issues caused by shuffle skew?
A
Convert DataFrames to RDDs and apply a manual join operation to control data distribution.
B
Salt both DataFrames on the join key to distribute the skewed data more evenly across partitions.
C
Use broadcast join to force Spark to broadcast the smaller DataFrame.
D
Increase the number of partitions specifically for the skewed key using custom partitioning.