
Answer-first summary for fast verification
Answer: Decreasing the number of partitions using coalesce()
Decreasing the number of partitions with coalesce() merges existing partitions without a full shuffle, whereas repartition() always shuffles, so coalesce() lowers overhead when fewer partitions are needed. Coalescing too aggressively, however, can leave partitions unevenly sized or leave too few tasks to keep the cluster busy. Maintaining default partitioning lets Spark manage partition counts, which works in some cases, but explicit tuning based on data size and cluster capacity usually gives better shuffle behavior. Increasing the number of partitions requires repartition(), which itself triggers a full shuffle. Broadcasting a small table in a join can also reduce shuffling significantly: the small table is sent to every node, so a shuffle join is avoided.
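The difference between merging partitions and fully shuffling them can be sketched with a toy model in plain Python. This is not the Spark API: the functions `coalesce` and `repartition` below are hypothetical stand-ins that operate on lists, the round-robin merge order is an assumption (Spark picks groupings based on data locality), and the hash routing is simplified to a modulo. In Spark itself the corresponding calls are `df.coalesce(n)` and `df.repartition(n)`.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge whole partitions into n groups.

    Records never leave their original partition-mates, which mirrors
    the "no full shuffle" behaviour. Asking for more partitions than
    exist leaves the layout unchanged.
    """
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assumption: round-robin assignment of whole partitions.
        merged[i % n].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: full shuffle, every record is routed individually."""
    out: List[Partition] = [[] for _ in range(n)]
    for part in partitions:
        for rec in part:
            # Simplified hash partitioner: route by value modulo n.
            out[rec % n].append(rec)
    return out


# Four unevenly sized input partitions holding the records 1..12.
partitions = [list(range(1, 9)), [9, 10], [11], [12]]

coalesced_sizes = [len(p) for p in coalesce(partitions, 2)]
shuffled_sizes = [len(p) for p in repartition(partitions, 2)]

print("coalesce(2) sizes:   ", coalesced_sizes)   # [9, 3]
print("repartition(2) sizes:", shuffled_sizes)    # [6, 6]
```

In this sketch the merged layout keeps the input imbalance (9 records versus 3), while the full shuffle evens it out (6 and 6) at the cost of moving every record. That is the trade-off the explanation describes: less data movement, but a risk of skew.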
Author: LeetQuiz Editorial Team
Which strategy is most effective for reducing shuffling and enhancing query performance when optimizing a Spark job that processes a large Delta Lake table?
A. Using broadcast variables to minimize data transfer
B. Maintaining default partitioning to let Spark decide
C. Decreasing the number of partitions using coalesce()
D. Increasing the number of partitions to maximize parallelism