
Answer-first summary for fast verification
Answer: Apply repartition based on the aggregation key before performing the aggregation.
Repartitioning the data by the aggregation key before aggregating is the most effective strategy here. It co-locates all records with the same key in the same partition, so the subsequent aggregation can combine them without moving additional data across the network. The other options are less targeted: coalesce (option A) merges partitions without regard to keys, so same-key records still end up scattered; increasing default parallelism (option B) changes task granularity but not the total volume of data shuffled; and local pre-aggregation with mapPartitions (option D) can shrink the shuffle but does not guarantee key co-location the way key-based repartitioning does.
Author: LeetQuiz Editorial Team
Your Spark application is experiencing performance issues due to large shuffle operations during aggregation. Which of the following strategies would most effectively reduce the size of these shuffle operations?
A
Use coalesce to reduce the number of partitions just before aggregation.
B
Increase the default parallelism to create more tasks and thus reduce data per task.
C
Apply repartition based on the aggregation key before performing the aggregation.
D
Aggregate data locally on each partition with mapPartitions before the global aggregation.
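For contrast with option D, here is a hedged plain-Python sketch of map-side pre-aggregation, the idea behind combining within each partition via mapPartitions before a global aggregation. The names and data are invented for illustration.

```python
from collections import defaultdict

def local_preaggregate(partition):
    """Combine values per key within a single partition before the
    shuffle, analogous to map-side combining (option D)."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return list(totals.items())

partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

before = sum(len(p) for p in partitions)
combined = [local_preaggregate(p) for p in partitions]
after = sum(len(p) for p in combined)

# Six input records shrink to four partial sums: at most one record
# per distinct key per partition crosses the network in the global
# aggregation step.
assert (before, after) == (6, 4)
```

This does reduce shuffled bytes, but note that keys like `"a"` still exist in both partitions afterward, so a shuffle is still required to finish the aggregation; key-based repartitioning (option C) removes that need entirely.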