
Answer-first summary for fast verification
Answer: Use `repartition` when you need to increase the number of partitions and ensure a balanced distribution of data, which involves full data shuffling but can improve query performance.
`repartition` is the right choice when you need to increase the number of partitions or rebalance unevenly sized ones. It performs a full shuffle and spreads rows roughly evenly across the new partitions, whereas `coalesce` only merges existing partitions without a full shuffle, so it cannot increase the partition count or correct skew. The shuffle has a cost, but it can significantly improve query performance by letting more tasks process the data in parallel.
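The difference can be illustrated with a small pure-Python sketch. This is a toy model, not Spark's implementation: the functions `coalesce` and `repartition` below are hypothetical stand-ins that operate on plain lists, and the grouping rule used for merging is a simplifying assumption. It only aims to show why merging whole partitions leaves skew in place while a full shuffle balances the row counts.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model of coalesce: merge whole partitions, never move single rows."""
    if n >= len(partitions):
        # Asking for more partitions than exist changes nothing in this model.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each source partition is appended wholesale to one target partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model of repartition: full shuffle, rows dealt out round-robin."""
    shuffled: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        shuffled[i % n].append(row)
    return shuffled


# Two skewed partitions: 9 rows and 1 row.
skewed = [list(range(9)), [9]]

print([len(p) for p in coalesce(skewed, 4)])     # [9, 1]
print([len(p) for p in repartition(skewed, 4)])  # [3, 3, 2, 2]
```

In PySpark itself the corresponding calls are `df.repartition(n)` and `df.coalesce(n)`, and `df.rdd.getNumPartitions()` reports the resulting partition count.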
Author: LeetQuiz Editorial Team
Describe a scenario where you would use repartition over coalesce in PySpark. Explain the reasoning behind your choice and the impact on data shuffling and performance.
A
Use repartition when you need to increase the number of partitions and ensure a balanced distribution of data, which involves full data shuffling but can improve query performance.
B
Use repartition when you need to decrease the number of partitions without considering data distribution, which is less efficient than using coalesce.
C
Use repartition for all scenarios as it is more versatile and always leads to better performance than coalesce.
D
Use repartition only when dealing with very small datasets where data shuffling is not a concern.