Describe a scenario where you would use repartition over coalesce in PySpark. Explain the reasoning behind your choice and the impact on data shuffling and performance.
A. Use repartition when you need to increase the number of partitions and ensure a balanced distribution of data, which involves full data shuffling but can improve query performance.
B. Use repartition when you need to decrease the number of partitions without considering data distribution, which is less efficient than using coalesce.
C. Use repartition for all scenarios as it is more versatile and always leads to better performance than coalesce.
D. Use repartition only when dealing with very small datasets where data shuffling is not a concern.
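For context, here is a minimal PySpark sketch contrasting the two calls; the local SparkSession setup and the example DataFrame are illustrative assumptions, not part of the question.

```python
from pyspark.sql import SparkSession

# Assumed local session purely for demonstration.
spark = SparkSession.builder.master("local[4]").appName("repartition-vs-coalesce").getOrCreate()

df = spark.range(0, 1_000_000)           # example DataFrame with 1M rows
print(df.rdd.getNumPartitions())         # initial partition count (depends on defaults)

# repartition: triggers a full shuffle, can increase the partition count,
# and rebalances rows evenly across the new partitions.
balanced = df.repartition(16)
print(balanced.rdd.getNumPartitions())   # 16

# coalesce: narrow transformation that only merges existing partitions,
# so it reduces the count cheaply but may leave partitions skewed.
merged = balanced.coalesce(4)
print(merged.rdd.getNumPartitions())     # 4

spark.stop()
```

This illustrates the trade-off behind option A: repartition pays the cost of a shuffle to get more, evenly sized partitions, while coalesce avoids the shuffle and is therefore the cheaper choice when you only need to reduce the partition count.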