
Describe a scenario where you would use repartition over coalesce in PySpark. Explain the reasoning behind your choice and the impact on data shuffling and performance.
A
Use repartition when you need to increase the number of partitions and balance the data across them; this triggers a full shuffle, but the even distribution can improve query performance.
B
Use repartition when you need to decrease the number of partitions without considering data distribution, which is less efficient than using coalesce.
C
Use repartition for all scenarios as it is more versatile and always leads to better performance than coalesce.
D
Use repartition only when dealing with very small datasets where data shuffling is not a concern.
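
To make the trade-off in the options above concrete, here is a toy model in plain Python. It is not Spark code and does not reproduce Spark's internals. The functions `coalesce` and `repartition` below are hypothetical stand-ins written for this illustration, assuming only that coalesce merges whole existing partitions without redistributing rows and cannot raise the partition count, while repartition redistributes every row (a full shuffle). In PySpark the corresponding calls are `DataFrame.coalesce(n)` and `DataFrame.repartition(n)`.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge whole partitions into n groups; rows are never split up."""
    if n >= len(partitions):
        # Assumption mirrored from Spark: coalesce cannot increase the count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # each source partition moves as one unit
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: full shuffle, every row is dealt round-robin to n partitions."""
    rows = [row for part in partitions for row in part]
    out: List[Partition] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Illustrative data only: 100 rows, heavily skewed across 3 partitions.
skewed = [list(range(0, 90)), list(range(90, 95)), list(range(95, 100))]

sizes_coalesce_up = [len(p) for p in coalesce(skewed, 6)]
sizes_repartition_up = [len(p) for p in repartition(skewed, 6)]
sizes_coalesce_down = [len(p) for p in coalesce(skewed, 2)]

print(sizes_coalesce_up)     # [90, 5, 5]  (count unchanged, skew remains)
print(sizes_repartition_up)  # [17, 17, 17, 17, 16, 16]  (balanced, all rows moved)
print(sizes_coalesce_down)   # [95, 5]  (fewer partitions, skew remains)
```

In this sketch, only the full shuffle both raises the partition count and evens out the skew, which is the scenario option A describes. When the goal is simply fewer partitions and balance does not matter, merging without a shuffle is the cheaper path.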