
Answer-first summary for fast verification
Answer: C. Use repartition by range on the product ID column to partition the data based on the product ID values, aiming to distribute the skewed data more evenly across partitions.
When data is skewed on a single column (here, product ID), repartitioning by range on that column is the best fit among the listed options. Range partitioning picks boundaries from the observed distribution of product ID values, so heavily populated IDs tend to get narrow ranges and sparse IDs are grouped together, which evens out partition sizes for operations keyed on that column. Coalesce (Option A) merges existing partitions without a shuffle, so it cannot fix skew and may make it worse. Plain repartition (Option B) spreads rows evenly across more partitions but ignores the product ID column, so it does not target the skew. Rebalance (Option D) also aims for an even distribution but, as described here, offers no control over the partition count and does not address the skewed column. Option C is therefore the best choice for this job.
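To make the comparison concrete, here is a minimal pure-Python sketch of how hash-style and range-style partitioning place rows for a skewed key. It is a toy model, not Spark's implementation: the row counts are hypothetical, the "hash" is a plain modulo, and the range boundaries are simple quantiles of the sorted keys (Spark estimates its boundaries by sampling).

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical toy data: products 1 and 2 are the "hot" keys (100 rows total).
rows_per_product = {1: 40, 2: 20, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 5, 10: 5}
product_ids = [pid for pid, n in rows_per_product.items() for _ in range(n)]
NUM_PARTITIONS = 4


def hash_partition_sizes(keys, num_partitions):
    """Rows per partition when each key is sent to key % num_partitions."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def range_partition_sizes(keys, num_partitions):
    """Rows per partition when boundaries are quantiles of the sorted keys."""
    ordered = sorted(keys)
    # One upper boundary per partition except the last.
    boundaries = [
        ordered[len(ordered) * i // num_partitions]
        for i in range(1, num_partitions)
    ]
    # A key goes to the first range whose upper boundary is >= the key.
    counts = Counter(bisect_left(boundaries, key) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


hash_sizes = hash_partition_sizes(product_ids, NUM_PARTITIONS)
range_sizes = range_partition_sizes(product_ids, NUM_PARTITIONS)
print("hash :", hash_sizes)   # hash : [10, 50, 30, 10]
print("range:", range_sizes)  # range: [40, 20, 20, 20]
```

In this toy case the range boundaries isolate each hot product in its own partition and group the sparse products together, so the largest partition shrinks from 50 to 40 rows. All rows for a single product ID still land in one partition, so one extremely hot key can remain a bottleneck. In PySpark the corresponding call would be along the lines of `df.repartitionByRange(4, "product_id")`, where the DataFrame name and column name are assumptions for illustration.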
Author: LeetQuiz Editorial Team
You are a data engineer working on a project that involves processing a large dataset in Azure Databricks using Spark. The dataset contains sales data from the past five years, with each record including a timestamp, product ID, and sales amount. Your task is to optimize the performance of a batch processing job that aggregates sales by product ID. The dataset is currently skewed, with a few product IDs accounting for a significant portion of the data. Considering the need for efficient resource utilization and minimizing job execution time, which partitioning strategy would you choose to optimize the performance of your job? Choose the best option from the following:
A
Use coalesce to reduce the number of partitions without shuffling the data, aiming to decrease overhead but potentially increasing skew.
B
Use repartition to increase the number of partitions and redistribute the data evenly, which may not address the skew issue effectively.
C
Use repartition by range on the product ID column to partition the data based on the product ID values, aiming to distribute the skewed data more evenly across partitions.
D
Use rebalance to redistribute the data evenly across all partitions without considering the number of partitions, which may not effectively handle the skew.