You are a data engineer working on a project that involves processing a large dataset in Azure Databricks using Spark. The dataset contains sales data from the past five years, with each record including a timestamp, product ID, and sales amount. Your task is to optimize the performance of a batch processing job that aggregates sales by product ID. The dataset is currently skewed, with a few product IDs accounting for a significant portion of the data. Considering the need for efficient resource utilization and minimizing job execution time, which partitioning strategy would you choose to optimize the performance of your job? Choose the best option from the following:

Simulated

Use coalesce to reduce the number of partitions without shuffling the data, aiming to decrease overhead but potentially increasing skew.

14.9%

Use repartition to increase the number of partitions and redistribute the data evenly, which may not address the skew issue effectively.

21.1%

Use repartitionByRange on the product ID column to partition the data based on the product ID values, aiming to distribute the skewed data more evenly across partitions.

51.7%

Use rebalance to redistribute the data evenly across all partitions without considering the number of partitions, which may not effectively handle the skew.

12.4%
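
To make the trade-offs between the options above more concrete, here is a minimal sketch in plain Python that models only where rows land under two of the strategies: round-robin placement (what `repartition(n)` does when no column is given) and range placement on the product ID. It is a toy simulation, not Spark's implementation. The dataset shape (one hot product holding 8,000 of 10,000 rows), the function names, and the use of exact quantiles instead of Spark's sampled range boundaries are all assumptions made for illustration.

```python
import bisect

def make_skewed_keys(n_hot=8000, n_other=2000, n_products=20):
    """Hypothetical skewed dataset: product 0 is the hot key,
    products 1..n_products-1 share the remaining rows evenly."""
    keys = [0] * n_hot
    keys += [1 + (i % (n_products - 1)) for i in range(n_other)]
    return keys

def round_robin_sizes(keys, num_partitions):
    """Rows are dealt to partitions in turn, ignoring the key."""
    sizes = [0] * num_partitions
    for i in range(len(keys)):
        sizes[i % num_partitions] += 1
    return sizes

def range_sizes_by_key(keys, num_partitions):
    """Rows are placed by key range. Boundaries are exact quantiles of
    the keys here (an assumption; Spark estimates them by sampling).
    All rows with the same key always share one partition."""
    ordered = sorted(keys)
    bounds = sorted({
        ordered[(len(ordered) * i) // num_partitions]
        for i in range(1, num_partitions)
    })
    sizes = [0] * num_partitions
    for k in keys:
        sizes[bisect.bisect_left(bounds, k)] += 1
    return sizes

keys = make_skewed_keys()
rr_sizes = round_robin_sizes(keys, 4)
range_sizes = range_sizes_by_key(keys, 4)
print("round-robin:", rr_sizes)
print("range on product ID:", range_sizes)
```

With these assumed inputs, the round-robin sizes come out as `[2500, 2500, 2500, 2500]` and the range sizes as `[8000, 2000, 0, 0]`: key-based placement cannot split a single hot product ID across partitions. The sketch does not model the shuffle that a later aggregation by product ID would trigger, so it should be read as an illustration of row placement only, not as a verdict on which option is correct.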

Databricks Certified Data Engineer - Professional
