
Answer-first summary for fast verification
Answer: Prefer narrow transformations such as `select`, `filter`, and `where`, as they operate within a single partition and do not cause data shuffling.
Option A is the most effective strategy for minimizing data shuffling in a PySpark application. Narrow transformations such as `select`, `filter`, and `where` (an alias of `filter`) are computed within each partition independently, so no data needs to move across the cluster between stages. For a large dataset this matters: avoiding shuffle stages eliminates the serialization, network transfer, and disk I/O that data movement incurs, which is often the dominant cost in a Spark job. The other options are weaker fits. Option B's advice to limit wide transformations such as `groupBy`, `join`, and `agg` to essential operations is sound, but it addresses shuffling less directly than favoring narrow operations does. Option C (`persist()`/`cache()`) speeds up repeated access to intermediate results but does not reduce shuffling at all. Option D would actually increase shuffling, because `repartition()` itself triggers a full shuffle.
Author: LeetQuiz Editorial Team
You are designing a PySpark application to process a large dataset for a financial analytics project. The dataset includes transaction records from the past year, and your application needs to perform various transformations to calculate monthly spending trends. Given the project's requirements for timely insights and the dataset's size, you aim to optimize the job's performance by minimizing data shuffling. Which of the following strategies would be the MOST effective to achieve this goal, and why? Choose one option.
A
Implement narrow transformations like `select`, `filter`, and `where` extensively, as they operate on single partitions without causing data shuffling.
B
Limit the use of wide transformations such as `groupBy`, `join`, and `agg` to essential operations only, since they require data shuffling across partitions.
C
Apply the `persist()` or `cache()` methods to intermediate datasets to speed up access times, though this does not directly reduce data shuffling.
D
Increase the number of partitions using the `repartition()` method for each transformation to distribute the workload more evenly, potentially at the cost of increased shuffling.