
In the context of optimizing Spark jobs for large datasets, you are tasked with performing multiple aggregations across different columns. Given the need for efficiency, scalability, and minimal data shuffling, which of the following strategies is the BEST approach? Choose one option.
A
Utilize the groupBy() method to consolidate all aggregations into a single operation, thereby reducing the number of stages and data shuffling.
B
Apply the groupBy() method separately for each aggregation, allowing for individual operation control but potentially increasing data shuffling and stages.
C
Implement the pivot() method for simultaneous aggregations on multiple columns, suitable for specific scenarios but not universally applicable.
D
Employ the rollup() method for hierarchical aggregations, which may not be optimal for all aggregation needs due to its specific use case.