When optimizing a Spark DataFrame operation for large-scale aggregations on a dataset with a billion rows, which strategy would you employ to reduce processing time and enhance resource utilization?
A. Utilizing columnar storage formats like Parquet to improve scan efficiency during aggregation
B. Applying the coalesce method to reduce the number of partitions before aggregation
C. Implementing custom partitioning to ensure even distribution of data across nodes before aggregation
D. Leveraging broadcast variables to minimize data shuffling during join operations
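The options above map to concrete DataFrame API calls. The following is a minimal PySpark sketch of how each technique would appear in a pipeline; the paths, column names ("user_id", "amount"), and partition counts are illustrative assumptions, not values taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregation-tuning").getOrCreate()

# A: columnar storage -- Parquet lets Spark read only the columns the
# aggregation needs, so the scan touches far less data than a row format.
events = spark.read.parquet("s3://example-bucket/events/")  # hypothetical path

# C: custom partitioning -- repartitioning on the grouping key before the
# aggregation spreads the data evenly across nodes so shuffle tasks are
# roughly the same size. 400 is an assumed tuning value.
partitioned = events.repartition(400, "user_id")

totals = partitioned.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

# D: broadcast join -- if a lookup table is small, broadcasting it avoids
# shuffling the billion-row side of the join.
users = spark.read.parquet("s3://example-bucket/users/")  # small table (assumed)
enriched = totals.join(F.broadcast(users), on="user_id", how="left")

# B: coalesce -- reducing partitions helps when writing a modest result,
# but applying it *before* a large aggregation would concentrate work on
# fewer cores instead of balancing it.
enriched.coalesce(16).write.mode("overwrite").parquet("s3://example-bucket/totals/")
```

In practice, option B is the trade-off to watch: coalesce before a heavy aggregation shrinks parallelism, whereas repartitioning on the grouping key (option C) is what actually balances the work across the cluster.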