When optimizing Apache Spark jobs for large-scale data processing in Databricks, you encounter performance bottlenecks caused by data shuffling. Which strategy would you use to reduce shuffling and improve job performance?
A
Configuring the Spark session to use more efficient serialization formats like Parquet for intermediate data storage
B
Implementing a custom partitioner to ensure data is evenly distributed across partitions before shuffling
C
Utilizing broadcast variables to minimize data transfer across nodes during join operations
D
Increasing the size of the Spark executors and the number of cores to speed up the data processing
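The shuffle-reducing strategy in option C can be illustrated with a short, hedged sketch: broadcasting a small dimension table lets Spark perform a broadcast hash join, so the large table is never shuffled across the cluster. The table names, column name, and paths below (transactions, customer_lookup, customer_id, /mnt/data/...) are hypothetical and stand in for whatever datasets your job actually joins.

```python
# Minimal sketch, assuming a large fact table joined to a small lookup table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Large fact table: stays where it is partitioned; no shuffle needed for the join.
transactions = spark.read.parquet("/mnt/data/transactions")        # hypothetical path

# Small dimension table: assumed to fit comfortably in executor memory.
customer_lookup = spark.read.parquet("/mnt/data/customer_lookup")  # hypothetical path

# broadcast() hints Spark to ship the small table to every executor,
# turning a shuffle-heavy sort-merge join into a broadcast hash join.
enriched = transactions.join(
    broadcast(customer_lookup),
    on="customer_id",
    how="left",
)

enriched.explain()  # the physical plan should show BroadcastHashJoin rather than SortMergeJoin
```

Checking the output of explain() is a quick way to confirm the broadcast hint took effect; Spark also applies this automatically when the small table falls under the spark.sql.autoBroadcastJoinThreshold setting.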