
You are designing a data processing pipeline in Azure Databricks to handle large-scale data analytics. The pipeline must efficiently process data of varying sizes, from gigabytes to terabytes, while optimizing for both cost and performance. The solution must also ensure that data processing can scale dynamically with the workload, without manual intervention. Given these requirements, which of the following strategies is BEST for ensuring data scalability and performance optimization in Spark? Choose the single best option.
A
Implement the 'repartition' function to manually adjust the number of partitions based on the data size, which requires upfront knowledge of the data volume.
B
Use the 'coalesce' function to reduce the number of partitions after filtering operations, which may lead to underutilization of resources for larger datasets.
C
Apply the 'cache' function to store all intermediate results in memory, which could lead to increased costs due to high memory usage for large datasets.
D
Utilize the 'persist' function with an appropriate storage level (MEMORY_AND_DISK) to store intermediate results, allowing Spark to manage data placement dynamically based on the workload and available resources.
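
For reference, the sketch below contrasts the four APIs named in the options, in PySpark. It is a minimal illustration rather than a production pipeline: the input path /mnt/data/events, the status column, and the partition counts are hypothetical placeholders.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("scaling-sketch").getOrCreate()

# Hypothetical input path; substitute your own dataset.
df = spark.read.parquet("/mnt/data/events")

# Option A: repartition() performs a full shuffle to a partition count you
# pick up front, which presumes advance knowledge of the data volume.
wide = df.repartition(200)

# Option B: coalesce() merges partitions without a shuffle; useful after a
# selective filter, but too few partitions can underuse a large cluster.
narrow = df.filter(F.col("status") == "active").coalesce(8)

# Option C: cache() pins intermediate results for reuse; at terabyte scale
# this can drive up memory pressure and cost.
cached = df.cache()

# Option D: persist() with MEMORY_AND_DISK keeps hot partitions in memory
# and lets Spark spill the remainder to disk, so the job degrades
# gracefully as the data outgrows cluster memory.
persisted = df.persist(StorageLevel.MEMORY_AND_DISK)

# Persistence is lazy; trigger an action so the data is actually stored.
persisted.count()
```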
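This is why D is typically the best fit for workloads that swing from gigabytes to terabytes: MEMORY_AND_DISK serves small datasets entirely from memory at full speed, while large datasets spill to disk instead of failing with out-of-memory errors, and it pairs naturally with Databricks cluster autoscaling so no manual partition tuning is required.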