
You are tasked with optimizing the performance of a Spark Structured Streaming job that processes a large dataset. The job is degrading because its output accumulates numerous small files. Considering cost efficiency, compliance with data governance policies, and the need for scalability, which of the following is the BEST solution to this issue? Choose one option.
A
Implement dynamic partition discovery to automatically adjust the number of partitions based on the dataset size, aiming to improve parallelism and query performance.
B
Configure the streaming job to use a larger batch interval, reducing the frequency of file processing to minimize the impact of small files on performance.
C
Apply file compaction or bucketing strategies to reduce the number of small files, thereby decreasing I/O operations and improving query performance.
D
Enable speculative execution to automatically detect and mitigate the performance issues caused by small files by running additional tasks for slow-running partitions.