
Consider a Delta Lake table that is heavily impacted by the small-file problem (many tiny files, scanning overhead, over-partitioning). Describe in detail how you would redesign the data processing pipeline to mitigate these issues, focusing on the use of Change Data Feed (CDF) and proper partitioning strategies. Provide a code snippet illustrating the key changes (a sketch follows the options below).
A
Continue using the current pipeline; small files do not affect performance.
B
Redesign the pipeline to use larger files and fewer partitions, leveraging CDF for efficient data processing.
C
Ignore the small files; focus on increasing the data volume to improve performance.
D
Recompute the entire dataset periodically to avoid small files.
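The following is a minimal PySpark sketch of the approach described in option B. It assumes a hypothetical source table `events_raw` with CDF enabled, a hypothetical target table `events_curated` partitioned by a coarse date column, a hypothetical join key `event_id`, and a placeholder `last_processed_version` that a real pipeline would track in checkpoint or metadata state; all of these names are illustrative, not part of the original question.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Enable Change Data Feed on the (hypothetical) source table so downstream
# jobs can read only the rows that changed instead of rescanning everything.
spark.sql("""
    ALTER TABLE events_raw
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the incremental changes since the last processed version.
# In a real pipeline this version would be persisted in a checkpoint table.
last_processed_version = 12  # placeholder
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .table("events_raw")
    .filter(F.col("_change_type").isin("insert", "update_postimage", "delete"))
)

# Merge the changes into a target table partitioned by a coarse column
# (e.g. event_date) rather than a high-cardinality column, which avoids
# over-partitioning and the resulting flood of tiny files.
target = DeltaTable.forName(spark, "events_curated")
(
    target.alias("t")
    .merge(changes.alias("s"), "t.event_id = s.event_id")
    .whenMatchedDelete(condition="s._change_type = 'delete'")
    .whenMatchedUpdateAll(condition="s._change_type = 'update_postimage'")
    .whenNotMatchedInsertAll(condition="s._change_type != 'delete'")
    .execute()
)

# Periodically compact small files into larger ones and co-locate related
# data to cut scanning overhead on reads.
spark.sql("OPTIMIZE events_curated ZORDER BY (event_id)")
```

Note that a production job would typically deduplicate the CDF output to the latest change per key before merging, and would schedule the compaction step (or enable auto-compaction where the platform supports it) rather than run it inline with every batch.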