
Describe a scenario where 'smalls' problems (tiny files, scanning overhead, over-partitioning) significantly impact Spark query performance. Analyze the performance issues in detail and propose a solution that uses repartitioning and coalescing to address them. Include a code snippet demonstrating the solution.
A
Small files do not impact query performance.
B
Over-partitioning always improves query performance.
C
Small files and over-partitioning can lead to increased scanning overhead and reduced performance. Repartitioning and coalescing can help optimize performance.
D
Query performance is solely dependent on the data size, not the file size.
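
The repartitioning and coalescing approach named in option C could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. The helper names `target_partition_count` and `compact`, the 128 MiB target file size, and the idea that the caller already knows the dataset's total size in bytes are all assumptions made for the example. The `spark` argument is expected to be an existing SparkSession.

```python
import math

# Assumed target size per output file (128 MiB is a common rule of thumb,
# not a value taken from the question).
DEFAULT_TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_partition_count(total_bytes, target_file_bytes=DEFAULT_TARGET_FILE_BYTES):
    """Return how many partitions keep each output file near the target size."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes must be > 0")
    # Always keep at least one partition, even for an empty dataset.
    return max(1, math.ceil(total_bytes / target_file_bytes))


def compact(spark, src_path, dst_path, total_bytes):
    """Rewrite a dataset made of many small Parquet files into fewer, larger ones.

    Hypothetical helper: `spark` is an existing SparkSession, and the paths
    and `total_bytes` are supplied by the caller.
    """
    df = spark.read.parquet(src_path)
    target = target_partition_count(total_bytes)
    current = df.rdd.getNumPartitions()

    if target < current:
        # coalesce merges existing partitions without a full shuffle,
        # which is usually cheaper when only reducing the partition count.
        df = df.coalesce(target)
    else:
        # repartition performs a full shuffle and evens out partition sizes;
        # it is needed when the partition count has to grow.
        df = df.repartition(target)

    df.write.mode("overwrite").parquet(dst_path)
    return target
```

With these assumptions, a 10 GiB dataset spread across thousands of tiny files would be rewritten as about 80 files, so later queries open far fewer files and schedule far fewer tasks. Because `coalesce` avoids a shuffle, it can leave partitions uneven. If the input is heavily skewed, `repartition` may be the better choice even when reducing the count.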