
In a scenario where you are processing a large dataset with Spark Structured Streaming in a cost-sensitive environment, you notice that query performance is significantly degraded by the presence of numerous tiny files. These files result from frequent small writes to the storage system. Given the need to optimize query performance while minimizing costs and ensuring scalability, which of the following strategies would be MOST effective in addressing the problems caused by tiny files? Choose one option.
A. Increase the number of input partitions to improve parallelism and query performance, despite the potential increase in resource consumption.
B. Decrease the number of input partitions to reduce resource consumption, accepting a potential decrease in parallelism and query performance.
C. Implement file compaction (concatenation) or bucketing strategies to reduce the number of tiny files, thereby decreasing I/O operations and improving query performance without significantly increasing costs.
D. Apply partition pruning to selectively read only the required partitions, ignoring the issue of tiny files altogether.