
You are processing a large dataset with Spark Structured Streaming and observe that query performance is degraded by over-partitioning. The dataset is expected to grow significantly over time, and your solution must remain cost-effective, scalable, and compliant with data governance policies. Which of the following strategies BEST optimizes query performance by addressing the issues caused by over-partitioning, while also accounting for the future growth of the dataset? Choose one option.
A
Increase the number of input partitions to maximize parallelism and potentially improve query performance, without considering the impact on resource consumption.
B
Decrease the number of input partitions to reduce resource consumption and improve query performance by finding an optimal balance between parallelism and resource usage.
C
Implement file concatenation or bucketing strategies to reduce the number of small files, which indirectly addresses the symptoms of over-partitioning but not the root cause.
D
Apply partition pruning techniques to selectively read only the necessary partitions, which improves query performance but does not directly reduce the overhead of over-partitioning.
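The rebalancing described in option B is typically done in Spark by lowering `spark.sql.shuffle.partitions` or calling `coalesce()` on a DataFrame. As a minimal, framework-free sketch of the sizing logic behind that choice, the hypothetical helper below (not part of any Spark API) picks a partition count from the dataset size, assuming a commonly cited target of roughly 128 MB per partition:

```python
import math

def recommended_partitions(total_bytes: int,
                           target_partition_bytes: int = 128 * 1024 * 1024,
                           min_partitions: int = 1) -> int:
    """Suggest a partition count that balances parallelism against
    per-partition scheduling and file-handling overhead.

    A count far above this estimate means many tiny partitions
    (over-partitioning); far below it means too little parallelism.
    """
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))

# Example: a 10 GiB dataset maps to 80 partitions of ~128 MB each.
print(recommended_partitions(10 * 1024**3))  # → 80
```

Because the estimate is driven by total data size, it naturally scales the partition count upward as the dataset grows, rather than fixing a number that becomes wrong later.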