
You are processing a large dataset with Spark Structured Streaming and notice that some tasks take significantly longer to complete than others, making the pipeline inefficient. On investigation, you find that the cause is data skew: some partitions contain far more data than others. Given the need to optimize query performance while staying within cost constraints and remaining scalable, which of the following approaches would be the MOST effective in addressing the data skew? Choose the best of the four options.
A. Increase the number of input partitions to improve parallelism, without considering the distribution of data across these partitions.
B. Decrease the number of input partitions to reduce resource consumption, assuming that fewer partitions will naturally balance the data load.
C. Use data repartitioning techniques to redistribute the data evenly across partitions, ensuring a balanced workload and improved query performance.
D. Implement partition pruning to selectively read only the partitions that contain the most data, thereby reducing the overall processing time.
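
The redistribution idea described in option C can be illustrated with a small sketch in plain Python (not Spark code). It simulates hash partitioning with a deterministic CRC32 hash and shows how "salting" a hot key spreads its records over several partitions. The key names, the partition count, the salt count, and the helpers `partition_of`, `partition_sizes`, and `salt_keys` are all hypothetical, chosen only for illustration. In Spark, a comparable effect is usually sought by repartitioning on a salted column, for example with `DataFrame.repartition`.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for a hash partitioner (an assumption,
    # not Spark's actual hash function).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    # Count how many records land in each partition.
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, num_salts):
    # Append a round-robin salt so one hot key becomes num_salts distinct keys.
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


NUM_PARTITIONS = 8  # hypothetical partition count
NUM_SALTS = 64      # hypothetical salt count

# Hypothetical skewed input: 90% of the records share the key "hot".
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

skewed = partition_sizes(keys, NUM_PARTITIONS)
salted = partition_sizes(salt_keys(keys, NUM_SALTS), NUM_PARTITIONS)

print("largest partition before salting:", max(skewed))
print("largest partition after salting: ", max(salted))
```

Before salting, every "hot" record hashes to the same partition, so one task does most of the work. After salting, the same records are spread across partitions and the largest partition shrinks. The trade-off is that any aggregation by the original key then needs a second step to combine the per-salt partial results.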