Databricks Certified Data Engineer - Professional

You are processing a large dataset with Spark Structured Streaming and notice that some tasks take significantly longer to complete than others, making the pipeline inefficient. Investigation shows that the cause is data skew: some partitions contain significantly more data than others. Given the need to optimize query performance while staying within cost constraints and remaining scalable, which of the following approaches would be the MOST effective in addressing the data skew? Choose the best option from the four provided.

Increase the number of input partitions to improve parallelism, without considering the distribution of data across these partitions.

Decrease the number of input partitions to reduce resource consumption, assuming that fewer partitions will naturally balance the data load.
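
Both options above change only the partition count, so it may help to see why the count alone does not rebalance a skewed key. The following is a minimal plain-Python sketch, not Spark code: it assumes a simple hash partitioner (`crc32` modulo the partition count, a stand-in for Spark's own hashing) and a hypothetical dataset in which one key, `hot_key`, holds 90% of the rows. The function name `partition_sizes` and all data are illustrative assumptions.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition under simple hash partitioning.

    crc32 is used as a deterministic stand-in for a real partitioner's hash.
    """
    counts = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed dataset: one hot key accounts for 9,000 of 10,000 rows.
keys = ["hot_key"] * 9000 + [f"key_{i}" for i in range(1000)]

for n in (4, 16, 64):
    sizes = partition_sizes(keys, n)
    # Every row with the same key hashes to the same partition, so the
    # largest partition always holds at least the 9,000 hot-key rows.
    print(f"partitions={n:3d} largest={max(sizes)} smallest={min(sizes)}")
```

Under these assumptions, the largest partition never drops below 9,000 rows whichever count is chosen, because rows sharing a key always land in the same partition. Only the small keys spread out as the count grows.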
