
You are processing a large dataset with Spark Structured Streaming and observe that query performance is degraded by over-partitioning. The dataset is expected to grow significantly over time, and your solution must remain cost-effective, scalable, and compliant with data governance policies. Which of the following strategies BEST optimizes query performance by addressing the issues caused by over-partitioning, while also accounting for the future growth of the dataset? Choose one option.
A
Increase the number of input partitions to maximize parallelism and potentially improve query performance, without considering the impact on resource consumption.
B
Decrease the number of input partitions to reduce resource consumption and improve query performance by finding an optimal balance between parallelism and resource usage.
C
Implement file concatenation or bucketing strategies to reduce the number of small files, which indirectly addresses the symptoms of over-partitioning but not the root cause.
D
Apply partition pruning techniques to selectively read only the necessary partitions, which improves query performance but does not directly reduce the overhead of over-partitioning.
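The rebalancing described in option B is typically done in Spark by lowering `spark.sql.shuffle.partitions` or calling `coalesce()` on a DataFrame. As a minimal, framework-free sketch of the sizing logic behind that choice, the hypothetical helper below (not part of any Spark API) picks a partition count from the dataset size, assuming a commonly cited target of roughly 128 MB per partition:

```python
import math

def recommended_partitions(total_bytes: int,
                           target_partition_bytes: int = 128 * 1024 * 1024,
                           min_partitions: int = 1) -> int:
    """Suggest a partition count that balances parallelism against
    per-partition scheduling and file-handling overhead.

    A count far above this estimate means many tiny partitions
    (over-partitioning); far below it means too little parallelism.
    """
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))

# Example: a 10 GiB dataset maps to 80 partitions of ~128 MB each.
print(recommended_partitions(10 * 1024**3))  # → 80
```

Because the estimate is driven by total data size, it naturally scales the partition count upward as the dataset grows, rather than fixing a number that becomes wrong later.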