Databricks Certified Data Engineer - Professional

In the context of optimizing Spark queries on Azure Databricks, consider a scenario where a data engineer is dealing with a large number of small files stored in Azure Blob Storage. The engineer notices that queries are running slower than expected. Which of the following best explains the underlying issue and suggests the most effective mitigation strategy? Choose the single best option.

A. The number of files has no meaningful effect on query performance; the slowdown must be caused by something else, so no change to the file layout is needed.

B. A large number of small files creates scanning and I/O overhead that degrades query performance; consolidating the files into larger ones, for example with Delta Lake's compaction features, is an effective mitigation.

C. The data is insufficiently partitioned; splitting it into many more, smaller partitions (over-partitioning) will improve performance.

D. The data is spread across too many partitions; reducing the number of partitions (under-partitioning) will improve performance.

Explanation:

The correct answer is B because it correctly identifies the scanning and I/O overhead caused by a large number of small files as a primary cause of degraded query performance, and it proposes a practical mitigation: consolidating the small files into larger ones, for example with Delta Lake's OPTIMIZE command (file compaction) or auto-compaction. Option A is incorrect because it ignores the extra I/O operations and per-file scheduling overhead that small files introduce. Option C is incorrect because over-partitioning multiplies that overhead and consumes more resources rather than improving performance. Option D is incorrect because under-partitioning reduces parallelism, which lengthens query execution times rather than shortening them.
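The compaction idea behind option B can be illustrated outside of Spark with a plain-Python sketch that merges many small files into a few larger ones. The function and file names here are hypothetical, chosen only for illustration; on Databricks the equivalent operation is Delta Lake's OPTIMIZE command or auto-compaction, not hand-rolled code like this.

```python
import os

def consolidate_small_files(src_dir, dst_dir, target_size_bytes):
    """Merge many small files into fewer larger ones (generic compaction sketch).

    Reads every file in src_dir, accumulates their bytes into batches of at
    least target_size_bytes, and writes each batch out as one larger file.
    Returns the number of consolidated files written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    batch, batch_size, out_index = [], 0, 0
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), "rb") as f:
            data = f.read()
        batch.append(data)
        batch_size += len(data)
        # Flush the batch once it reaches the target output-file size.
        if batch_size >= target_size_bytes:
            _flush(batch, dst_dir, out_index)
            out_index += 1
            batch, batch_size = [], 0
    # Write any remaining data as a final, smaller file.
    if batch:
        _flush(batch, dst_dir, out_index)
        out_index += 1
    return out_index

def _flush(batch, dst_dir, index):
    with open(os.path.join(dst_dir, f"part-{index:05d}.bin"), "wb") as out:
        for chunk in batch:
            out.write(chunk)
```

On a real Delta table the analogous step is `OPTIMIZE table_name`, which rewrites many small Parquet files into fewer, larger ones without changing the table's contents, so subsequent queries open far fewer files.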