
Answer-first summary for fast verification
Answer: The scanning overhead from reading a large number of small files leads to excessive I/O operations, significantly degrading query performance. The most effective mitigation strategy is to consolidate small files into larger ones or use Delta Lake's optimized file management features.
The correct answer is B. Reading a large number of small files incurs per-file overhead: each file must be listed, opened, and have its metadata read, and each small split needs its own task, so the query spends more time on I/O bookkeeping than on processing data. Consolidating the small files into fewer, larger ones, or using Delta Lake's file management features such as OPTIMIZE and auto compaction, addresses that cause directly. Option A is incorrect because it ignores this overhead: more input partitions do not help when each one carries very little data. Option C is incorrect because over-partitioning adds scheduling overhead and resource consumption rather than improving performance. Option D is incorrect because under-partitioning reduces parallelism, which tends to lengthen query execution rather than shorten it.
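To make the consolidation idea concrete, here is a minimal sketch (not part of the original question) of how one might size a compaction job. The function name target_file_count, the 128 MiB target size, and the example figures of 50,000 files at 200 KiB each are all illustrative assumptions; a suitable target size depends on the workload.

```python
import math

# Assumed target size per output file; 128 MiB is an illustrative choice,
# not a value taken from the question.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes: int,
                      target_file_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Return how many output files are needed so each is about target_file_bytes.

    Hypothetical helper for planning a compaction; always returns at least 1.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical dataset: 50,000 small files of 200 KiB each.
small_file_count = 50_000
avg_file_bytes = 200 * 1024
total_bytes = small_file_count * avg_file_bytes

compacted_count = target_file_count(total_bytes)
print(compacted_count)  # 77
```

Under these assumptions, 50,000 file opens shrink to 77. In PySpark, a count like this could be passed to repartition before rewriting the data, for example spark.read.parquet(src).repartition(n).write.parquet(dst), where src and dst are placeholder paths. On a Delta table, running OPTIMIZE compacts small files without manual sizing.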
Author: LeetQuiz Editorial Team
In the context of optimizing Spark queries on Azure Databricks, consider a scenario where a data engineer is dealing with a large number of small files stored in Azure Blob Storage. The engineer notices that the queries are running slower than expected. Which of the following best explains the underlying issue and suggests the most effective mitigation strategy? Choose the single best option.
A
Small files increase the number of input partitions, which inherently improves query performance by maximizing parallelism without any negative impact.
B
The scanning overhead from reading a large number of small files leads to excessive I/O operations, significantly degrading query performance. The most effective mitigation strategy is to consolidate small files into larger ones or use Delta Lake's optimized file management features.
C
Over-partitioning the data increases the number of input partitions, which improves query performance by ensuring that each task is lightweight and executes quickly.
D
Reducing the number of input partitions by under-partitioning the data will decrease parallelism but improve query performance by minimizing I/O operations.