
In the context of optimizing Spark queries on Azure Databricks, consider a scenario where a data engineer is dealing with a large number of small files stored in Azure Blob Storage. The engineer notices that the queries are running slower than expected. Which of the following best explains the underlying issue and suggests the most effective mitigation strategy? Choose the single best option.
A
Small files increase the number of input partitions, which inherently improves query performance by maximizing parallelism without any negative impact.
B
Each small file incurs its own listing, open, and scan overhead, so reading a large number of them results in excessive I/O operations and significantly degrades query performance. The most effective mitigation strategy is to consolidate the small files into larger ones or to use Delta Lake's optimized file management features, such as the OPTIMIZE command.
C
Over-partitioning the data increases the number of input partitions, which improves query performance by ensuring that each task is lightweight and executes quickly.
D
Reducing the number of input partitions by under-partitioning the data will decrease parallelism but improve query performance by minimizing I/O operations.
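
For reference, a minimal PySpark sketch of the mitigation described in option B. The storage paths, table locations, and the target partition count are hypothetical, and the OPTIMIZE statement assumes a Databricks runtime (or another Spark environment with Delta Lake installed) where that Delta Lake SQL command is available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Manual compaction: read the many small Parquet files and rewrite them
# as a smaller number of larger files (paths are hypothetical).
df = spark.read.parquet("abfss://data@myaccount.dfs.core.windows.net/raw/events/")
(df.repartition(32)  # target file count sized to the total data volume
   .write
   .mode("overwrite")
   .parquet("abfss://data@myaccount.dfs.core.windows.net/compacted/events/"))

# Delta Lake alternative: OPTIMIZE bin-packs small files into larger ones
# (roughly 1 GB each by default on Databricks).
spark.sql("OPTIMIZE delta.`abfss://data@myaccount.dfs.core.windows.net/delta/events`")

For Delta tables, OPTIMIZE is generally the preferable approach because it compacts files transactionally and only rewrites the small files, rather than overwriting the entire dataset.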