Databricks Certified Data Engineer - Professional


In the context of optimizing a data pipeline that ingests large volumes of data into Delta Lake on Microsoft Azure, you are tasked with determining the most appropriate file sizes to enhance both ingestion performance and query efficiency. Considering the need to balance cost, compliance, and scalability, which of the following factors should you prioritize when deciding on the file sizes, and what would be the most reasonable range for these file sizes to ensure optimal performance? Choose the best option.

A. Prioritize total data volume and storage costs alone when sizing files.

B. Prioritize the nature of the queries, the available storage, and the data's partitioning scheme, targeting file sizes between 64 MB and 1 GB.

C. Apply a single fixed file size to all tables, regardless of data or query characteristics.

D. Prioritize available storage capacity alone, without regard to query performance or scalability.

Explanation:

The correct answer is B. It weighs the critical factors that affect both ingestion and query performance: the nature of the queries, the available storage, and the data's partitioning scheme. The recommended range of 64 MB to 1 GB balances write throughput during ingestion against efficient file pruning and scan parallelism at query time; files that are too small inflate metadata and task-scheduling overhead, while files that are too large reduce parallelism and data skipping. Option A is too narrow, focusing only on data volume and storage costs while ignoring query performance. Option C is inflexible, applying a one-size-fits-all file size that may not suit every data or query profile. Option D focuses solely on storage capacity and overlooks query performance and scalability.
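
To make the sizing guidance concrete, below is a minimal PySpark sketch of two levers Databricks exposes for steering Delta file sizes toward that range. The table name `sales_bronze` and the column `event_date` are hypothetical placeholders; `delta.targetFileSize` and `OPTIMIZE ... ZORDER BY` are standard Databricks Delta Lake features, though exact compaction behavior depends on your runtime version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hint a target file size for the table (hypothetical name `sales_bronze`).
# Auto compaction and OPTIMIZE will aim for files near this size on
# future writes and rewrites.
spark.sql("""
    ALTER TABLE sales_bronze
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")

# Compact the small files produced by high-frequency ingestion.
# ZORDER BY is optional: it co-locates rows on a commonly filtered
# column so larger files still prune well at query time.
spark.sql("OPTIMIZE sales_bronze ZORDER BY (event_date)")
```

A target in the middle of the 64 MB to 1 GB range is a common starting point: write-heavy streaming tables often benefit from smaller targets, while large, scan-heavy tables tolerate sizes closer to 1 GB.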