
## Answer

Use Snappy compression for the files (Option A).
## Detailed Explanation

### Understanding the Question Requirements

The question asks you to **minimize storage costs** for Parquet files stored in Azure Data Lake Storage Gen2. The key constraint is that the solution must remain compatible with the Azure Synapse Analytics serverless SQL pool that consumes the files.

### Analysis of Each Option

**A: Use Snappy compression for the files**

- **Optimal choice**: Snappy compresses Parquet files effectively while keeping decompression fast.
- **Storage impact**: Typically shrinks files by 20-50%, which directly lowers storage costs in Data Lake Storage Gen2.
- **Compatibility**: Fully supported by Azure Data Factory (when writing Parquet files) and by the Azure Synapse Analytics serverless SQL pool (when reading them).
- **Best practice**: Snappy is the default codec for Parquet in Azure services because it offers a strong balance between compression ratio and query performance.

**B: Use OPENROWSET to query the Parquet files**

- **Not suitable**: OPENROWSET is a querying method, not a storage optimization technique.
- **Storage impact**: None; it does not modify the underlying files.
- **Purpose**: Useful for ad hoc queries, but it does not address the core requirement of minimizing storage costs.

**C: Create an external table that contains a subset of columns from the Parquet files**

- **Misleading**: External tables are metadata definitions, not physical storage optimizations.
- **Storage impact**: Zero; the original Parquet files remain unchanged in Data Lake Storage.
- **Query benefit**: May improve query performance by reducing the data scanned, but it does not reduce storage costs.

**D: Store all data as string in the Parquet files**

- **Counterproductive**: Storing everything as strings typically increases file sizes compared with using appropriate data types, because Parquet encodes typed columns (integers, dates, decimals) far more compactly than their string representations.
- **Storage impact**: Would likely **increase** storage costs rather than minimize them.
- **Performance impact**: Also hurts query performance due to type-conversion overhead.

### Why Option A Is the Correct Answer

1. **Direct storage reduction**: Compression shrinks the physical footprint in Data Lake Storage Gen2.
2. **Cost-effective**: Smaller files mean lower storage costs without compromising query performance.
3. **Azure best practice**: Microsoft recommends compression for cost optimization in data lake scenarios.
4. **Synapse compatibility**: The serverless SQL pool reads Snappy-compressed Parquet files efficiently, with no performance penalty.

### Key Consideration

Although Snappy is already the default for Parquet in Azure Data Factory, the question asks what you "should do" to minimize storage costs. Applying Snappy compression, whether as the default or explicitly configured, is the fundamental way to achieve that goal.
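As a concrete illustration, Snappy can be set explicitly on the Parquet dataset that an Azure Data Factory copy activity writes to Data Lake Storage Gen2. This is a minimal sketch; the dataset, linked-service, file-system, and folder names are hypothetical, and `compressionCodec` is the property that selects the codec:

```json
{
  "name": "ParquetOutputDataset",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "AdlsGen2LinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "batch-data",
        "folderPath": "sales/parquet"
      },
      "compressionCodec": "snappy"
    }
  }
}
```

No change is needed on the consumption side: the serverless SQL pool detects the codec from the Parquet file metadata, so queries with `FORMAT = 'PARQUET'` read Snappy-compressed files transparently.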
Author: LeetQuiz Editorial Team
You are implementing a batch dataset in Parquet format. Data files will be produced using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool. You need to minimize storage costs for the solution. What should you do?
A. Use Snappy compression for the files.
B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as string in the Parquet files.