
**Answer-first summary for fast verification**

**Answer: Parquet (Option B)**
## Detailed Analysis

Based on the requirements specified in the question, **Parquet (Option B)** is the optimal file format choice for this Azure data engineering scenario.

### Why Parquet Meets All Requirements:

**1. Column and Row Skipping Capability:**

- Parquet is a columnar storage format that enables efficient column pruning. When Pool1 queries the data, it can read only the specific columns needed for the query, skipping unnecessary columns entirely.
- Parquet supports predicate pushdown and row-group skipping through its internal structure, allowing the query engine to skip entire row groups that don't contain relevant data based on query filters.

**2. Automatic Column Statistics Creation:**

- Azure Synapse Analytics dedicated SQL pool automatically creates and updates statistics for Parquet files when queried through external tables or PolyBase.
- The query optimizer uses these statistics to generate efficient execution plans without manual intervention.

**3. File Size Minimization:**

- Parquet provides excellent compression through column-wise encoding schemes and compression algorithms like Snappy, GZIP, or LZ4.
- Columnar storage inherently reduces the storage footprint, since similar values within a column compress more effectively than the mixed data types in row-based formats.
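The row-group skipping described above can be sketched in pure Python. This is a simplified model of how a Parquet reader uses per-column min/max footer statistics, not Pool1's actual implementation; the `RowGroup` class and `scan` helper are invented for illustration.

```python
# Simplified sketch of Parquet-style row-group skipping (illustrative only;
# real Parquet readers perform this pruning internally using footer metadata).

class RowGroup:
    def __init__(self, rows):
        self.rows = rows  # list of dicts: column name -> value
        # Parquet stores per-column min/max statistics in the file footer.
        self.stats = {
            col: (min(r[col] for r in rows), max(r[col] for r in rows))
            for col in rows[0]
        }

def scan(row_groups, column, lo, hi):
    """Return values in [lo, hi], skipping row groups whose min/max
    statistics prove they contain no matching rows."""
    matches, groups_read = [], 0
    for rg in row_groups:
        g_min, g_max = rg.stats[column]
        if g_max < lo or g_min > hi:
            continue  # whole row group skipped without reading any row data
        groups_read += 1
        matches.extend(r[column] for r in rg.rows if lo <= r[column] <= hi)
    return matches, groups_read

# Three row groups sorted by "year": only the middle one overlaps 2019-2020.
groups = [
    RowGroup([{"year": y} for y in (2015, 2016, 2017)]),
    RowGroup([{"year": y} for y in (2018, 2019, 2020)]),
    RowGroup([{"year": y} for y in (2021, 2022, 2023)]),
]
values, read = scan(groups, "year", 2019, 2020)
print(values, read)  # [2019, 2020] 1 -> two of three row groups were skipped
```

Because the statistics alone rule out two of the three row groups, their row data is never touched, which is exactly the I/O saving Pool1 gets from Parquet's row-group metadata.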
### Why Other Formats Are Less Suitable:

**CSV (Option D):**

- Row-based format that requires reading entire rows even when only a few columns are needed
- No automatic statistics creation in dedicated SQL pool for external CSV files
- Poor compression compared to columnar formats
- Larger file sizes due to lack of efficient encoding

**JSON (Option A):**

- Semi-structured format that requires parsing entire documents
- No built-in column skipping capabilities
- Larger storage footprint due to repeated field names and text-based nature
- Limited automatic statistics support

**Avro (Option C):**

- Row-based binary format with schema evolution capabilities
- Better compression than CSV/JSON, but still less efficient than columnar formats for analytical queries
- Limited column skipping capabilities compared to Parquet
- Not optimized for the column pruning requirements specified

### Best Practice Considerations:

Parquet is the industry standard for analytical workloads in Azure data platforms due to its performance characteristics, compression efficiency, and seamless integration with Azure Synapse Analytics query optimization features.
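The compression argument above can be made concrete with a small experiment. This sketch uses `zlib` as a stand-in for Parquet's codecs (Snappy, GZIP, LZ4) and compares the same invented records serialized row-wise (CSV-like) versus column-wise (Parquet-like); the column names and data are hypothetical.

```python
import zlib

# Hypothetical records: a unique id, a low-cardinality category,
# and a constant region column (invented data for illustration).
rows = [(i, f"cat{i % 5}", "west-europe") for i in range(5000)]

# Row-oriented layout (CSV-like): one line per record, columns interleaved.
row_wise = "\n".join(f"{i},{c},{r}" for i, c, r in rows).encode()

# Column-oriented layout (Parquet-like): each column stored contiguously,
# so similar values sit next to each other.
col_wise = (
    "\n".join(str(i) for i, _, _ in rows) + "\n"
    + "\n".join(c for _, c, _ in rows) + "\n"
    + "\n".join(r for _, _, r in rows)
).encode()

row_size = len(zlib.compress(row_wise, 9))
col_size = len(zlib.compress(col_wise, 9))
print(f"row-wise compressed: {row_size} bytes, column-wise: {col_size} bytes")
assert col_size < row_size  # columnar grouping compresses better
```

The repetitive category and region columns collapse to almost nothing once stored contiguously, which is the same effect that lets Parquet minimize file sizes relative to CSV or JSON.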
Author: LeetQuiz Editorial Team
### Question

You have an Azure subscription containing an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool named Pool1.

You need to store data in storage1 that will be read by Pool1. The solution must meet the following requirements:

- Enable Pool1 to skip columns and rows that are unnecessary in a query.
- Automatically create column statistics.
- Minimize the size of files.

Which type of file should you use?

A. JSON
B. Parquet
C. Avro
D. CSV