
Explanation:
Parquet (Columnar Format):
Avro (Row-based Format):
ORC (Columnar Format):
JSON (Text-based Format):
The scenario requires retrieving multiple rows of records in their entirety while minimizing both query execution time and data processing. Parquet is the best fit for several reasons:
Query Performance: Parquet stores min/max statistics for each row group, so Spark can push filters down to the reader and skip row groups that cannot contain matching records. Combined with efficient compression, this reduces I/O and shortens query execution time.
Data Processing Efficiency: Parquet stores values of the same type together, so encodings such as dictionary and run-length encoding work well and Spark has less data to read and decode.
Spark Integration: Apache Spark is heavily optimized for Parquet. It reads the schema from the file metadata instead of inferring it from the data, discovers partitions from the folder layout, and pushes predicates down to the file reader.
Storage Efficiency: Parquet typically achieves better compression ratios than row-based formats, which reduces storage costs and I/O overhead.
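The skipping behavior described above can be illustrated with a small sketch in plain Python. This is a toy model, not Parquet's actual implementation: the records, the `to_row_groups` and `scan` helpers, and the group size are all made up for illustration. It stores rows column by column in groups, keeps min/max statistics per group, skips groups that cannot match a filter, and rebuilds complete rows from the columns of the groups it does read.

```python
# Hypothetical event records standing in for captured events.
ROWS = [
    {"id": 1, "device": "a", "temp": 20.5},
    {"id": 2, "device": "b", "temp": 31.0},
    {"id": 3, "device": "a", "temp": 18.2},
    {"id": 4, "device": "c", "temp": 40.1},
]


def to_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column,
    keeping (min, max) statistics for every column in the group."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan(groups, column, lower_bound):
    """Return full rows where column > lower_bound, plus the number of
    row groups that actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        # Pushdown: if the group's max cannot exceed the bound, skip it
        # without touching any of its column values.
        if group["stats"][column][1] <= lower_bound:
            continue
        groups_read += 1
        names = list(group["columns"])
        # Rebuild complete rows by zipping the column vectors together.
        for values in zip(*(group["columns"][n] for n in names)):
            row = dict(zip(names, values))
            if row[column] > lower_bound:
                matches.append(row)
    return matches, groups_read


groups = to_row_groups(ROWS, group_size=2)
rows, groups_read = scan(groups, "temp", 35.0)
print(rows)         # [{'id': 4, 'device': 'c', 'temp': 40.1}]
print(groups_read)  # 1
```

Only one of the two groups is read, yet the matching record comes back as a complete row.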
Avro might seem the natural choice for reading entire rows, and it is the format Event Hubs Capture writes by default, so choosing Parquet assumes the captured data lands in account1 as Parquet. Spark can efficiently reconstruct entire rows from Parquet files while still benefiting from columnar optimizations, and for analytical workloads in Spark pools these performance advantages outweigh the benefits of a row-based format for full-row retrieval.
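As a rough sketch of the ingestion step in a notebook, the helpers below build an ADLS Gen2 path and load it with Spark's `DataFrameReader`. The function names `capture_path` and `load_events`, the container name `capture`, and the folder `hub1/` are assumptions for illustration; only `account1` comes from the scenario. The sketch also assumes the captured files are already in Parquet format.

```python
def capture_path(container: str, account: str, prefix: str) -> str:
    """Build an abfss:// URI for a folder in an ADLS Gen2 account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{prefix.strip('/')}"


def load_events(spark, path: str, fmt: str = "parquet"):
    """Load captured files into a DataFrame using the given Spark session."""
    return spark.read.format(fmt).load(path)


# Hypothetical container and folder names; adjust to the real layout.
path = capture_path("capture", "account1", "hub1/")
print(path)  # abfss://capture@account1.dfs.core.windows.net/hub1

# In a Synapse notebook the `spark` session is predefined, so the call
# would look like this:
# df = load_events(spark, path)
# df.show(5)
```

Passing a different `fmt` value switches the reader to another format, provided the corresponding data source is available in the Spark pool.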
You have an Azure Data Lake Storage Gen2 account named account1 and an Azure Event Hub named Hub1. Data is written to account1 using Event Hubs Capture.
You plan to query account1 using an Apache Spark pool in Azure Synapse Analytics.
You need to create a notebook to ingest the data from account1. The solution must meet the following requirements:
• Retrieve multiple rows of records in their entirety.
• Minimize query execution time.
• Minimize data processing.
Which data format should you use?
A. Parquet
B. Avro
C. ORC
D. JSON