
**Answer: B. Parquet**
## Analysis of Output Format Requirements

When selecting a Stream Analytics output format for this scenario, we must consider three key requirements:

1. **Compatibility with both Azure Databricks and PolyBase** - The format must be supported by both query engines
2. **Fast query performance** - The format should enable efficient data retrieval
3. **Data type preservation** - The format must maintain original data types without loss

### Evaluation of Each Format:

**A: JSON**
- ❌ **Not optimal for PolyBase** - PolyBase has limited support for JSON and requires additional configuration
- ❌ **Performance concerns** - JSON is text-based and requires parsing, which can slow down queries
- ⚠️ **Data type preservation** - JSON can preserve types but requires explicit schema definition

**B: Parquet**
- ✅ **Excellent PolyBase support** - PolyBase natively supports the Parquet format
- ✅ **Full Databricks compatibility** - Databricks has robust Parquet support
- ✅ **Superior performance** - Columnar storage enables fast querying and efficient compression
- ✅ **Strong data type preservation** - Parquet maintains schema and data types in its metadata

**C: CSV**
- ❌ **Poor data type preservation** - CSV treats all data as strings, losing original data types
- ⚠️ **Basic PolyBase support** - Supported, but requires explicit schema definition
- ❌ **Performance limitations** - Row-based format is less efficient for analytical queries

**D: Avro**
- ❌ **Limited PolyBase support** - PolyBase does not natively support the Avro format
- ✅ **Good Databricks compatibility** - Databricks supports Avro
- ✅ **Excellent data type preservation** - Avro has strong schema evolution capabilities
- ⚠️ **Performance considerations** - Row-based format may not be as fast as columnar formats

### Why Parquet is the Optimal Choice:

**Cross-Platform Compatibility**: Parquet is the only format that provides excellent native support for both PolyBase and Databricks without requiring additional configuration or schema mapping.

**Performance Advantages**: As a columnar storage format, Parquet enables:
- Predicate pushdown for faster filtering
- Column pruning to read only required data
- Efficient compression, reducing storage costs
- Better parallel processing capabilities

**Data Integrity**: Parquet files contain embedded schema information that preserves data types and ensures consistent interpretation across different query engines.

**Stream Analytics Integration**: Azure Stream Analytics can efficiently output data in Parquet format to Azure Data Lake Storage, making it a seamless choice for this data pipeline architecture.

The combination of broad compatibility, superior query performance, and reliable data type preservation makes Parquet the clear recommendation for this scenario.
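The type-loss argument against CSV (and the parsing cost of JSON) can be illustrated with a small standard-library sketch. This is not Azure-specific code; the record below is an invented example of a social-media event, and Parquet itself is omitted because it needs a third-party library — the point is only that a CSV round-trip turns every field into a string, whereas a schema-carrying format would not:

```python
import csv
import io
import json

# A typed record, as a streaming job might produce it (field names are invented).
row = {"user_id": 42, "score": 9.5, "verified": True, "handle": "@example"}

# --- CSV round-trip: every field comes back as a string ---------------------
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=row.keys())
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_row = next(csv.DictReader(buf))
# The int, float, and bool are all strings now; the reader must re-infer types.
assert all(isinstance(v, str) for v in csv_row.values())

# --- JSON round-trip: primitive types survive, but only via text parsing ----
json_row = json.loads(json.dumps(row))
assert json_row["user_id"] == 42
assert json_row["verified"] is True
```

Parquet sidesteps both problems by storing the schema (column names and types) in the file footer, so Databricks and PolyBase read identical types without any inference step.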
Author: LeetQuiz Editorial Team
You are designing a solution to ingest streaming social media data with Azure Stream Analytics. The data will be stored in Azure Data Lake Storage and later queried by both Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics output format that minimizes query errors for both Databricks and PolyBase. The solution must prioritize fast query performance and preserve data type information.
What output format should you recommend?
A. JSON
B. Parquet
C. CSV
D. Avro