
**Answer:** Parquet
## Explanation

Delta Lake tables primarily store data in **Parquet format**. Here's why:

1. **Parquet is the underlying storage format**: Delta Lake uses Parquet files as the base storage format for data. The "Delta" aspect refers to the transaction log and metadata layer that sits on top of the Parquet files, not to the underlying data storage format.

2. **Delta Lake architecture**: Delta Lake consists of:
   - **Parquet files**: store the actual data
   - **Transaction log**: tracks all changes and provides ACID transactions
   - **Metadata**: contains schema information, statistics, and other metadata

3. **Benefits of Parquet**:
   - Columnar storage format
   - Efficient compression
   - Schema evolution support
   - Predicate pushdown capabilities
   - Compatible with many data processing frameworks

4. **Why not the other options**:
   - **A. Delta**: misleading; "Delta" refers to the table format/lakehouse layer as a whole, not the underlying file format
   - **B. CSV**: not used as primary storage due to lack of schema enforcement and poor performance
   - **D. JSON**: not used as primary storage due to verbosity and lack of schema enforcement
   - **E. Proprietary format**: Delta Lake is open source and uses standard Parquet files, not a proprietary Databricks-specific format

**Key takeaway**: Delta Lake enhances Parquet files with transaction capabilities and metadata management, but the actual data is stored in Parquet format.
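The split between Parquet data files and the JSON transaction log can be sketched with a simplified mock of a Delta table's directory layout. This is an illustration only: the file names and the commit-entry schema below are hypothetical simplifications (real Delta data files have UUID-based part names, and real log entries carry many more fields), but the overall shape, plain `.parquet` files next to a `_delta_log/` directory of JSON commits, matches how Delta Lake stores data.

```python
import json
import tempfile
from pathlib import Path

# Build a toy Delta-style table directory (simplified, for illustration):
# the data lives in ordinary Parquet files; _delta_log/ holds JSON commit
# files that form the transaction log providing ACID guarantees.
root = Path(tempfile.mkdtemp()) / "events"
(root / "_delta_log").mkdir(parents=True)

# Hypothetical data file names; real ones are UUID-based part files.
(root / "part-00000.snappy.parquet").write_bytes(b"")
(root / "part-00001.snappy.parquet").write_bytes(b"")

# A minimal commit entry referencing the data files (schema simplified).
commit = [{"add": {"path": "part-00000.snappy.parquet"}},
          {"add": {"path": "part-00001.snappy.parquet"}}]
(root / "_delta_log" / "00000000000000000000.json").write_text(
    "\n".join(json.dumps(action) for action in commit))

data_files = sorted(p.name for p in root.glob("*.parquet"))
log_files = sorted(p.name for p in (root / "_delta_log").glob("*.json"))
print(data_files)  # the actual data: plain Parquet files
print(log_files)   # the "Delta" part: the JSON transaction log
```

Because the data files are standard Parquet, any Parquet-aware reader can open them directly; it is the `_delta_log` directory that a Delta-aware engine consults to know which files make up the current table version.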
Author: Keng Suppaseth