
**Answer:** Parquet
## Explanation

Delta Lake tables primarily store data in **Parquet format**. Here's why:

1. **Parquet is the underlying storage format**: Delta Lake uses Parquet files as the base storage format for data. The "Delta" aspect refers to the transaction log and metadata layer that sits on top of the Parquet files, not to the underlying data storage format.

2. **Delta Lake architecture**: Delta Lake consists of:
   - **Parquet files**: store the actual data
   - **Transaction log**: tracks all changes and provides ACID transactions
   - **Metadata**: contains schema information, statistics, and other metadata

3. **Benefits of Parquet**:
   - Columnar storage format
   - Efficient compression
   - Schema evolution support
   - Predicate pushdown capabilities
   - Compatible with many data processing frameworks

4. **Why not the other options**:
   - **A. Delta**: misleading; "Delta" refers to the table format/lakehouse layer as a whole, not the underlying file format
   - **B. CSV**: not used as primary storage due to lack of schema enforcement and poor performance
   - **D. JSON**: not used as primary storage due to verbosity and lack of schema enforcement
   - **E. Proprietary format**: Delta Lake is open source and uses standard Parquet files, not a proprietary Databricks-specific format

**Key takeaway**: Delta Lake enhances Parquet files with transaction capabilities and metadata management, but the actual data is stored in Parquet format.
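The split between Parquet data files and the JSON transaction log can be sketched with a simplified mock of a Delta table's directory layout. This is an illustration only: the file names and the commit-entry schema below are hypothetical simplifications (real Delta data files have UUID-based part names, and real log entries carry many more fields), but the overall shape, plain `.parquet` files next to a `_delta_log/` directory of JSON commits, matches how Delta Lake stores data.

```python
import json
import tempfile
from pathlib import Path

# Build a toy Delta-style table directory (simplified, for illustration):
# the data lives in ordinary Parquet files; _delta_log/ holds JSON commit
# files that form the transaction log providing ACID guarantees.
root = Path(tempfile.mkdtemp()) / "events"
(root / "_delta_log").mkdir(parents=True)

# Hypothetical data file names; real ones are UUID-based part files.
(root / "part-00000.snappy.parquet").write_bytes(b"")
(root / "part-00001.snappy.parquet").write_bytes(b"")

# A minimal commit entry referencing the data files (schema simplified).
commit = [{"add": {"path": "part-00000.snappy.parquet"}},
          {"add": {"path": "part-00001.snappy.parquet"}}]
(root / "_delta_log" / "00000000000000000000.json").write_text(
    "\n".join(json.dumps(action) for action in commit))

data_files = sorted(p.name for p in root.glob("*.parquet"))
log_files = sorted(p.name for p in (root / "_delta_log").glob("*.json"))
print(data_files)  # the actual data: plain Parquet files
print(log_files)   # the "Delta" part: the JSON transaction log
```

Because the data files are standard Parquet, any Parquet-aware reader can open them directly; it is the `_delta_log` directory that a Delta-aware engine consults to know which files make up the current table version.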
Author: Keng Suppaseth