Explanation
When comparing Parquet vs CSV for external tables in Databricks:
Parquet advantages:
- Schema enforcement: Parquet files have a well-defined schema embedded within the file format itself, which includes data types, column names, and metadata.
- Columnar storage: Parquet is a columnar format optimized for analytics workloads.
- Compression: Better compression ratios compared to CSV.
- Schema evolution: new columns can be added over time and reconciled across files (for example, via schema merging on read).
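As a minimal sketch of how the embedded schema pays off, here is what creating an external Parquet table looks like in Databricks SQL. The table name and storage path are hypothetical; the point is that no column list is required, because the columns and types are read from the Parquet file footers:

```sql
-- Hypothetical table and path; schema comes from the Parquet files themselves.
CREATE TABLE sales_parquet
USING PARQUET
LOCATION 's3://my-bucket/sales/parquet/';

-- Column names and data types are taken from the embedded file schema.
DESCRIBE TABLE sales_parquet;
```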
CSV limitations:
- No embedded schema: CSV files carry no schema information; the schema must be inferred from the data or declared explicitly.
- Type inference issues: Databricks must infer data types from the CSV content, which requires scanning the data and can misread types (for example, treating ZIP codes as integers).
- No built-in compression: plain-text files are typically much larger; external compression such as gzip helps but makes each file non-splittable for parallel reads.
- Parsing overhead: every query re-parses the text, and whole rows must be read even when only a few columns are needed.
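For contrast, a sketch of the equivalent external CSV table (again with hypothetical names and path): because CSV carries no schema, you either declare every column explicitly, as below, or fall back to inference over the text:

```sql
-- Hypothetical table and path; the column list must be supplied by hand,
-- otherwise types are inferred from the raw text and may come back wrong.
CREATE TABLE sales_csv (
  order_id   BIGINT,
  amount     DECIMAL(10, 2),
  order_date DATE
)
USING CSV
OPTIONS (header 'true')
LOCATION 's3://my-bucket/sales/csv/';
```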
Why other options are incorrect:
- A: Both Parquet and CSV files can be partitioned in Databricks.
- B: External tables created from Parquet files don't automatically become Delta tables; they remain external tables pointing at the underlying Parquet files unless explicitly converted.
- D: While Parquet files can be optimized, this is not the primary benefit over CSV for external tables.
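To illustrate the point about option B: in Databricks, turning a Parquet table into a Delta table is an explicit step via `CONVERT TO DELTA`, not something that happens automatically. The table name and path below are hypothetical:

```sql
-- Explicit conversion of an existing external Parquet table to Delta.
CONVERT TO DELTA sales_parquet;

-- Or convert a Parquet directory directly by path.
CONVERT TO DELTA parquet.`s3://my-bucket/sales/parquet/`;
```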
The key benefit is that Parquet's embedded schema eliminates schema inference issues and provides better type safety compared to CSV.