
Answer-first summary for fast verification
Answer: C. Parquet files have a well-defined schema
**Explanation:** A Parquet file embeds its schema (column names and data types) in the file itself. A CSV file is plain text with no schema, so its column names and types must be inferred or declared by whoever reads it. This matters for CREATE TABLE AS SELECT (CTAS) statements, which take the table schema from the query result rather than from an explicit declaration:

1. **Schema Preservation**: Parquet stores schema metadata (column names, data types) directly in the file, so the resulting table gets the same columns and types that were written.
2. **No Schema Inference Issues**: With CSV, Spark has to be told how to read the file (header row, delimiter) and either infers types by scanning the data or treats every column as a string. Either path can yield wrong data types or mishandled null values.
3. **Better Performance**: Parquet's columnar format lets queries read only the columns they need. This is a general Parquet advantage rather than one specific to CTAS.
4. **Data Type Support**: Parquet supports complex data types (arrays, structs, maps) that CSV cannot natively represent.

Options A and D are also true of Parquet files (they can be partitioned and optimized), but the most direct benefit for CTAS operations is the well-defined schema, which eliminates the schema-related issues that commonly occur with CSV files.
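The schema point can be illustrated without Spark. The following is a minimal sketch using only Python's standard library; the column names and sample rows are hypothetical and chosen purely for illustration. It writes typed values to CSV text and reads them back, showing that the type information does not survive the round trip.

```python
import csv
import io

# Hypothetical sample rows with an int, a float, and a string column.
rows = [
    {"order_id": 1, "amount": 19.99, "country": "TH"},
    {"order_id": 2, "amount": 5.00, "country": "US"},
]

# Write the rows as CSV text (in memory, so nothing touches disk).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["order_id", "amount", "country"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back. The header is just the first line of text, and
# every value comes back as a string because CSV stores no types.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
restored_types = {k: type(v).__name__ for k, v in restored[0].items()}

print(original_types)  # {'order_id': 'int', 'amount': 'float', 'country': 'str'}
print(restored_types)  # {'order_id': 'str', 'amount': 'str', 'country': 'str'}
```

A reader of this CSV has to decide separately that `order_id` is an integer and `amount` is a decimal. In Spark that decision is made through reader options such as `header` and `inferSchema` or through an explicit schema, whereas a Parquet file would carry the types in its own metadata.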
Author: Keng Suppaseth
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
A. Parquet files can be partitioned
B. CREATE TABLE AS SELECT statements cannot be used on files
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
E. Parquet files will become Delta tables