
Answer-first summary for fast verification
Answer: Databricks' schema inference engine selects data types that accommodate all observed values, making manual schema declaration a better choice for enforcing data quality and strict typing.
### Why this is correct:

* **Broad/Safe Types by Default:** When Databricks infers a schema (using Auto Loader or `spark.read.json`), it examines a sample of the data and chooses the most permissive type that fits (often `StringType`) so that every observed value can be represented. This prevents ingestion failures but is not optimal for data integrity.
* **Data Quality Assurance:** By manually declaring the schema, you can enforce strict types (such as `IntegerType` or `TimestampType`). This acts as a quality gate, ensuring that the 45 fields required by downstream consumers are typed correctly and preventing schema drift from silently corrupting your Silver layer.

### Why the other options are incorrect:

* **Option A:** While the Tungsten engine is highly optimized, storing all fields as `StringType` is not efficient. It can increase storage size and slow down filters and aggregations compared to specific numeric or date types.
* **Option C:** Delta Lake manages schema evolution through its transaction log, not by manual edits to Parquet footers. Directly modifying Parquet metadata is not a supported or safe way to evolve Delta tables.
* **Option D:** Schema inference only looks at the source data; it has no context regarding the specific needs or constraints of downstream dashboards and models. Only manual declaration ensures the schema meets the requirements of the consumers.
Author: LeetQuiz Editorial Team
A data engineer is designing a Silver-layer table, `silver_device_recordings`, to ingest complex, nested JSON data containing 100 unique fields. Downstream production dashboards and machine learning models only utilize 45 of these fields. When deciding whether to use manual schema declaration or schema inference, which of the following statements is most relevant to the decision-making process?
A
Because Databricks uses Tungsten encoding to optimize string data, storing all nested JSON as string types is consistently the most efficient approach for query performance.
B
Databricks' schema inference engine selects data types that accommodate all observed values, making manual schema declaration a better choice for enforcing data quality and strict typing.
C
Since Delta Lake uses the Parquet storage format, schema evolution is typically performed by directly modifying the metadata in the file footers of existing data files.
D
Schema inference and evolution on Databricks are designed to guarantee that inferred types will automatically align with the data type requirements of downstream consumers.