
### Answer-First Summary

**Answer: D.** Databricks' schema inference selects types broad enough to accommodate all observed data, so manually declaring the schema provides stricter typing and stronger data-quality guarantees.
### Explanation

**Correct Answer: D**

When Databricks infers a schema (for example with Auto Loader or `spark.read.json`), it samples the data and selects the most permissive types to avoid ingestion failures. This often results in fields being typed as `StringType` even when they are logically integers or dates. By manually declaring the schema for the 45 relevant fields, the engineer enforces strict types and catches schema drift or type mismatches before they reach downstream production systems.

**Incorrect Options:**

* **A:** Data types cannot be modified by directly editing Parquet file footer metadata. Delta Lake manages schema evolution through its transaction log and, when necessary, file rewrites.
* **B:** While Tungsten optimizes performance generally, storing all data as strings is inefficient: it inflates storage size and slows numeric, boolean, and temporal operations compared to appropriate native types.
* **C:** Schema inference is purely data-driven and has no awareness of the requirements of downstream dashboards or models; manual intervention is required to ensure types meet consumer expectations.
Author: LeetQuiz Editorial Team
### Question

A data engineer is designing the schema for a Silver-layer table, `silver_device_recordings`, which processes highly nested JSON data containing 100 unique fields. Only 45 of these fields are required for downstream production models and dashboards. Given the number and complexity of the fields, which of the following statements is most relevant to the engineer's decision between schema inference and manual declaration?
**A.** Since Delta Lake uses Parquet storage, data types can be evolved and modified by directly editing the file footer metadata.

**B.** Databricks' Tungsten engine is optimized for string data, making the use of string types for all JSON fields the most computationally efficient approach.

**C.** Schema inference and evolution automatically ensure that the resulting data types will always align with the requirements of downstream consumers.

**D.** Databricks' schema inference selects types broad enough to accommodate all observed data, so manual schema definition provides superior data quality assurance and stricter typing.