
A data engineer is designing a Silver-layer table, silver_device_recordings, to ingest complex, nested JSON data containing 100 unique fields. Downstream production dashboards and machine learning models only utilize 45 of these fields. When deciding whether to use manual schema declaration or schema inference, which of the following statements is most relevant to the decision-making process?
A. Because Databricks uses Tungsten encoding to optimize string data, storing all nested JSON as string types is consistently the most efficient approach for query performance.
B. Databricks' schema inference engine selects data types that accommodate all observed values, making manual schema declaration a better choice for enforcing data quality and strict typing.
C. Since Delta Lake uses the Parquet storage format, schema evolution is typically performed by directly modifying the metadata in the file footers of existing data files.
D. Schema inference and evolution on Databricks are designed to guarantee that inferred types will automatically align with the data type requirements of downstream consumers.
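
For readers weighing the two approaches named in the question, the contrast can be sketched in plain Python. This is a toy model only, not Spark's or Databricks' actual inference logic: the widening order, the helper names (type_of, infer_type, validate), and the sample field names are all assumptions made for illustration. It shows one way an inferred type might widen until every observed value fits, while a declared schema keeps only the listed fields and rejects rows that do not match.

```python
# Toy model only: this is NOT Spark's or Databricks' inference code.
# It mimics the general idea that inference widens a column's type
# until every observed value fits, while a declared schema rejects mismatches.

# Hypothetical widening order, narrowest to widest.
WIDENING_ORDER = ["long", "double", "string"]


def type_of(value):
    """Map a single Python value to a toy type name."""
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_type(values):
    """Return the narrowest toy type that accommodates all non-null values."""
    widest = 0
    for v in values:
        if v is None:
            continue
        widest = max(widest, WIDENING_ORDER.index(type_of(v)))
    return WIDENING_ORDER[widest]


def validate(records, declared_schema):
    """Split records into accepted and rejected rows against a declared schema.

    Only the declared fields are kept, so fields nobody uses are dropped.
    """
    accepted, rejected = [], []
    for rec in records:
        row = {name: rec.get(name) for name in declared_schema}
        ok = all(
            row[name] is None or type_of(row[name]) == expected
            for name, expected in declared_schema.items()
        )
        (accepted if ok else rejected).append(row)
    return accepted, rejected


# Hypothetical sample records; the field names are illustrative.
records = [
    {"device_id": 1, "heartrate": 72.5, "firmware": "v1"},
    {"device_id": 2, "heartrate": "n/a", "firmware": "v1"},
]

# Inference: one stray string value widens the whole column to string.
inferred = {f: infer_type([r.get(f) for r in records]) for f in records[0]}

# Declaration: only two fields are declared, and the bad row is set aside.
accepted, rejected = validate(
    records, {"device_id": "long", "heartrate": "double"}
)
```

In this sketch the inferred type for heartrate becomes string because of a single malformed value, whereas the declared schema keeps heartrate as double and routes the malformed row to rejected. In an actual pipeline the declared schema would more likely be a PySpark StructType covering only the fields that downstream consumers need.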