
Explanation:
Schema inference: when Spark infers a schema (for example, with spark.read.json), it examines a sample of the data and chooses the most permissive type (often StringType) so that all observed values fit. This prevents ingestion failures but may weaken data integrity.
Manual schema declaration: you specify the exact type of each field (for example, IntegerType or TimestampType). This acts as a quality gate, ensuring that the 45 fields required for downstream tasks are formatted correctly and preventing schema drift from silently corrupting your Silver layer.
Performance: storing every field as StringType is not efficient. It can increase storage size and slow down filters and aggregations compared with specific numeric or date types.
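The difference between the two approaches can be sketched in plain Python. This is a toy model, not Spark's actual inference algorithm: the three-level widening order, the helper names (infer_schema, validate), and the sample fields (device_id, heartrate, firmware) are all assumptions made for illustration. It shows how inference widens a field to fit every observed value, while a declared schema flags the value that does not conform.

```python
# Toy model only: Spark's real inference rules are richer than this.
# Assumed widening order: a narrower type always fits into a wider one.
WIDENING = ["int", "float", "string"]


def observed_type(value):
    """Classify one value into the toy type system."""
    if isinstance(value, bool):
        return "string"  # bool is a subclass of int; keep the toy model simple
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "string"


def infer_schema(records):
    """Pick, per field, the widest type seen, so every observed value fits."""
    schema = {}
    for record in records:
        for field, value in record.items():
            seen = observed_type(value)
            previous = schema.get(field, seen)
            schema[field] = max(previous, seen, key=WIDENING.index)
    return schema


def validate(records, declared):
    """Return (row index, field, value) for each value that breaks the declared schema.

    Only the declared fields are checked, mirroring the idea of declaring
    just the fields that downstream consumers need.
    """
    violations = []
    for index, record in enumerate(records):
        for field, expected in declared.items():
            if field not in record or observed_type(record[field]) != expected:
                violations.append((index, field, record.get(field)))
    return violations


# Hypothetical sample rows; the second one carries a malformed heartrate.
records = [
    {"device_id": 1, "heartrate": 72.5, "firmware": "1.0"},
    {"device_id": 2, "heartrate": "n/a", "firmware": "1.1"},
]

inferred = infer_schema(records)
declared = {"device_id": "int", "heartrate": "float"}
violations = validate(records, declared)

print(inferred)
print(violations)
```

In this sketch, inference quietly accepts the malformed row by widening heartrate to a string, whereas the declared schema surfaces it as a violation that can be rejected or quarantined before it reaches a table such as silver_device_recordings.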
A data engineer is designing a Silver-layer table, silver_device_recordings, to ingest complex, nested JSON data containing 100 unique fields. Downstream production dashboards and machine learning models only utilize 45 of these fields. When deciding whether to use manual schema declaration or schema inference, which of the following statements is most relevant to the decision-making process?
A
Because Databricks uses Tungsten encoding to optimize string data, storing all nested JSON as string types is consistently the most efficient approach for query performance.
B
Databricks' schema inference engine selects data types that accommodate all observed values, making manual schema declaration a better choice for enforcing data quality and strict typing.
C
Since Delta Lake uses the Parquet storage format, schema evolution is typically performed by directly modifying the metadata in the file footers of existing data files.
D
Schema inference and evolution on Databricks are designed to guarantee that inferred types will automatically align with the data type requirements of downstream consumers.