
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Why has Auto Loader inferred all of the columns to be of the string type?
A. Auto Loader cannot infer the schema of ingested data
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value
Explanation:
The correct answer is B. JSON is a text-based format that does not encode data types: every value is stored as text, so the file format itself gives Auto Loader no reliable signal that a field holds floats or booleans rather than strings. For text-based sources such as JSON and CSV, Auto Loader therefore infers every column as a string by default:
Strings are the safest fallback - a string column can hold any value that appears later, so the default avoids parsing errors and type-mismatch failures during schema evolution.
Type inference is opt-in for text formats - typed columns are only inferred when the engineer sets cloudFiles.inferColumnTypes to true, supplies cloudFiles.schemaHints, or provides an explicit schema. Because none of these were configured in the pipeline, every column landed in the target table as a string.
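A minimal sketch of enabling type inference for a JSON Auto Loader stream, assuming a Databricks notebook where spark is predefined; the paths and table name are placeholders:

```python
# Sketch: Auto Loader ingesting JSON with column type inference enabled.
# Paths and the target table name are placeholders for illustration.
source_path = "/mnt/raw/events/"                    # hypothetical JSON landing directory
schema_path = "/mnt/checkpoints/events/_schema"     # where Auto Loader tracks the inferred schema

df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Without this option, JSON columns are inferred as strings by default.
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
)

(
    df.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events/")
        .trigger(availableNow=True)
        .toTable("bronze.events")                   # hypothetical target table
)
```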
Why other options are incorrect:
A is incorrect because Auto Loader can infer a schema by sampling the ingested files. C is incorrect because Auto Loader supports many data types and file formats; the all-string result is specific to text-based sources when type inference is not enabled. D is incorrect because the fields in question contain float and boolean values, and the string default for JSON applies regardless of whether nulls are present.
Solution: To fix this issue, the data engineer should:
- Set cloudFiles.inferColumnTypes to true so that Auto Loader infers float and boolean columns from the JSON values
- Use the cloudFiles.schemaHints option to pin the types of specific columns
- Use cloudFiles.schemaEvolutionMode to control how schema changes are handled
- Provide an explicit schema via the schema option if the data structure is known
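A sketch combining these options, again assuming a Databricks notebook with spark predefined; the column names, types, and paths are hypothetical examples:

```python
# Sketch: using schema hints and an explicit schema with Auto Loader.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, BooleanType

# Option 1: keep inference but hint the types of known problem columns.
hinted_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/_schema")
        .option("cloudFiles.schemaHints", "price DOUBLE, is_active BOOLEAN")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # evolve by adding new columns
        .load("/mnt/raw/events/")
)

# Option 2: skip inference entirely by supplying the full schema up front.
explicit_schema = StructType([
    StructField("id", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("is_active", BooleanType(), True),
])

typed_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(explicit_schema)
        .load("/mnt/raw/events/")
)
```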