
Explanation:
The correct answer is B: JSON data is a text-based format.
Databricks Auto Loader supports schema inference for various file formats, but its default behavior differs based on whether the format embeds type information:
Formats like Parquet or Avro encode data types in the file metadata/schema, so Auto Loader can infer precise types (e.g., float, boolean, int, struct, etc.).
JSON (along with CSV and XML) is a text-based format that does not encode a schema. A file carries no declared column types, so a field's type can only be guessed from its values, and the same field may hold different kinds of values in different records.
Because of this, when no schema is explicitly provided and no type-inference options (such as cloudFiles.inferColumnTypes = true) or schema hints are used, Auto Loader defaults to inferring all columns as string (including nested fields). This conservative default avoids schema evolution problems caused by type mismatches, which matter most when the JSON data is inconsistent or changes over time.
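To make the reasoning concrete, here is a small plain-Python sketch. It only illustrates why a string column is a safe fallback for schema-less text; it is not how Auto Loader is implemented. The two records and the helper names (observed_types, as_strings) are hypothetical.

```python
import json

# Two hypothetical JSON records, as they might arrive in different source files.
records = [
    '{"price": 19.99, "in_stock": true}',
    '{"price": "N/A", "in_stock": "yes"}',
]


def observed_types(lines):
    """Collect the Python type names seen for each field across all records."""
    seen = {}
    for line in lines:
        for field, value in json.loads(line).items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return seen


def as_strings(lines):
    """Keep every value as text, which works for any mix of value types."""
    rows = []
    for line in lines:
        parsed = json.loads(line)
        rows.append(
            {k: v if isinstance(v, str) else json.dumps(v) for k, v in parsed.items()}
        )
    return rows


types = observed_types(records)  # each field shows more than one type
rows = as_strings(records)       # every value fits in a string column
```

Nothing in the files declares that price is a float, so a type guessed from the first record would not fit the second one. Reading everything as text sidesteps that conflict.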
In this scenario:
The engineer used Auto Loader on JSON without any type inference, schema hints, or an explicit schema.
As a result, even fields containing only floats or booleans were read as strings.
To get more precise types (e.g., float, boolean), you would enable:
.option("cloudFiles.inferColumnTypes", "true")
or provide schema hints for specific columns. You can also supply a full schema upfront with .schema(...).
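The three alternatives can be sketched in PySpark as follows. This is a minimal sketch, not a definitive pipeline: the helper names, paths, and column names (price, in_stock) are hypothetical, and read_json_stream only works with a SparkSession on Databricks, where the cloudFiles source is available.

```python
def auto_loader_json_options(schema_location, infer_types=False, schema_hints=None):
    """Build Auto Loader reader options for a JSON source (hypothetical helper)."""
    options = {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
    }
    if infer_types:
        # Ask Auto Loader to infer column types instead of defaulting to string.
        options["cloudFiles.inferColumnTypes"] = "true"
    if schema_hints:
        # Pin the types of specific columns and leave the rest to inference.
        options["cloudFiles.schemaHints"] = schema_hints
    return options


def read_json_stream(spark, source_path, options, schema=None):
    """Return a streaming DataFrame; requires a Databricks SparkSession."""
    reader = spark.readStream.format("cloudFiles").options(**options)
    if schema is not None:
        # An explicit schema (StructType or DDL string) bypasses inference.
        reader = reader.schema(schema)
    return reader.load(source_path)


# Hypothetical schema location and column names, for illustration only.
default_opts = auto_loader_json_options("/tmp/schemas/orders")
typed_opts = auto_loader_json_options("/tmp/schemas/orders", infer_types=True)
hinted_opts = auto_loader_json_options(
    "/tmp/schemas/orders", schema_hints="price DOUBLE, in_stock BOOLEAN"
)
```

With default_opts every column is read as a string, which is the behavior described in the question. On a cluster, a call such as read_json_stream(spark, "/tmp/landing/orders", typed_opts) would start the typed read.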
Why the other options are incorrect:
A: "Auto Loader cannot infer the schema of ingested data" is false. Auto Loader does infer schemas: it samples the source files and tracks the inferred schema over time at the location set by cloudFiles.schemaLocation. It just defaults to string for text-based formats like JSON.
C: "Auto Loader only works with string data" is false. Auto Loader produces typed columns for formats with typed schemas, such as Parquet and Avro, and for JSON when type inference, schema hints, or an explicit schema is used.
D: "All of the fields had at least one null value" is false. Nulls are common in JSON and do not force string inference. The root cause is the format itself, not the presence of nulls.
This is a common exam topic for the Databricks Certified Data Engineer - Associate, as it tests understanding of Auto Loader's safe defaults for semi-structured data and when/how to override them for better type fidelity.
Question:
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Why has Auto Loader inferred all of the columns to be of the string type?
A: Auto Loader cannot infer the schema of ingested data
B: JSON data is a text-based format
C: Auto Loader only works with string data
D: All of the fields had at least one null value