
Answer-first summary for fast verification
Answer: The pipeline configuration does not include a schema definition, preventing Auto Loader from accurately inferring data types from the JSON source.
For formats that do not encode data types, such as JSON, Auto Loader infers every column as STRING by default. This avoids type-mismatch failures during schema evolution and prevents data loss. Because this pipeline configuration provides no schema definition, the integer and boolean fields in the source are ingested as strings. To resolve the issue, the team should define a schema that specifies the correct data type for each field in the JSON source. Schema hints (cloudFiles.schemaHints) or column type inference (cloudFiles.inferColumnTypes) are alternatives when a full schema is impractical. With a schema in place, Auto Loader applies the declared types instead of falling back to STRING, which leads to more efficient data processing and accurate downstream analytics.
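The fix described above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the team's actual pipeline: the field names (id, is_active, name), the helper function names, and any paths passed in are hypothetical. It assumes a Databricks runtime where the cloudFiles source is available and an active SparkSession is passed in as spark.

```python
# Hypothetical schema for illustration: one integer, one boolean, one string field.
EVENT_SCHEMA_DDL = "id INT, is_active BOOLEAN, name STRING"


def auto_loader_options(schema_location=None, infer_column_types=False):
    """Build the Auto Loader options for a JSON source as a plain dict."""
    options = {"cloudFiles.format": "json"}
    if schema_location is not None:
        # Needed when Auto Loader infers the schema itself (it tracks the schema here).
        options["cloudFiles.schemaLocation"] = schema_location
    if infer_column_types:
        # Alternative to an explicit schema: ask Auto Loader to infer column types.
        options["cloudFiles.inferColumnTypes"] = "true"
    return options


def read_json_stream(spark, source_path, schema_ddl=None, **option_kwargs):
    """Return a streaming DataFrame over source_path using Auto Loader.

    When schema_ddl is given, the declared types are applied directly
    instead of the STRING default.
    """
    reader = spark.readStream.format("cloudFiles")
    for key, value in auto_loader_options(**option_kwargs).items():
        reader = reader.option(key, value)
    if schema_ddl is not None:
        reader = reader.schema(schema_ddl)
    return reader.load(source_path)
```

On a cluster, a call such as read_json_stream(spark, source_path, schema_ddl=EVENT_SCHEMA_DDL) would be expected to yield id as an integer and is_active as a boolean. Passing infer_column_types=True together with a schema_location, and no schema_ddl, is the inference-based alternative.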
Author: LeetQuiz Editorial Team
In a data engineering project, the team is utilizing Auto Loader to ingest data from a JSON source into a Databricks environment. They observe that despite the JSON source containing a mix of data types including integers, booleans, and strings, Auto Loader is inferring all data as STRING. This has led to data processing inefficiencies and inaccuracies in downstream analytics. Considering the need for accurate data type inference to ensure data quality and processing efficiency, which of the following is the MOST LIKELY reason for this behavior and the BEST solution to resolve it? Choose one option.
A
Auto Loader lacks the capability to infer data types from JSON sources, requiring manual data type specification for each field.
B
The JSON source's data is malformed or lacks explicit type definitions, forcing Auto Loader to default all data to STRING type.
C
Auto Loader's default setting is to infer all data as STRING to maximize compatibility across different data sources and formats.
D
The pipeline configuration does not include a schema definition, preventing Auto Loader from accurately inferring data types from the JSON source.