Databricks Certified Data Engineer - Professional

When using Auto Loader in a streaming application to ingest new files from an S3 location, it infers the schema from the first 50 GB of data or the first 1000 files it discovers, whichever limit is reached first. Which configuration should be adjusted to change the default number of files used for schema inference to 500 for all future queries?

Answer: spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles

Explanation:

Auto Loader infers the schema from the first 50 GB of data or the first 1000 files it discovers, whichever limit is reached first. To change the number of files sampled, adjust the configuration spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles. This setting caps the number of files read during schema inference and defaults to 1000; lowering it to 500 ensures that at most the first 500 files are used for schema inference in future queries.
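
For illustration, here is a minimal PySpark sketch of how this setting might be applied before starting an Auto Loader stream. The bucket paths, file format, and schema location below are placeholders, not part of the original question:

    # Lower the schema-inference sample from the default 1000 files to 500.
    # Set this before starting the stream that performs schema inference.
    spark.conf.set(
        "spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles", "500"
    )

    # Hypothetical Auto Loader stream reading JSON files from S3
    # (s3://example-bucket/... paths are placeholders).
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events")
        .load("s3://example-bucket/raw/events/")
    )

In Databricks notebooks the spark session is predefined. To make the change apply to all future queries on a cluster rather than a single session, the same key/value pair can instead be set in the cluster's Spark configuration.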