
Question 27
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files must be kept as-is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run of the pipeline, and set up the pipeline to ingest only those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
A. Databricks SQL
B. Delta Lake
C. Unity Catalog
D. Data Explorer
E. Auto Loader
Explanation:
The correct answer is E. Auto Loader is built for exactly this problem: it incrementally discovers new files as they land in a directory and ingests only the files it has not processed before, leaving the source files untouched. The other options (Databricks SQL, Delta Lake, Unity Catalog, Data Explorer) do not provide incremental file discovery. A typical read looks like this:
# Example Auto Loader code for incremental file processing
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Directory where Auto Loader stores the inferred schema (required for JSON
    # when no explicit schema is supplied)
    .option("cloudFiles.schemaLocation", "path/to/schema")
    # Skip files already present when the stream first starts (defaults to "true")
    .option("cloudFiles.includeExistingFiles", "false")
    .load("path/to/shared/directory"))
Auto Loader records which files it has already ingested in the stream's checkpoint, so each pipeline run picks up only the files that arrived since the previous run. Because it never modifies or moves the source files, it is a safe fit for a directory that is shared with other processes.
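To run this as a scheduled pipeline rather than a continuous stream, the read above can be paired with a write that sets a checkpoint location and uses the availableNow trigger, which processes everything new and then stops. This is a minimal sketch; the checkpoint path and table name below are hypothetical placeholders, not values from the question:
# Minimal write-side sketch pairing with the df defined above
# (checkpoint path and table name are hypothetical)
(df.writeStream
    # The checkpoint is where Auto Loader tracks already-ingested files
    .option("checkpointLocation", "path/to/checkpoint")
    # Process all files that arrived since the last run, then stop
    .trigger(availableNow=True)
    .toTable("bronze_raw_files"))
Each scheduled run restarts the stream against the same checkpoint, ingests only the newly arrived files, and exits, which matches the incremental, run-by-run behavior the question asks for.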