
Answer-first summary for fast verification
Answer: Auto Loader
## Explanation

Auto Loader is the correct tool for this scenario because it is specifically designed to incrementally ingest new files from cloud storage or file systems.

**Key features of Auto Loader:**

1. **Incremental file processing**: Auto Loader automatically tracks which files have been processed and only loads new files in subsequent runs.
2. **File tracking**: It maintains state information about processed files, allowing it to identify new files since the last run.
3. **Efficient processing**: It is optimized for streaming and incremental data-ingestion patterns.
4. **Cloud storage support**: It works with various cloud storage systems (AWS S3, Azure Blob Storage, etc.).

**Why the other options are incorrect:**

- **Unity Catalog**: A unified governance solution for data and AI assets, not designed for incremental file ingestion.
- **Delta Lake**: An open-source storage layer that provides ACID transactions, but does not inherently track new files in a directory.
- **Databricks SQL**: A SQL analytics service for running queries, not for incremental file ingestion.
- **Data Explorer**: A tool for exploring and visualizing data, not for pipeline file ingestion.

**How Auto Loader solves this problem:**

1. The data engineer configures Auto Loader to monitor the shared directory.
2. On the first run, Auto Loader processes all existing files and records their metadata.
3. On subsequent runs, Auto Loader checks the directory, identifies files that were not processed before, and ingests only those new files.
4. The processed files remain in the directory untouched, as the scenario requires.

This makes Auto Loader the ideal solution for incremental file ingestion in pipelines where source files accumulate and must be processed exactly once.
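The four steps above can be illustrated with a short, self-contained Python sketch of the underlying idea: a checkpoint that records which files have been seen, so each run ingests only the difference. This is *not* Auto Loader's actual implementation (Auto Loader uses a streaming checkpoint and scalable file-notification or listing modes, not a plain text file); the directory and file names below are hypothetical.

```python
import os
import tempfile

def ingest_new_files(directory, checkpoint_path):
    """Return only the files not seen in previous runs, mimicking
    Auto Loader's processed-file tracking. Source files are never
    modified or moved."""
    # Load the set of files recorded as processed in earlier runs.
    processed = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            processed = set(f.read().split())

    current = set(os.listdir(directory))
    new_files = sorted(current - processed)

    # Record the newly processed files for the next run.
    with open(checkpoint_path, "a") as f:
        for name in new_files:
            f.write(name + "\n")
    return new_files

# Demo: a temporary directory stands in for the shared source directory.
src = tempfile.mkdtemp()
ckpt = os.path.join(tempfile.mkdtemp(), "processed.txt")

open(os.path.join(src, "a.csv"), "w").close()
first = ingest_new_files(src, ckpt)   # first run: all existing files

open(os.path.join(src, "b.csv"), "w").close()
second = ingest_new_files(src, ckpt)  # second run: only the new file
```

Note that the source files are left in place after both runs, matching the scenario's requirement that the shared directory stay untouched.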
Author: Keng Suppaseth
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run. Which of the following tools can the data engineer use to solve this problem?
A. Unity Catalog
B. Delta Lake
C. Databricks SQL
D. Data Explorer
E. Auto Loader