
**Answer: E. Auto Loader**
## Explanation

Auto Loader is specifically designed for incremental data ingestion scenarios where files accumulate in a directory and you need to process only the files added since the last run.

### Key Features of Auto Loader

1. **Incremental File Processing**: Auto Loader automatically tracks which files have been processed and loads only new files in subsequent runs.
2. **File Notification Mode**: Uses cloud-native file notification services (AWS SQS, Azure Event Grid, or GCP Pub/Sub) to detect new files efficiently without listing the directory.
3. **Directory Listing Mode**: Falls back to directory listing when file notification isn't available.
4. **State Management**: Maintains state about processed files, ensuring idempotent processing.

### Why the Other Options Are Incorrect

- **A. Databricks SQL**: Primarily for querying and analyzing data, not for incremental file ingestion.
- **B. Delta Lake**: Provides ACID transactions and versioning for data lakes, but doesn't by itself solve the incremental file-detection problem.
- **C. Unity Catalog**: A unified governance solution for data and AI assets, not an ingestion tool.
- **D. Data Explorer**: A tool for exploring and visualizing data, not for pipeline orchestration or incremental ingestion.

### Use Case Fit

The scenario describes exactly what Auto Loader is built for: a shared directory where files accumulate and must be processed incrementally while being left in place. Auto Loader's ability to track processed files and ingest only new ones makes it the right choice.
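In practice, the incremental ingestion described above is a short Structured Streaming job. A minimal sketch using the real `cloudFiles` source (runs only on a Databricks cluster; the paths and table name below are hypothetical placeholders):

```python
# Auto Loader source: "cloudFiles" is Databricks-only, so this runs on a
# Databricks cluster where `spark` is predefined.
df = (spark.readStream
      .format("cloudFiles")                           # Auto Loader
      .option("cloudFiles.format", "json")            # format of the incoming files
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
      .load("/mnt/shared/landing"))                   # shared dir; files stay in place

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/landing")  # records processed files
   .trigger(availableNow=True)                        # ingest all new files, then stop
   .toTable("bronze.landing_events"))                 # append into a Delta table
```

Because the checkpoint records every file already ingested, rerunning the job picks up only files that appeared since the previous run, and the source files are never modified or deleted.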
Author: Keng Suppaseth
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
A. Databricks SQL
B. Delta Lake
C. Unity Catalog
D. Data Explorer
E. Auto Loader
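Conceptually, what makes answer E correct is its bookkeeping: persist the set of already-ingested files and diff against the directory on each run. A toy, self-contained Python sketch of that idea (not the Databricks API; the function and `state_file` are hypothetical):

```python
import json
from pathlib import Path

def new_files(directory: str, state_file: str) -> list[str]:
    """Return files in `directory` not seen in previous runs, then update state."""
    state_path = Path(state_file)
    # Load the set of file names recorded by earlier runs (empty on first run).
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    current = {p.name for p in Path(directory).iterdir() if p.is_file()}
    fresh = sorted(current - seen)                              # files added since last run
    state_path.write_text(json.dumps(sorted(seen | current)))   # persist updated state
    return fresh
```

The first run returns every file; later runs return only additions, and the source files are left untouched, mirroring the "keep the files as is" requirement in the question.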