
Answer-first summary for fast verification
Answer: Auto Loader
## Detailed Explanation

### **Why Auto Loader (Option C) is the Correct Choice**

Auto Loader is designed for incremental file ingestion in Azure Databricks and addresses each of the stated requirements:

**1. Incremental Processing**: Auto Loader exposes a Structured Streaming source named `cloudFiles` that detects and processes new files as they arrive in Azure Data Lake Storage Gen2. It persists the state of discovered files in the stream's checkpoint location, so only new files are processed and no manual tracking is needed.

**2. Minimized Implementation & Maintenance**: Auto Loader reduces implementation effort by:
- Discovering and tracking files automatically
- Removing the need for a custom file-monitoring solution
- Resuming from the checkpoint after a failure or restart instead of reprocessing the whole directory

**3. Cost Optimization for Millions of Files**: Auto Loader offers two file detection modes:
- **Directory Listing Mode**: Finds new files by listing the input directory. It needs no extra setup and suits smaller workloads.
- **File Notification Mode**: Uses Azure Event Grid and Azure Queue Storage to receive file events instead of listing the directory, which avoids repeated listing costs when the directory holds millions of files.

**4. Schema Inference & Evolution**: Auto Loader provides built-in schema handling:
- Infers the schema from incoming data files
- Supports schema evolution (drift) as new columns appear
- Can capture values that do not match the expected schema in a rescued data column rather than dropping them

### **Analysis of Other Options**

**A. COPY INTO**:
- A SQL command for batch loading, not a Structured Streaming source
- Skips files it has already loaded, but each run must be scheduled and invoked separately
- Generally recommended for thousands of files rather than millions

**B. Azure Data Factory**:
- Can orchestrate data movement, but it is not a Structured Streaming source
- Would require additional pipeline configuration for incremental processing
- Higher implementation and maintenance overhead than Auto Loader
- Less suited to streaming scenarios inside Databricks

**D. Apache Spark FileStreamSource**:
- Spark's built-in file streaming source, which by default requires the schema to be specified up front
- Has no built-in schema evolution
- Discovers files by listing the directory, which becomes slow and costly with millions of files
- Less optimized for cloud storage than Auto Loader

### **Conclusion**

Auto Loader is the best choice because it is built for incremental file ingestion from cloud storage in Azure Databricks. It is the only option that is itself a Structured Streaming source, scales to millions of files through file notifications, and handles schema inference and drift without custom code.
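
To make the explanation above concrete, here is a minimal sketch of how an Auto Loader stream over storage1 might be configured in PySpark. The function names, the JSON file format, and every path and table name are assumptions made for illustration; none of them come from the question. The `cloudFiles.*` option keys are standard Auto Loader options, but the values shown are only one reasonable choice.

```python
from typing import Any, Dict


def auto_loader_options(
    file_format: str,
    schema_location: str,
    use_notifications: bool = True,
) -> Dict[str, str]:
    """Build the `cloudFiles` options for an Auto Loader stream.

    Hypothetical helper: it only assembles option strings, so it can be
    inspected without a Spark session.
    """
    return {
        # Format of the incoming files (assumed here; the question does not say).
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Infer column types instead of reading every column as a string.
        "cloudFiles.inferColumnTypes": "true",
        # Add newly seen columns to the schema (schema drift).
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
        # File notification mode instead of directory listing.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_ingest(
    spark: Any, source_path: str, checkpoint_path: str, target_table: str
) -> Any:
    """Start an incremental ingest; `spark` is a Databricks SparkSession."""
    options = auto_loader_options("json", checkpoint_path)
    stream = (
        spark.readStream.format("cloudFiles").options(**options).load(source_path)
    )
    return (
        stream.writeStream
        # The checkpoint records which files have already been processed.
        .option("checkpointLocation", checkpoint_path)
        # Let the Delta target accept columns added by schema evolution.
        .option("mergeSchema", "true")
        # Process everything that is new, then stop (suits daily uploads).
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

On a Databricks cluster this could be called as `start_ingest(spark, "abfss://<container>@storage1.dfs.core.windows.net/<path>", "<checkpoint-path>", "<target-table>")`, with the placeholders replaced by real values. File notification mode also needs permissions to create or use the Event Grid subscription and storage queue, which this sketch does not cover.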
Author: LeetQuiz Editorial Team
You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1. New files are uploaded daily to storage1.
You need to recommend a solution to configure storage1 as a structured streaming source that meets the following requirements:
What should you include in the recommendation?
A. COPY INTO
B. Azure Data Factory
C. Auto Loader
D. Apache Spark FileStreamSource