A data engineer is tasked with creating a data pipeline that ingests data from a high-velocity source system that generates millions of files daily, stored in cloud storage. The pipeline must incrementally identify and ingest only the files that are new since the last run, while also accommodating expected schema changes over time. Which technique should the data engineer use to address these requirements?
Explanation:
Auto Loader is the optimal choice for this scenario: it is built to ingest millions of files efficiently, it incrementally discovers only the files that are new since the last run, and it supports schema inference and schema evolution. COPY INTO is better suited to workloads on the order of thousands of files and makes it easier to re-ingest a known subset of re-uploaded files, but it does not match Auto Loader's scalability or its handling of evolving schemas at this volume. MERGE, Databricks SQL, and Delta Lake on their own do not address incremental file discovery or schema evolution, so they are not relevant for this scenario.
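To illustrate, a minimal Auto Loader sketch in PySpark is shown below. The source path, checkpoint and schema locations, file format, and target table name are hypothetical placeholders; the `cloudFiles` options shown are the ones that enable incremental file discovery and schema evolution.

```python
# Assumes a Databricks notebook where `spark` is the active SparkSession.
# All paths and the target table name are placeholders for illustration.
df = (
    spark.readStream
        .format("cloudFiles")                                   # Auto Loader source
        .option("cloudFiles.format", "json")                    # format of incoming files (assumed)
        .option("cloudFiles.schemaLocation",
                "/mnt/checkpoints/_schemas/events")             # where inferred schema state is tracked
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # pick up new columns over time
        .load("/mnt/raw/events/")                               # cloud storage landing path
)

query = (
    df.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events")  # tracks which files were already ingested
        .option("mergeSchema", "true")                             # let the Delta sink accept new columns
        .trigger(availableNow=True)                                # process all new files, then stop
        .toTable("bronze.events")                                  # target Delta table (assumed name)
)
```

Because progress is recorded in the checkpoint and schema locations, each run picks up only files that arrived since the previous run, and newly appearing columns are added to the schema rather than failing the pipeline.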