
**Answer:** Use a `MERGE INTO` operation with a `WHEN NOT MATCHED THEN INSERT` clause keyed on a unique identifier.
### Explanation

**Correct Answer: C**

The most effective and efficient method for ensuring idempotency when loading data into a Delta table is the **`MERGE INTO`** operation. By specifying a match condition on a unique key and using only the **`WHEN NOT MATCHED THEN INSERT`** clause, the engine appends only those records whose keys do not already exist in the target table. Even if late-arriving or duplicate records are processed, the target table remains free of duplicates.

**Why the other options are incorrect:**

* **Schema enforcement:** This feature ensures data type and column name consistency. It does not enforce data uniqueness or primary key constraints.
* **VACUUM:** This is a maintenance command that removes old data files no longer referenced by the Delta table's current version in order to free up storage. It does not perform data deduplication.
* **Full outer join + overwrite:** While this could theoretically remove duplicates, it is highly inefficient for large datasets: it requires a full table scan and a complete rewrite of the target table for every incoming batch, whereas `MERGE` is optimized for incremental updates.
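As a minimal sketch, an insert-only `MERGE` of this kind could look like the following. The table and column names (`target`, `updates`, `event_id`) are illustrative assumptions, not part of the question:

```sql
-- Idempotent append: insert only rows whose unique key is not
-- already present in the target Delta table.
-- Table and column names here are illustrative placeholders.
MERGE INTO target AS t
USING (
  -- Deduplicate within the incoming batch first
  SELECT DISTINCT * FROM updates
) AS s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN
  INSERT *
```

Because no `WHEN MATCHED` clause is present, rows whose keys already exist in `target` are simply skipped, which is what makes re-running the same batch safe.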
Author: LeetQuiz Editorial Team
### Question

A data engineer is architecting a pipeline that must handle late-arriving records containing potential duplicates. Beyond deduplicating within each incoming batch, which strategy should be used to ensure that records already stored in a target Delta table are not duplicated during ingestion?

**A.** Perform a full outer join between the incoming batch and the target table on a unique key, followed by a full table overwrite.

**B.** Enable Delta Lake schema enforcement to automatically identify and block records with duplicate keys.

**C.** Utilize a `MERGE INTO` operation with a `WHEN NOT MATCHED THEN INSERT` clause based on a unique identifier.

**D.** Execute the `VACUUM` command on the Delta table after every batch to remove redundant data entries from the transaction log.