
Answer-first summary for fast verification
Answer: Use a `MERGE INTO` operation with a `WHEN NOT MATCHED` clause based on a unique key.
### Explanation

**Correct Choice: Use a `MERGE INTO` operation with a `WHEN NOT MATCHED` clause based on a unique key.**

* **Atomic Idempotency**: Delta Lake's `MERGE INTO` statement supports conditional logic within a single transaction. With a `WHEN NOT MATCHED THEN INSERT` clause keyed on a primary or unique key, incoming records are checked against the target table: if the key already exists, the record is ignored; if not, it is inserted. Late-arriving duplicates or re-run batches therefore never produce duplicate rows in the target table.
* **Efficiency**: This method is efficient for incremental loads because it only writes new records rather than rewriting the entire table.

**Why the other options are incorrect:**

* **Schema Enforcement**: Delta Lake schema enforcement (and schema evolution) ensures data type and column name consistency. It does not validate row uniqueness or enforce primary key constraints.
* **VACUUM**: The `VACUUM` command handles data retention and storage optimization by deleting old data files that are no longer referenced by the current table state. It has no deduplication functionality.
* **Full Outer Join + Overwrite**: While a join can identify duplicates, performing a full outer join and overwriting the entire table on every batch is computationally expensive for large production datasets, especially compared to the incremental nature of a `MERGE` operation.
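The pattern above can be sketched as follows. This is a minimal example, assuming a hypothetical Delta target table `events` with unique key `event_id` and a staging view `events_updates` holding the current micro-batch (already de-duplicated within the batch):

```sql
-- Idempotent append: rows whose event_id already exists in the target
-- are silently skipped; only genuinely new rows are inserted.
MERGE INTO events AS target
USING events_updates AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
  INSERT *
```

Because there is no `WHEN MATCHED` clause, re-running the same batch is a no-op: every key already matches, so nothing is inserted twice.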
Author: LeetQuiz Editorial Team
A data engineer is designing a pipeline to handle late-arriving and duplicate records. Beyond de-duplicating data within the current micro-batch, which technique effectively prevents duplicate records from being inserted into an existing Delta table by checking against previously processed data?
A. Perform a full outer join on a unique key and overwrite the entire target table with the result.

B. Enable Delta Lake schema enforcement to block duplicate records during the write operation.

C. Use a MERGE INTO operation with a WHEN NOT MATCHED clause based on a unique key.

D. Execute the VACUUM command on the Delta table after each batch to remove stale duplicates.