
Explanation:
Correct Answer: C
The most efficient way to make ingestion into Delta Lake idempotent is the MERGE INTO operation. With a match condition on a unique key and only a WHEN NOT MATCHED THEN INSERT clause, the engine inserts only those source records whose keys do not already exist in the target table; records that match an existing key are left untouched. Combined with deduplicating each incoming batch, this keeps the target table free of duplicates even when late-arriving or repeated records are processed.
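As a rough illustration of the insert-only merge described above, the following Python sketch builds the MERGE statement and runs it through an existing SparkSession. The function names, the temporary view name incoming_batch, and the table and key names in the usage note are hypothetical, and the sketch assumes a Spark session with Delta Lake already configured.

```python
def insert_only_merge_sql(target: str, source: str, key: str) -> str:
    """Build a MERGE statement that inserts only rows whose key is absent from the target."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def ingest_batch(spark, batch_df, target: str, key: str) -> None:
    """Deduplicate the batch, then insert only keys missing from the target (hypothetical helper)."""
    # Deduplicate within the batch first: MERGE inserts every unmatched
    # source row, so in-batch duplicates would otherwise all be written.
    deduped = batch_df.dropDuplicates([key])
    # Expose the batch to SQL under an assumed temporary view name.
    deduped.createOrReplaceTempView("incoming_batch")
    spark.sql(insert_only_merge_sql(target, "incoming_batch", key))
```

A call might look like ingest_batch(spark, batch_df, "events", "event_id"). Because rows with matching keys are skipped, replaying the same batch should leave the target table unchanged.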
Why other options are incorrect:
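A: A full outer join followed by a full table overwrite rewrites the entire table on every batch, which is expensive and scales poorly. MERGE is optimized for incremental updates.
B: Schema enforcement validates column names and data types on write. It does not check whether a key already exists in the table.
D: VACUUM deletes data files that are no longer referenced by the table and are older than the retention threshold. It does not deduplicate rows and does not modify the transaction log.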
A data engineer is architecting a pipeline that must handle late-arriving records containing potential duplicates. Beyond deduplicating within each incoming batch, which strategy should be used to ensure that records already stored in a target Delta table are not duplicated during ingestion?
A
Perform a full outer join between the incoming batch and the target table on a unique key, followed by a full table overwrite.
B
Enable Delta Lake schema enforcement to automatically identify and block records with duplicate keys.
C
Utilize a MERGE INTO operation with a WHEN NOT MATCHED THEN INSERT clause based on a unique identifier.
D
Execute the VACUUM command on the Delta table after every batch to remove redundant data entries from the transaction log.