
Answer-first summary for fast verification
Answer: Implement an 'insert-only' `MERGE` operation with a matching condition on a unique key.
The most effective way to deduplicate incoming records against existing data in a Delta table is a `MERGE` operation. By defining a matching condition on a unique key (e.g., `ON source.id = target.id`) and using only the `WHEN NOT MATCHED THEN INSERT *` clause, the operation inserts only those records that do not already exist in the target table. This is often called an 'insert-only' merge because rows that already exist in the target are left unchanged.

**Why other options are incorrect:**

* **VACUUM**: This command removes data files that are no longer referenced by the table and are older than the retention threshold; it has no effect on deduplication logic.
* **Schema Enforcement**: This ensures that incoming data matches the expected schema (column names and types) but does not check for duplicate keys or values.
* **delta.deduplicate**: This is not a valid configuration property in Delta Lake for write-time deduplication.
* **Full Outer Join / Overwrite**: This approach is computationally expensive, prone to errors, and lacks the atomicity and efficiency provided by the `MERGE` statement.
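The insert-only merge described above can be sketched as follows. This is a minimal illustration, not Delta Lake itself: the SQL string shows the statement form with placeholder table names `target` and `source`, and the helper functions `dedupe_batch` and `insert_only_merge` are hypothetical plain-Python models of the semantics, assuming rows are dictionaries keyed by `id`. The model matches source rows against the target as it stood before the merge, so duplicates inside a single batch would still be inserted unless the batch is deduplicated first.

```python
from typing import Dict, List

Row = Dict[str, object]

# Delta Lake SQL form of an insert-only merge.
# `target` and `source` are placeholder table names (assumptions).
INSERT_ONLY_MERGE_SQL = """
MERGE INTO target
USING source
ON source.id = target.id
WHEN NOT MATCHED THEN INSERT *
"""


def dedupe_batch(rows: List[Row], key: str = "id") -> List[Row]:
    """Batch-level deduplication: keep the first row seen for each key."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


def insert_only_merge(target: List[Row], source: List[Row], key: str = "id") -> List[Row]:
    """Plain-Python model of WHEN NOT MATCHED THEN INSERT *.

    Source rows are compared against the keys present in the target
    before the merge; matched rows are ignored, unmatched rows are appended.
    """
    existing_keys = {row[key] for row in target}
    return target + [row for row in source if row[key] not in existing_keys]


# Hypothetical data: id 1 already exists; id 2 arrives twice in one batch.
target_rows = [{"id": 1, "value": "a"}]
batch = [
    {"id": 1, "value": "changed"},
    {"id": 2, "value": "b"},
    {"id": 2, "value": "b"},
]

merged = insert_only_merge(target_rows, dedupe_batch(batch))
merged_without_batch_dedupe = insert_only_merge(target_rows, batch)
```

In this model, `merged` holds one row each for ids 1 and 2, with the existing row for id 1 unchanged, while `merged_without_batch_dedupe` contains id 2 twice. That is why batch-level deduplication and the insert-only merge are typically used together.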
Author: LeetQuiz Editorial Team
A data engineer is designing an ETL workflow to handle late-arriving and potentially duplicate records from a single data source. While batch-level deduplication is feasible, the engineer needs a method to deduplicate incoming data against records already residing in the target Delta table. Which approach allows the engineer to deduplicate data against previously processed records during the insertion process?
A. Configure the table property `delta.deduplicate` to true.
B. Execute a `VACUUM` operation on the Delta table after each batch completes.
C. Utilize Delta Lake schema enforcement to prevent the insertion of duplicate records.
D. Perform a full outer join on a unique key and overwrite existing data.
E. Implement an 'insert-only' `MERGE` operation with a matching condition on a unique key.