
Explanation:
The most effective way to deduplicate incoming records against existing data in a Delta table is by using a MERGE operation.
By defining a matching condition on a unique key (e.g., ON source.id = target.id) and using the WHEN NOT MATCHED THEN INSERT * clause, the operation will only insert records that do not already exist in the target table. This is often referred to as an 'insert-only' merge because it ignores updates to existing rows.
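As a minimal sketch of the statement described above, the helper below assembles an insert-only MERGE as a SQL string. The function name build_insert_only_merge, the table names events and new_events, and the key column id are hypothetical and chosen only for illustration. The sketch assumes the incoming batch has already been staged as a table or view.

```python
def build_insert_only_merge(target: str, source: str, key: str) -> str:
    """Return an insert-only MERGE statement for a Delta table.

    The names are interpolated as-is, so pass only trusted identifiers.
    """
    return "\n".join([
        f"MERGE INTO {target} AS target",
        f"USING {source} AS source",
        # Rows count as duplicates when their keys are equal.
        f"ON source.{key} = target.{key}",
        # No WHEN MATCHED clause: rows already in the target are left untouched.
        "WHEN NOT MATCHED THEN INSERT *",
    ])


# Hypothetical names: target table "events", staged batch "new_events", key "id".
sql = build_insert_only_merge("events", "new_events", "id")
print(sql)
```

In a Spark session with Delta Lake configured, the resulting string could be run with spark.sql(sql). The matching condition only compares the source against the target, so two rows in the same batch that share a key missing from the target would both be inserted. For that reason the batch should be deduplicated on the key first, for example with dropDuplicates.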
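Why other options are incorrect:
A: Delta Lake has no table property named delta.deduplicate.
B: VACUUM removes data files that the table no longer references and that are older than the retention threshold; it does not inspect or remove duplicate rows.
C: Schema enforcement validates column names and data types on write; it does not check whether values are unique.
D: A full outer join followed by an overwrite rewrites the entire table for every batch, which is far less efficient than an insert-only MERGE statement.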
A data engineer is designing an ETL workflow to handle late-arriving and potentially duplicate records from a single data source. While batch-level deduplication is feasible, the engineer needs a method to deduplicate incoming data against records already residing in the target Delta table. Which approach allows the engineer to deduplicate data against previously processed records during the insertion process?
A
Configure the table property delta.deduplicate to true.
B
Execute a VACUUM operation on the Delta table after each batch completes.
C
Utilize Delta Lake schema enforcement to prevent the insertion of duplicate records.
D
Perform a full outer join on a unique key and overwrite existing data.
E
Implement an 'insert-only' MERGE operation with a matching condition on a unique key.