Databricks Certified Data Engineer - Associate


You need to load data from an external source into a Delta Lake table, but the data might contain duplicates that should not be written to the table. Which SQL command would you use to ensure that only new, non-duplicate records are added to the table?

Answer: MERGE INTO

Explanation:

MERGE INTO in Delta Lake allows you to conditionally insert, update, or delete records based on a matching condition between the source and target tables. For example:

MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
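-- Insert only source rows whose id is not already present in the target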
WHEN NOT MATCHED THEN
  INSERT *

Reference: Databricks documentation on MERGE INTO (https://docs.databricks.com/aws/en/delta/merge#data-deduplication-when-writing-into-delta-tables): the MERGE INTO statement lets you merge a set of updates, insertions, and deletions from a source table into a Delta table, which makes it useful for deduplication and upserts.
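
As a further illustration of the deduplication-plus-upsert pattern described on that page, the sketch below reuses the target_table and source_table names from the example above. It assumes that duplicate rows within the source are exact copies of each other and that id uniquely identifies a record; the matching and update logic would need to be adapted to your actual schema:

MERGE INTO target_table AS t
USING (
  -- Collapse exact duplicate rows that may exist within the incoming batch
  SELECT DISTINCT * FROM source_table
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET *   -- refresh rows whose id already exists in the target
WHEN NOT MATCHED THEN
  INSERT *       -- add rows whose id is not yet in the target

Deduplicating the source before the merge matters because MERGE raises an error if more than one source row matches the same target row.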