
Answer-first summary for fast verification
Answer: MERGE, because it is specifically designed to handle deduplication upon writing from multiple sources, including streaming data, and it efficiently scales with increasing data volumes by only processing changes.
Option C is correct: MERGE (MERGE INTO) is designed for upserts and insert-only deduplication when writing from multiple sources, including streaming ones, and because it rewrites only the files containing matched keys, its cost tracks the size of the changes rather than the size of the table. Option A is incorrect because CREATE OR REPLACE TABLE rebuilds the table from scratch, which neither supports incremental streaming writes nor scales as a deduplication strategy. Option B is incorrect because INSERT OVERWRITE rewrites the entire table (or partition) on every run, so its cost grows with total data volume, not with the volume of new data. Option D is incorrect because COPY INTO is idempotent only at the file level; it cannot deduplicate individual rows arriving from multiple streaming sources and does not scale for this use case as well as MERGE.
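To make the reasoning concrete, here is a minimal insert-only MERGE sketch. The table and column names (`target_table`, `source_updates`, `event_id`, `payload`, `event_time`) are illustrative, not from the question; the inner query assumes Databricks SQL's `QUALIFY` clause to drop duplicates within the incoming batch before matching against the target:

```sql
-- Hedged sketch: insert-only MERGE that deduplicates on a key column.
MERGE INTO target_table AS t
USING (
  -- First drop duplicates within the incoming data itself,
  -- keeping the most recent row per key.
  SELECT event_id, payload, event_time
  FROM source_updates
  QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time DESC) = 1
) AS s
ON t.event_id = s.event_id
-- Rows whose key already exists in the target are skipped,
-- so only genuinely new records are written.
WHEN NOT MATCHED THEN
  INSERT (event_id, payload, event_time)
  VALUES (s.event_id, s.payload, s.event_time)
```

Because there is no `WHEN MATCHED` clause, this variant never updates existing rows, which is exactly the deduplicating-insert behavior the question asks for.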
Author: LeetQuiz Editorial Team
You are designing a data pipeline in Azure Databricks that requires inserting data into a target Delta table from multiple streaming sources. The pipeline must ensure that data is deduplicated before being written to the target table to maintain data integrity. Additionally, the solution must be cost-effective and scalable to handle increasing data volumes over time. Considering these requirements, which command should you use and why? (Choose one correct answer from the options below.)
A
CREATE OR REPLACE TABLE, because it allows you to create a new table with the deduplicated data, but it does not efficiently handle streaming data or scale with increasing data volumes.
B
INSERT OVERWRITE, because it allows you to overwrite the target table with the deduplicated data, but this approach may not be cost-effective or scalable due to the overhead of rewriting the entire table.
C
MERGE, because it is specifically designed to handle deduplication upon writing from multiple sources, including streaming data, and it efficiently scales with increasing data volumes by only processing changes.
D
COPY INTO, because its idempotent file loading avoids re-ingesting the same source files, but it cannot deduplicate individual rows from streaming sources and may not scale as well as other options.
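For the streaming half of the scenario, the usual Databricks pattern is to run the MERGE inside a Structured Streaming `foreachBatch` sink, so each micro-batch is deduplicated and merged into the Delta target. A hedged PySpark sketch follows; it assumes a Databricks/Delta runtime (so it is not runnable standalone), and the table names, key column, and checkpoint path are illustrative:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch, then insert only new keys.
    deduped = batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forName(spark, "target_table")  # 'spark' is the session provided by the runtime
    (target.alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenNotMatchedInsertAll()   # insert-only: existing keys are left untouched
        .execute())

(spark.readStream.table("source_stream")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/dedup")  # illustrative path
    .start())
```

Multiple streaming sources can each feed a sink like this (or be unioned upstream); the per-batch MERGE keeps the target free of duplicate keys without ever rewriting the whole table.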