
Answer-first summary for fast verification
Answer: MERGE, because it is specifically designed to handle deduplication upon writing from multiple sources, including streaming data, and it efficiently scales with increasing data volumes by only processing changes.
Option C is correct: MERGE (MERGE INTO) is designed for upserts and insert-only deduplication when writing from multiple sources, including streaming ones, and because it rewrites only the files containing matched keys, its cost tracks the size of the changes rather than the size of the table. Option A is incorrect because CREATE OR REPLACE TABLE rebuilds the table from scratch, which neither supports incremental streaming writes nor scales as a deduplication strategy. Option B is incorrect because INSERT OVERWRITE rewrites the entire table (or partition) on every run, so its cost grows with total data volume, not with the volume of new data. Option D is incorrect because COPY INTO is idempotent only at the file level; it cannot deduplicate individual rows arriving from multiple streaming sources and does not scale for this use case as well as MERGE.
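To make the reasoning concrete, here is a minimal insert-only MERGE sketch. The table and column names (`target_table`, `source_updates`, `event_id`, `payload`, `event_time`) are illustrative, not from the question; the inner query assumes Databricks SQL's `QUALIFY` clause to drop duplicates within the incoming batch before matching against the target:

```sql
-- Hedged sketch: insert-only MERGE that deduplicates on a key column.
MERGE INTO target_table AS t
USING (
  -- First drop duplicates within the incoming data itself,
  -- keeping the most recent row per key.
  SELECT event_id, payload, event_time
  FROM source_updates
  QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time DESC) = 1
) AS s
ON t.event_id = s.event_id
-- Rows whose key already exists in the target are skipped,
-- so only genuinely new records are written.
WHEN NOT MATCHED THEN
  INSERT (event_id, payload, event_time)
  VALUES (s.event_id, s.payload, s.event_time)
```

Because there is no `WHEN MATCHED` clause, this variant never updates existing rows, which is exactly the deduplicating-insert behavior the question asks for.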
Author: LeetQuiz Editorial Team
You are designing a data pipeline in Azure Databricks that requires inserting data into a target Delta table from multiple streaming sources. The pipeline must ensure that data is deduplicated before being written to the target table to maintain data integrity. Additionally, the solution must be cost-effective and scalable to handle increasing data volumes over time. Considering these requirements, which command should you use and why? (Choose one correct answer from the options below.)
A
CREATE OR REPLACE TABLE, because it allows you to create a new table with the deduplicated data, but it does not efficiently handle streaming data or scale with increasing data volumes.
B
INSERT OVERWRITE, because it allows you to overwrite the target table with the deduplicated data, but this approach may not be cost-effective or scalable due to the overhead of rewriting the entire table.
C
MERGE, because it is specifically designed to handle deduplication upon writing from multiple sources, including streaming data, and it efficiently scales with increasing data volumes by only processing changes.
D
COPY INTO, because its idempotent file loading avoids re-ingesting the same source files, but it cannot deduplicate individual rows from streaming sources and may not scale as well as other options.
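For the streaming half of the scenario, the usual Databricks pattern is to run the MERGE inside a Structured Streaming `foreachBatch` sink, so each micro-batch is deduplicated and merged into the Delta target. A hedged PySpark sketch follows; it assumes a Databricks/Delta runtime (so it is not runnable standalone), and the table names, key column, and checkpoint path are illustrative:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch, then insert only new keys.
    deduped = batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forName(spark, "target_table")  # 'spark' is the session provided by the runtime
    (target.alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenNotMatchedInsertAll()   # insert-only: existing keys are left untouched
        .execute())

(spark.readStream.table("source_stream")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/dedup")  # illustrative path
    .start())
```

Multiple streaming sources can each feed a sink like this (or be unioned upstream); the per-batch MERGE keeps the target free of duplicate keys without ever rewriting the whole table.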