
Answer-first summary for fast verification
Answer: Use `MERGE INTO` to upsert the most recent entry for each `customer_id` into the table
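The `MERGE INTO` command upserts data from a source into a target Delta table and supports insert, update, and delete clauses in a single statement, which covers every action type a CDC feed can carry. Because one batch may contain several changes for the same `customer_id`, and a merge can fail when more than one source row matches the same target row, the batch is first reduced to the row with the latest `update_time` per `customer_id`, and only that row is merged. The other options do not meet the requirement: Change Data Feed records the changes made to a Delta table rather than applying an incoming feed, `dropDuplicates` does not guarantee that the most recent row is the one kept, `SEQUENCE BY` belongs to `APPLY CHANGES INTO` in Delta Live Tables and is not a `MERGE INTO` clause, and `mergeSchema` only governs schema evolution. References: [Delta Lake Merge Documentation](https://docs.databricks.com/delta/merge.html), [SQL Language Manual for Delta Merge Into](https://docs.databricks.com/sql/language-manual/delta-merge-into.html).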
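The "keep the latest change per `customer_id`, then upsert" logic can be sketched in plain Python. This is only an illustration of the semantics, not Delta Lake or Spark code: the `action` column name, the sample rows, and the helper names `latest_per_customer` and `merge_into` are all assumptions made for the example, and the target table is modelled as a dictionary keyed by `customer_id`.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def latest_per_customer(batch: Iterable[Row]) -> List[Row]:
    """Keep only the row with the greatest update_time for each customer_id."""
    latest: Dict[Any, Row] = {}
    for row in batch:
        key = row["customer_id"]
        if key not in latest or row["update_time"] > latest[key]["update_time"]:
            latest[key] = row
    return list(latest.values())


def merge_into(target: Dict[Any, Row], source: Iterable[Row]) -> None:
    """Apply each source row to the target, mimicking the branches of a merge.

    Assumes the source holds at most one row per customer_id.
    """
    for row in source:
        key = row["customer_id"]
        if row["action"] == "delete":
            # Matched (or absent) and flagged as a delete: remove the customer.
            target.pop(key, None)
        else:
            # Insert or update: store the row without the CDC metadata column.
            target[key] = {k: v for k, v in row.items() if k != "action"}


# Hypothetical CDC batch with several changes for the same customer.
batch = [
    {"customer_id": 1, "name": "Ann", "action": "insert", "update_time": 1},
    {"customer_id": 1, "name": "Anna", "action": "update", "update_time": 3},
    {"customer_id": 2, "name": "Bob", "action": "update", "update_time": 2},
    {"customer_id": 2, "name": "Bob", "action": "delete", "update_time": 5},
    {"customer_id": 3, "name": "Cy", "action": "insert", "update_time": 4},
]

# Hypothetical target table that already contains customer 2.
target = {2: {"customer_id": 2, "name": "Bob", "update_time": 0}}

merge_into(target, latest_per_customer(batch))
print(sorted(target))
```

In a real pipeline the first step would typically be a ranking or aggregation over `update_time` per `customer_id`, and the second step the `MERGE INTO` statement itself; the sketch only shows why reducing the batch first leaves exactly one source row per key.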
Author: LeetQuiz Editorial Team
The data engineering team aims to construct a pipeline that processes customer data via a Change Data Capture (CDC) feed from a source system. This CDC feed includes both the data records and metadata, indicating actions like insertions, updates, or deletions, alongside a timestamp column (update_time) that orders these changes. Each record is uniquely identified by a customer_id. Given that a single batch may contain multiple changes for the same customer with different update_time values, the team's goal is to store only the most recent information per customer in a target Delta Lake table. Which solution best fulfills these requirements?
A
Enable Delta Lake's Change Data Feed (CDF) on the target table to automatically merge the received CDC feed
B
Use the dropDuplicates function to remove duplicates by customer_id, then merge the duplicate records into the table
C
Use MERGE INTO with SEQUENCE BY clause on the update_time for ordering how operations should be applied
D
Use MERGE INTO to upsert the most recent entry for each customer_id into the table
E
Use the option mergeSchema when writing the CDC data into the table to automatically merge the changed data with its most recent schema