
Answer-first summary for fast verification
Answer: ```sql MERGE INTO customer_purchases AS target USING ( SELECT customer_id, MAX(purchase_date) AS latest_purchase_date FROM customer_purchases GROUP BY customer_id ) AS source ON target.customer_id = source.customer_id AND target.purchase_date < source.latest_purchase_date WHEN MATCHED THEN DELETE; ```
This solution effectively utilizes the MERGE statement for deduplication within Delta Lake. It involves merging the `customer_purchases` table with a subquery that identifies the latest `purchase_date` for each `customer_id`. The ON clause matches records in the target table with those in the source based on `customer_id` and where the target's `purchase_date` is earlier than the source's `latest_purchase_date`, indicating the presence of newer records. The `WHEN MATCHED THEN DELETE` clause ensures that only the most recent purchase record for each customer remains, thereby deduplicating the table according to the specified criteria.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
You have a Delta Lake table named customer_purchases with duplicate records based on customer_id. Your goal is to deduplicate these records by retaining only the latest purchase record for each customer, determined by purchase_date. How would you use the MERGE statement in Databricks SQL to accomplish this task?
A
MERGE INTO customer_purchases AS target
USING customer_purchases AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.purchase_date THEN DELETE;
MERGE INTO customer_purchases AS target
USING customer_purchases AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.purchase_date THEN DELETE;
B
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, purchase_date
FROM customer_purchases
WHERE purchase_date = (SELECT MAX(purchase_date) FROM customer_purchases GROUP BY customer_id)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.* = source.*
WHEN NOT MATCHED THEN INSERT *;
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, purchase_date
FROM customer_purchases
WHERE purchase_date = (SELECT MAX(purchase_date) FROM customer_purchases GROUP BY customer_id)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.* = source.*
WHEN NOT MATCHED THEN INSERT *;
C
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.latest_purchase_date THEN DELETE
WHEN NOT MATCHED THEN INSERT *;
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.latest_purchase_date THEN DELETE
WHEN NOT MATCHED THEN INSERT *;
D
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
AND target.purchase_date < source.latest_purchase_date
WHEN MATCHED THEN DELETE;
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
AND target.purchase_date < source.latest_purchase_date
WHEN MATCHED THEN DELETE;