
Explanation:
This solution effectively utilizes the MERGE statement for deduplication within Delta Lake. It involves merging the customer_purchases table with a subquery that identifies the latest purchase_date for each customer_id. The ON clause matches records in the target table with those in the source based on customer_id and where the target's purchase_date is earlier than the source's latest_purchase_date, indicating the presence of newer records. The WHEN MATCHED THEN DELETE clause ensures that only the most recent purchase record for each customer remains, thereby deduplicating the table according to the specified criteria.
Ultimate access to all questions.
No comments yet.
You have a Delta Lake table named customer_purchases with duplicate records based on customer_id. Your goal is to deduplicate these records by retaining only the latest purchase record for each customer, determined by purchase_date. How would you use the MERGE statement in Databricks SQL to accomplish this task?
A
MERGE INTO customer_purchases AS target
USING customer_purchases AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.purchase_date THEN DELETE;
MERGE INTO customer_purchases AS target
USING customer_purchases AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.purchase_date THEN DELETE;
B
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, purchase_date
FROM customer_purchases
WHERE purchase_date = (SELECT MAX(purchase_date) FROM customer_purchases GROUP BY customer_id)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.* = source.*
WHEN NOT MATCHED THEN INSERT *;
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, purchase_date
FROM customer_purchases
WHERE purchase_date = (SELECT MAX(purchase_date) FROM customer_purchases GROUP BY customer_id)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.* = source.*
WHEN NOT MATCHED THEN INSERT *;
C
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.latest_purchase_date THEN DELETE
WHEN NOT MATCHED THEN INSERT *;
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.purchase_date < source.latest_purchase_date THEN DELETE
WHEN NOT MATCHED THEN INSERT *;
D
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
AND target.purchase_date < source.latest_purchase_date
WHEN MATCHED THEN DELETE;
MERGE INTO customer_purchases AS target
USING (
SELECT customer_id, MAX(purchase_date) AS latest_purchase_date
FROM customer_purchases
GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
AND target.purchase_date < source.latest_purchase_date
WHEN MATCHED THEN DELETE;