
Answer-first summary for fast verification
Answer: SELECT SUM(cnt) FROM ( SELECT customer_id, COUNT(*) as cnt FROM dataset GROUP BY customer_id HAVING COUNT(*) > 1 )
Option A is the correct answer because it efficiently groups the data by 'customer_id', counts the occurrences of each 'customer_id', and then filters for those with counts greater than 1, which indicates a violation of the uniqueness constraint. It then counts these violating rows, providing the exact number of violations. This approach is both scalable and minimizes compute resources by focusing only on the violating entries. Option B incorrectly counts the total number of unique 'customer_id' values, which does not address the uniqueness violation. Option C, while attempting to identify violations, does so in a less efficient manner by using an IN clause with a subquery, which can be more resource-intensive. Option D is irrelevant as it counts unique combinations of 'customer_id' and 'order_id', not addressing the uniqueness constraint on 'customer_id' alone.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
In a scenario where you are working with a large dataset in Azure Databricks that contains 'customer_id' and 'order_id' columns, your task is to ensure data integrity by validating that each 'customer_id' is unique across all rows. This validation is critical for a downstream reporting process that relies on the uniqueness of 'customer_id' for accurate customer analytics. Given the constraints of minimizing compute resources and ensuring the solution is scalable for datasets of varying sizes, which of the following Spark SQL queries would you use to accurately count the number of rows that violate the uniqueness constraint of 'customer_id'? Choose the best option.
A
SELECT SUM(cnt) FROM ( SELECT customer_id, COUNT() as cnt FROM dataset GROUP BY customer_id HAVING COUNT() > 1 )
B
SELECT COUNT(DISTINCT customer_id) FROM dataset
C
SELECT COUNT() FROM dataset WHERE customer_id IN (SELECT customer_id FROM dataset GROUP BY customer_id HAVING COUNT() > 1)
D
SELECT COUNT(DISTINCT customer_id, order_id) FROM dataset
No comments yet.