
You are working with a large dataset in Azure Databricks that contains 'customer_id' and 'order_id' columns. Your task is to ensure data integrity by validating that each 'customer_id' is unique across all rows. This validation is critical for a downstream reporting process that relies on the uniqueness of 'customer_id' for accurate customer analytics. Given the constraints of minimizing compute resources and keeping the solution scalable for datasets of varying sizes, which of the following Spark SQL queries would you use to accurately count the number of rows that violate the uniqueness constraint on 'customer_id'? Choose the best option.
A
SELECT SUM(cnt) FROM ( SELECT customer_id, COUNT(*) AS cnt FROM dataset GROUP BY customer_id HAVING COUNT(*) > 1 )
B
SELECT COUNT(DISTINCT customer_id) FROM dataset
C
SELECT COUNT(*) FROM dataset WHERE customer_id IN (SELECT customer_id FROM dataset GROUP BY customer_id HAVING COUNT(*) > 1)
D
SELECT COUNT(DISTINCT customer_id, order_id) FROM dataset
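
Below is a minimal PySpark sketch for sanity-checking the duplicate-count pattern that the options above revolve around. It assumes a local Spark session and a hypothetical toy DataFrame registered under the view name `dataset` (matching the table name in the options); the data, app name, and aliases are illustrative, not part of the question.

```python
# A minimal sketch, assuming a local Spark session and toy data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniqueness-check").getOrCreate()

# Toy data: customer_id 1 appears three times, so three rows
# violate the uniqueness constraint on customer_id.
rows = [(1, 100), (1, 101), (1, 102), (2, 200), (3, 300)]
dataset = spark.createDataFrame(rows, ["customer_id", "order_id"])
dataset.createOrReplaceTempView("dataset")

# Aggregate-then-sum pattern: group once on customer_id, keep only
# groups with more than one row, and sum their row counts to get
# the total number of violating rows.
violations = spark.sql("""
    SELECT SUM(cnt) AS violating_rows
    FROM (
        SELECT customer_id, COUNT(*) AS cnt
        FROM dataset
        GROUP BY customer_id
        HAVING COUNT(*) > 1
    ) AS dup_groups
""")
violations.show()  # expect violating_rows = 3
```

As a design note: an `IN`-subquery form generally reads `dataset` twice (once to build the duplicate list, once to filter against it), whereas the aggregate-and-sum form needs a single GROUP BY pass, which is relevant to the question's compute-minimization constraint.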