
Databricks Certified Data Engineer - Associate
In a scenario where you are working with a large dataset in Azure Databricks that contains 'product_name' and 'category' columns, your task is to ensure data integrity by validating that each 'product_name' is associated with only one unique 'category' value. Considering the need for efficiency and accuracy in a production environment, which of the following Spark SQL queries would you use to identify any 'product_name' that violates this uniqueness constraint by being associated with more than one 'category'? Choose the best option.
Explanation:
Option A is the correct choice: it groups the data by 'product_name', counts the distinct 'category' values for each 'product_name', and filters with a HAVING clause to return only those 'product_name' values associated with more than one distinct 'category', which is exactly the set of violations of the uniqueness constraint. Option B fails to identify violations because it only returns the maximum 'category' value per 'product_name'. Option C incorrectly groups by both 'product_name' and 'category', so every group contains exactly one 'category' and violations are never surfaced. Option D attempts to use a subquery to find violations but does not correctly return the offending rows.
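The answer options themselves are not reproduced here, but based on the explanation, the correct query (Option A) follows the standard GROUP BY / HAVING pattern. A minimal sketch in Spark SQL, assuming the data lives in a table named products (a placeholder name, not given in the question):

SELECT product_name,
       COUNT(DISTINCT category) AS category_count
FROM products                          -- 'products' is an assumed table name
GROUP BY product_name                  -- one group per product
HAVING COUNT(DISTINCT category) > 1;   -- keep only products mapped to multiple categories

Any 'product_name' returned by this query is associated with more than one distinct 'category' and therefore violates the uniqueness constraint; an empty result set indicates the data passes the check.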