
Databricks Certified Data Engineer - Associate
In a data engineering project using Databricks, you are working with a DataFrame 'df' that contains customer information with columns 'id', 'name', and 'age'. The project requires you to analyze the data quality by identifying the number of rows with NULL values in any column and the number of rows with complete data (non-NULL values in all columns). Considering the importance of accurate data quality metrics for downstream processing, which of the following Spark SQL queries would you use to efficiently compute these metrics? Choose the best option that provides both counts accurately.
Explanation:
Option C is the correct answer because it computes both metrics accurately in a single pass. It counts the rows containing a NULL in any column with the COUNT_IF function and the condition 'id IS NULL OR name IS NULL OR age IS NULL'. It then counts the rows with complete data directly, using COUNT_IF with the condition 'id IS NOT NULL AND name IS NOT NULL AND age IS NOT NULL'. (Equivalently, the complete-row count could be derived by subtracting the NULL-row count from COUNT(*).) Together, these two aggregates give a comprehensive and efficient picture of data quality for downstream processing.
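The actual answer options are not shown above, so as a hedged sketch, the query described by the explanation would look like the SQL string below (the view name 'df' and the sample rows are assumptions for illustration). The plain-Python helper mirrors the same NULL logic so the counts can be checked without a Spark cluster; in Databricks you would run the SQL directly with spark.sql().

```python
# Spark SQL query matching the explanation: count_if is a built-in
# aggregate in Spark SQL 3.0+ (assumes 'df' is registered as a view).
DATA_QUALITY_SQL = """
SELECT
  count_if(id IS NULL OR name IS NULL OR age IS NULL)             AS rows_with_nulls,
  count_if(id IS NOT NULL AND name IS NOT NULL AND age IS NOT NULL) AS complete_rows,
  count(*)                                                         AS total_rows
FROM df
"""

# Illustrative sample data (hypothetical, not from the question).
rows = [
    {"id": 1, "name": "Ada",  "age": 30},
    {"id": 2, "name": None,   "age": 25},   # NULL name
    {"id": None, "name": "Bo", "age": None}, # NULL id and age
]

def has_null(row, cols=("id", "name", "age")):
    # Mirrors the OR condition in the SQL: true if ANY column is NULL.
    return any(row[c] is None for c in cols)

rows_with_nulls = sum(1 for r in rows if has_null(r))
complete_rows = sum(1 for r in rows if not has_null(r))

print(rows_with_nulls, complete_rows)  # 2 rows have a NULL, 1 row is complete
```

Note that the two counts always sum to COUNT(*), which is a useful sanity check on the metrics.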