
Databricks Certified Data Engineer - Associate
Get started today
Ultimate access to all questions.
In a scenario where you are working with a large dataset in Azure Databricks, you have a DataFrame 'df' with columns 'id', 'name', and 'age'. Your task is to optimize the data processing by filtering out rows where 'age' is NULL and then counting the number of remaining rows. Considering the importance of performance and correctness in a production environment, which of the following Spark SQL queries correctly creates a Common Table Expression (CTE) named 'filtered_data' that filters out rows where 'age' is NULL and then uses the CTE to count the number of rows? Choose the single best option.
In a scenario where you are working with a large dataset in Azure Databricks, you have a DataFrame 'df' with columns 'id', 'name', and 'age'. Your task is to optimize the data processing by filtering out rows where 'age' is NULL and then counting the number of remaining rows. Considering the importance of performance and correctness in a production environment, which of the following Spark SQL queries correctly creates a Common Table Expression (CTE) named 'filtered_data' that filters out rows where 'age' is NULL and then uses the CTE to count the number of rows? Choose the single best option.
Explanation:
The correct answer is A, as it accurately creates a CTE named 'filtered_data' that filters out rows where 'age' is NULL using the 'IS NOT NULL' condition, which is the correct way to check for NULL values in SQL. It then counts the number of rows in the CTE. Option B and C incorrectly use 'EXCEPT' which is not valid syntax for filtering columns in this context. Option D incorrectly checks for the string 'NULL' instead of the NULL value, which would not correctly filter out NULL ages.