
Answer-first summary for fast verification
Answer: SELECT COUNT_IF(quantity IS NULL), COUNT(DISTINCT product) FROM df
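The correct answer is B. COUNT_IF(quantity IS NULL) counts the rows where 'quantity' is NULL, which directly answers the first part of the task. COUNT(DISTINCT product) counts the distinct non-NULL values of 'product' across all rows. It does not itself filter on 'quantity', so it matches the second metric as long as every product appears in at least one row with a non-NULL quantity; a strictly filtered count would need a conditional aggregate such as COUNT(DISTINCT CASE WHEN quantity IS NOT NULL THEN product END), which is not among the options. B computes both values in a single query with no extra filtering or grouping. Option A passes a second argument to COUNT_IF, which accepts a single boolean expression, and the subtraction would in any case return the number of non-NULL rows rather than NULL rows. Option C removes the NULL-quantity rows in its WHERE clause, so its first count is always 0. Option D groups by product and returns one row per product instead of a single pair of totals.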
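To make the difference between options B, C, and D concrete, here is a minimal plain-Python sketch that mimics the aggregate semantics on a small sample. It is not Spark code: the rows, the helper names `count_if` and `count_distinct_product`, and the use of `None` for SQL NULL are all assumptions made for illustration, and the sample is chosen so that every product has at least one non-NULL quantity.

```python
# Hypothetical sample rows (id, product, quantity); None stands in for SQL NULL.
rows = [
    (1, "apple", 10),
    (2, "apple", None),
    (3, "banana", 5),
    (4, "banana", None),
    (5, "cherry", 3),
]


def count_if(rows, predicate):
    """Mimic COUNT_IF(expr): number of rows for which the predicate is true."""
    return sum(1 for r in rows if predicate(r))


def count_distinct_product(rows):
    """Mimic COUNT(DISTINCT product): distinct non-NULL product values."""
    return len({r[1] for r in rows if r[1] is not None})


def quantity_is_null(row):
    return row[2] is None


# Option B: both aggregates are evaluated over every row.
option_b = (count_if(rows, quantity_is_null), count_distinct_product(rows))

# Option C: the WHERE clause drops NULL-quantity rows before aggregating.
kept = [r for r in rows if r[2] is not None]
option_c = (count_if(kept, quantity_is_null), count_distinct_product(kept))

# Option D: GROUP BY product produces one result row per product.
groups = {}
for r in rows:
    groups.setdefault(r[1], []).append(r)
option_d = [
    (count_if(g, quantity_is_null), count_distinct_product(g))
    for g in groups.values()
]

print(option_b)  # (2, 3)
print(option_c)  # (0, 3)
print(option_d)  # [(1, 1), (1, 1), (0, 1)]
```

On this sample only option B reports the two missing quantities; option C always reports 0 for that count, and option D returns three rows rather than one.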
Author: LeetQuiz Editorial Team
In a scenario where you are analyzing sales data stored in a DataFrame 'df' with columns 'id', 'product', and 'quantity', you are tasked with identifying two key metrics: the number of rows where 'quantity' is NULL (indicating missing data) and the number of unique products that have non-NULL quantities (to understand product diversity). Considering the need for accuracy and efficiency in your query, which of the following Spark SQL queries correctly accomplishes this task? Choose the best option from the four provided.
A
SELECT COUNT(*) - COUNT_IF(quantity IS NULL, TRUE), COUNT(DISTINCT product) FROM df
B
SELECT COUNT_IF(quantity IS NULL), COUNT(DISTINCT product) FROM df
C
SELECT COUNT_IF(quantity IS NULL), COUNT(DISTINCT product) FROM df WHERE quantity IS NOT NULL
D
SELECT COUNT_IF(quantity IS NULL), COUNT(DISTINCT product) FROM df GROUP BY product