
Answer-first summary for fast verification
Answer: SELECT COUNT(*), COUNT_IF(amount IS NULL), COUNT(DISTINCT product_id) FROM df
Option C is correct. COUNT(*) counts every row, and therefore every transaction; COUNT_IF(amount IS NULL) counts only the rows where 'amount' is NULL, which directly measures the missing data; and COUNT(DISTINCT product_id) counts the unique product IDs, which measures product diversity. Each of the other options gets one aggregate wrong: Option A passes a second argument to COUNT_IF, which accepts only a single boolean expression; Option B uses COUNT(amount), which counts the non-NULL 'amount' values rather than the NULL ones; and Option D uses COUNT(product_id), which counts every non-NULL product ID, duplicates included, rather than the distinct values.
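To see how the aggregates in the options differ, the sketch below runs them side by side on a small inline table. The five sample rows and the column aliases are hypothetical and chosen only for illustration. The inline VALUES table stands in for 'df'; in a real notebook the DataFrame would first have to be exposed to Spark SQL, for example with df.createOrReplaceTempView("df").

```sql
-- Hypothetical five-row sample standing in for the 'df' view:
-- two rows have a NULL amount, and three distinct product IDs appear.
SELECT
  COUNT(*)                   AS total_transactions,  -- 5: every row
  COUNT_IF(amount IS NULL)   AS null_amounts,        -- 2: rows with a missing amount (Option C)
  COUNT(DISTINCT product_id) AS distinct_products,   -- 3: p1, p2, p3 (Option C)
  COUNT(amount)              AS non_null_amounts,    -- 3: what Option B counts instead
  COUNT(product_id)          AS product_id_rows      -- 5: what Option D counts instead
FROM VALUES
  (1, 'p1', 10),
  (2, 'p1', NULL),
  (3, 'p2', 25),
  (4, 'p3', NULL),
  (5, 'p3', 40)
  AS df(transaction_id, product_id, amount);
```

On this sample, Option B would report 3 where the task asks for 2, and Option D would report 5 where the task asks for 3, so the differences between the options show up directly in the results.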
Author: LeetQuiz Editorial Team
In a data engineering project using Azure Databricks, you are working with a DataFrame named 'df' that contains transaction data. The DataFrame includes columns for 'transaction_id', 'product_id', and 'amount'. Your task is to analyze this data to understand transaction volumes, data quality issues, and product diversity. Specifically, you need to write a Spark SQL query that calculates: (1) the total number of transactions, (2) the number of transactions with NULL values in the 'amount' column (indicating missing data), and (3) the number of unique 'product_id' values (to assess product diversity). Considering the importance of accurate data analysis for decision-making, which of the following queries correctly accomplishes these tasks? Choose the best option from the four provided.
A
SELECT COUNT(*), COUNT_IF(amount IS NULL, TRUE), COUNT(DISTINCT product_id) FROM df
B
SELECT COUNT(*), COUNT(amount), COUNT(DISTINCT product_id) FROM df
C
SELECT COUNT(*), COUNT_IF(amount IS NULL), COUNT(DISTINCT product_id) FROM df
D
SELECT COUNT(*), COUNT_IF(amount IS NULL), COUNT(product_id) FROM df