
Answer-first summary for fast verification
Answer: SELECT customer_id, SUM(transaction_amount) as total_amount FROM dataset GROUP BY customer_id
Option A is correct: it groups rows by 'customer_id' and computes SUM(transaction_amount) for each group, so the result has exactly one row per unique customer with that customer's total. A single GROUP BY aggregation also scales well, because Spark can combine partial sums within each partition before shuffling data by key. Option B returns distinct ('customer_id', 'transaction_amount') pairs and never computes a total; it also collapses repeated identical amounts for the same customer. Option C groups by both 'customer_id' and 'transaction_amount', which yields the same distinct pairs as Option B: a customer can still appear on several rows, and no total is computed. Option D has no GROUP BY clause, so 'customer_id' is selected alongside an aggregate without being grouped, and the query does not produce per-customer totals. Even with GROUP BY customer_id added, HAVING COUNT(*) = 1 would keep only customers with exactly one transaction and drop every customer with more.
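The difference between Options A, B, and C can be checked on a handful of rows. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL (the three queries are plain standard SQL), the sample rows are made up, and the ORDER BY clauses are added purely to make the printed output deterministic.

```python
import sqlite3

# Hypothetical sample data: customer 1 has three transactions,
# two of which have the same amount; customer 2 has one.
rows = [(1, 10.0), (1, 10.0), (1, 5.0), (2, 7.5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (customer_id INTEGER, transaction_amount REAL)")
conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)

# Option A: one row per customer with the summed amount.
option_a = conn.execute(
    "SELECT customer_id, SUM(transaction_amount) AS total_amount "
    "FROM dataset GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# Option B: distinct (customer_id, transaction_amount) pairs, no total.
option_b = conn.execute(
    "SELECT DISTINCT customer_id, transaction_amount FROM dataset "
    "ORDER BY customer_id, transaction_amount"
).fetchall()

# Option C: grouping by both columns gives the same pairs as Option B.
option_c = conn.execute(
    "SELECT customer_id, transaction_amount FROM dataset "
    "GROUP BY customer_id, transaction_amount "
    "ORDER BY customer_id, transaction_amount"
).fetchall()

print(option_a)  # [(1, 25.0), (2, 7.5)]
print(option_b)  # [(1, 5.0), (1, 10.0), (2, 7.5)]
print(option_c)  # [(1, 5.0), (1, 10.0), (2, 7.5)]
```

Only Option A returns a single row for customer 1, and only Option A counts the repeated 10.0 transaction twice, which is what a spending total requires.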
Author: LeetQuiz Editorial Team
In a scenario where you are analyzing customer transaction data to identify unique customers and their total spending, you are given a dataset with 'customer_id' and 'transaction_amount' columns. Your task is to write a Spark SQL query that creates a new table containing only the rows with unique 'customer_id' values and the total transaction amount for each unique 'customer_id'. Consider the following constraints: the solution must be scalable to handle large datasets efficiently, and it should minimize computational costs. Which of the following queries best meets these requirements? (Choose one option)
A
SELECT customer_id, SUM(transaction_amount) as total_amount FROM dataset GROUP BY customer_id
B
SELECT DISTINCT customer_id, transaction_amount FROM dataset
C
SELECT customer_id, transaction_amount FROM dataset GROUP BY customer_id, transaction_amount
D
SELECT customer_id, SUM(transaction_amount) as total_amount FROM dataset HAVING COUNT(*) = 1
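The question asks for a new table, while every option is a bare SELECT. One way to materialize the aggregated result is to wrap the Option A query in a CREATE TABLE ... AS SELECT statement. The sketch below again runs on Python's built-in sqlite3 as a stand-in for Spark SQL; the table name customer_totals and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (customer_id INTEGER, transaction_amount REAL)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Materialize the Option A aggregation into a new table.
# customer_totals is an assumed name, not one given in the question.
conn.execute(
    "CREATE TABLE customer_totals AS "
    "SELECT customer_id, SUM(transaction_amount) AS total_amount "
    "FROM dataset GROUP BY customer_id"
)

totals = conn.execute(
    "SELECT customer_id, total_amount FROM customer_totals ORDER BY customer_id"
).fetchall()
print(totals)  # [(1, 15.0), (2, 7.5)]
```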