In a data engineering project, you are tasked with transforming a large DataFrame 'df' with the schema (date: string, category: string, value: double) from long format to wide format. The wide format should have one column per category ('electronics', 'groceries', 'clothing'), each holding the sum of 'value' for that category. The solution must handle large datasets efficiently, preserve data accuracy, and minimize resource usage to keep costs low. Which of the following Spark SQL queries best accomplishes this task? Choose the best option.

Simulated

SELECT date, SUM(value) AS total_value FROM df GROUP BY date

3.8%

SELECT date, SUM(value) AS electronics, SUM(value) AS groceries, SUM(value) AS clothing FROM df PIVOT (SUM(value) FOR category IN ('electronics', 'groceries', 'clothing'))

49.9%

SELECT date, electronics, groceries, clothing FROM df PIVOT (value FOR category IN ('electronics', 'groceries', 'clothing'))

21.2%

SELECT date, MAX(value) AS electronics, MAX(value) AS groceries, MAX(value) AS clothing FROM df PIVOT (value FOR category IN ('electronics', 'groceries', 'clothing'))

14.3%

SELECT date, SUM(value) AS electronics, SUM(value) AS groceries, SUM(value) AS clothing FROM df GROUP BY date, category

10.8%
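
For reference, here is a minimal sketch of the general shape of a Spark SQL PIVOT, assuming 'df' is registered as a table or temporary view with the schema given in the question. It is not taken from any of the options above. The PIVOT clause takes an aggregate expression, groups implicitly by the columns that are neither aggregated nor pivoted (here, date), and names the output columns after the values in the IN list.

```sql
-- Sketch only: assumes `df` is queryable as a table or temp view
-- with columns (date STRING, category STRING, value DOUBLE).
SELECT *
FROM (
  -- Select only the needed columns so the implicit grouping is by date alone.
  SELECT date, category, value
  FROM df
)
PIVOT (
  SUM(value)                 -- aggregate applied per (date, category) pair
  FOR category IN ('electronics', 'groceries', 'clothing')
);
-- Result columns: date, electronics, groceries, clothing.
-- A date with no rows for a given category gets NULL in that column.
```

Listing the category values explicitly in the IN clause is what yields a fixed set of output columns, and restricting the subquery to the three needed columns keeps any extra columns from silently becoming grouping keys.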

Databricks Certified Data Engineer - Associate
