As a Data Engineer at a retail company that uses Databricks for data processing, you are tasked with optimizing a Spark SQL query that calculates the total cost for each product in a DataFrame 'df' with columns 'id', 'product', 'quantity', and 'price'. The company mandates a 10% discount for 'Electronics' and a 20% discount for 'Groceries'. The solution must be cost-effective and scalable for large datasets while adhering to the company's policy of minimizing computational overhead. Given these requirements, which of the following queries would you implement? (Choose two correct options.)
Explanation:
Option B is correct because it applies the specified discounts with a single CASE/WHEN expression, computing the discounted total for every row in one pass over the data without any aggregation, which keeps it scalable and cost-effective. Option E is also correct: it applies the discounts only to the relevant products and combines those rows with the non-discounted rows using UNION ALL, which can be more efficient for large datasets or when the discounts apply to a small subset of products. Options A and C do not apply the correct discount rates, and Option D introduces an unnecessary SUM with GROUP BY, which forces an extra aggregation step and can degrade performance on large datasets.
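Since the answer options themselves are not reproduced above, the two Spark SQL sketches below only illustrate the patterns the explanation refers to. The column names come from the question; the temporary view name 'df', the output column 'total_cost', and the exact projections are assumptions made for illustration, not the literal text of Options B and E.

    -- Pattern attributed to Option B: a per-row CASE/WHEN, one pass over the data, no aggregation
    SELECT id,
           product,
           quantity,
           price,
           quantity * price *
             CASE
               WHEN product = 'Electronics' THEN 0.9   -- 10% discount
               WHEN product = 'Groceries'   THEN 0.8   -- 20% discount
               ELSE 1.0                                -- no discount
             END AS total_cost
    FROM df;

    -- Pattern attributed to Option E: discount only the affected rows, then recombine with UNION ALL
    SELECT id, product, quantity, price, quantity * price * 0.9 AS total_cost
    FROM df WHERE product = 'Electronics'
    UNION ALL
    SELECT id, product, quantity, price, quantity * price * 0.8 AS total_cost
    FROM df WHERE product = 'Groceries'
    UNION ALL
    SELECT id, product, quantity, price, quantity * price AS total_cost
    FROM df WHERE product NOT IN ('Electronics', 'Groceries');

Both forms avoid SUM/GROUP BY entirely; the UNION ALL variant can benefit from predicate pushdown when the discounted categories are a small fraction of the rows, while the CASE/WHEN variant scans the data exactly once.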