
Databricks Certified Data Engineer - Associate
In a scenario where you are tasked with analyzing employee salary data to optimize departmental budgets, you have a DataFrame 'df' with columns 'employee_id', 'department_id', and 'salary'. Your goal is to calculate the average salary for each department, ensuring that departments with a NULL 'department_id' are excluded from the analysis to maintain data integrity. Considering the importance of accurate data for decision-making, which of the following Spark SQL queries would you use to achieve this task efficiently? Choose the best option from the four provided below.
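The four answer options are not reproduced in this dump. Based on the descriptions in the explanation below, they likely resembled the following queries (a reconstruction, not the verbatim exam text):

```sql
-- A (correct): filter out NULL department_id rows before grouping
SELECT department_id, AVG(salary)
FROM df
WHERE department_id IS NOT NULL
GROUP BY department_id;

-- B: filters only after grouping, via HAVING
SELECT department_id, AVG(salary)
FROM df
GROUP BY department_id
HAVING department_id IS NOT NULL;

-- C: attempts to exclude the NULL group with EXCEPT
SELECT department_id, AVG(salary)
FROM df
GROUP BY department_id
EXCEPT
SELECT department_id, AVG(salary)
FROM df
WHERE department_id IS NULL
GROUP BY department_id;

-- D: lacks the GROUP BY clause, so no per-department results
SELECT department_id, AVG(salary)
FROM df
WHERE department_id IS NOT NULL;
```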
Explanation:
The correct answer is A: it uses a WHERE clause to filter out rows with a NULL 'department_id' before grouping by 'department_id' and computing each department's average salary with AVG(salary). Filtering before aggregation means only relevant rows are processed, which preserves data integrity and avoids wasted work. Option B uses HAVING, which filters only after the grouping and aggregation have already run, so the NULL group is computed and then discarded. Option C attempts to exclude NULLs with EXCEPT, which is not the idiomatic way to remove them and requires an extra set operation. Option D lacks a GROUP BY clause, so it does not produce per-department results at all.
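The behavior described above can be checked locally. The following sketch uses SQLite (not Spark) with made-up sample rows to show that filtering NULL 'department_id' values in the WHERE clause removes them before the AVG aggregation runs:

```python
import sqlite3

# Hypothetical sample data mirroring the question's DataFrame 'df':
# employee_id, department_id (nullable), salary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (employee_id INTEGER, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 10, 200.0), (3, 20, 300.0), (4, None, 999.0)],
)

# Option A's pattern: exclude NULL department_id rows *before* grouping,
# so the NULL group never reaches the aggregation step.
rows = conn.execute(
    """
    SELECT department_id, AVG(salary) AS avg_salary
    FROM df
    WHERE department_id IS NOT NULL
    GROUP BY department_id
    ORDER BY department_id
    """
).fetchall()

print(rows)  # employee 4, with a NULL department_id, is excluded
```

Running this prints `[(10, 150.0), (20, 300.0)]`: the row with a NULL department never contributes to any average.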