
Answer-first summary for fast verification
Answer: SELECT store_id, employees, FILTER (employees, i -> i.years_exp > 5) AS exp_employees FROM stores;
## Explanation **Option A** is correct because it properly uses the FILTER higher-order function with the correct syntax: - `FILTER(employees, i -> i.years_exp > 5)` correctly applies the filter to the `employees` array column - The lambda expression `i -> i.years_exp > 5` properly references each element in the array as `i` and accesses the `years_exp` field - The result is assigned to the new column `exp_employees` **Why other options are incorrect:** **Option B**: Incorrect syntax - `FILTER(exp_employees, years_exp > 5)` tries to filter a column that doesn't exist yet (`exp_employees` is the output column, not the input). Also missing the lambda expression. **Option C**: Incorrect syntax - `FILTER(employees, years_exp > 5)` is missing the lambda expression needed to reference array elements. **Option D**: Incorrect approach - Uses CASE statement which doesn't properly handle array filtering. CASE WHEN would need to be used with array functions like TRANSFORM or FILTER, not directly like this. **Option E**: Incorrect - Similar to Option B, tries to filter `exp_employees` which doesn't exist yet as an input column. **Key concepts:** - The FILTER higher-order function in Spark SQL filters an array using a lambda expression - The syntax is `FILTER(array, lambda_expression)` where the lambda expression returns a boolean - The lambda parameter (e.g., `i`) represents each element in the array - For struct elements in arrays, you need to access fields using dot notation (e.g., `i.years_exp`)
Author: Keng Suppaseth
Ultimate access to all questions.
No comments yet.
A data engineer needs to apply custom logic to identify employees with more than 5 years of experience in array column e_employees in table stores. The custom logic should create a new column exp_employees that is an array of all of the employees with more than 5 years of experience for each row. In order to apply this custom logic at scale, the data engineer wants to use the FILTER higher-order function. Which of the following code blocks successfully completes this task?
A
SELECT store_id, employees, FILTER (employees, i -> i.years_exp > 5) AS exp_employees FROM stores;
B
SELECT store_id, employees, FILTER (exp_employees, years_exp > 5) AS exp_employees FROM stores;
C
SELECT store_id, employees, FILTER (employees, years_exp > 5) AS exp_employees FROM stores;
D
SELECT store_id, employees, CASE WHEN employees.years_exp > 5 THEN employees ELSE NULL END AS exp_employees FROM stores;
E
SELECT store_id, employees, FILTER (exp_employees, i -> i.years_exp > 5) AS exp_employees FROM stores;