Databricks Certified Data Engineer - Associate

Databricks Certified Data Engineer - Associate

Get started today

Ultimate access to all questions.


When working with a complex struct data type in SQL that contains employee and sales data, how would you calculate the total sales made by all employees? The table is defined with batchId INT, performance ARRAY<STRUCT>, and insertDate TIMESTAMP. Here's a sample of the performance column data:

[
  { "employeeId": 1234, "sales": 10000 },
  { "employeeId": 3232, "sales": 30000 }
]

Given the following SQL to create the table sales:

create or replace table sales as
select 1 as batchId,
from_json('[{ "employeeId":1234,"sales" : 10000 },{ "employeeId":3232,"sales" : 30000 }]',
'ARRAY<STRUCT>') as performance,
current_timestamp() as insertDate
union all
select 2 as batchId,
from_json('[{ "employeeId":1235,"sales" : 10500 },{ "employeeId":3233,"sales" : 32000 }]',
'ARRAY<STRUCT>') as performance,
current_timestamp() as insertDate




Explanation:

The correct approach is to use the aggregate function combined with flatten and collect_list to sum up all sales values from the performance column. This method correctly accesses the sales values within the struct data type using the . notation (performance.sales).

Note: Option C is incorrect because it attempts to use SLICE on the performance column, which is not applicable here. Additionally, the use of : (as in performance:sales) is only appropriate for JSON data, not struct data types. Understanding the distinction between JSON and struct data types is crucial for the exam.

Alternative solutions include using reduce instead of aggregate or combining explode and sum functions for the same result. For example:

select reduce(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y) as total_sales from sales

Or:

with cte as (
  select explode(flatten(collect_list(performance.sales))) sales from sales
)
select sum(sales) from cte