
Ultimate access to all questions.
A junior data engineer has ingested a JSON file into a table raw_table with the following schema:
cart_id STRING,
items ARRAY<item_id:STRING>
cart_id STRING,
items ARRAY<item_id:STRING>
The junior data engineer would like to unnest the items column in raw_table to result in a new table with the following schema:
cart_id STRING,
item_id STRING
cart_id STRING,
item_id STRING
Which of the following commands should the junior data engineer run to complete this task?
A
SELECT cart_id, filter(items) AS item_id FROM raw_table;
B
SELECT cart_id, flatten(items) AS item_id FROM raw_table;
C
SELECT cart_id, reduce(items) AS item_id FROM raw_table;
D
SELECT cart_id, explode(items) AS item_id FROM raw_table;
E
SELECT cart_id, slice(items) AS item_id FROM raw_table;
Explanation:
The correct answer is D because:
explode() function is specifically designed to unnest array columns in Spark SQL/Databricks. It takes an array column as input and returns a new row for each element in the array, while duplicating the other column values.
How explode() works:
raw_table, explode(items) creates multiple rowsitems array becomes a separate rowcart_id is repeated for each item in the arraycart_id STRING, items ARRAY<item_id:STRING> to cart_id STRING, item_id STRINGWhy other options are incorrect:
filter(): Used to filter elements from an array based on a condition, not to unnest arraysflatten(): Combines multiple arrays into a single array, but doesn't create separate rowsreduce(): Aggregates array elements using a binary function, not for unnestingslice(): Returns a subset of an array, not for unnestingExample transformation:
If raw_table has:
cart_id: "cart1", items: ["item1", "item2", "item3"]
cart_id: "cart2", items: ["item4"]
cart_id: "cart1", items: ["item1", "item2", "item3"]
cart_id: "cart2", items: ["item4"]
After SELECT cart_id, explode(items) AS item_id FROM raw_table;:
cart_id: "cart1", item_id: "item1"
cart_id: "cart1", item_id: "item2"
cart_id: "cart1", item_id: "item3"
cart_id: "cart2", item_id: "item4"
cart_id: "cart1", item_id: "item1"
cart_id: "cart1", item_id: "item2"
cart_id: "cart1", item_id: "item3"
cart_id: "cart2", item_id: "item4"
This is a common pattern in data transformation where you need to flatten nested array structures into a tabular format for further processing or analysis.