
Ultimate access to all questions.
A data engineer has ingested a JSON file into a table raw_table with the following schema:
transaction_id STRING,
payload ARRAY<customer_id:STRING, date:TIMESTAMP, store_id:STRING>
transaction_id STRING,
payload ARRAY<customer_id:STRING, date:TIMESTAMP, store_id:STRING>
The data engineer wants to efficiently extract the date of each transaction into a table with the following schema:
transaction_id STRING,
date TIMESTAMP
transaction_id STRING,
date TIMESTAMP
Which of the following commands should the data engineer run to complete this task?
A
SELECT transaction_id, explode(payload) FROM raw_table;
B
SELECT transaction_id, payload.date FROM raw_table;
C
SELECT transaction_id, date FROM raw_table;
Explanation:
The correct answer is B because:
Schema Analysis: The payload field is an ARRAY<STRUCT<customer_id:STRING, date:TIMESTAMP, store_id:STRING>>. This means payload is an array of structs, where each struct has fields customer_id, date, and store_id.
Accessing Struct Fields: In Spark SQL, when you have an array of structs, you can access struct fields using dot notation. The expression payload.date extracts the date field from each struct in the array.
Why Option A is incorrect: EXPLODE(payload) would create a new row for each element in the array, which would produce multiple rows per transaction_id. This is not what's needed - we want one row per transaction_id with the date extracted.
Why Option C is incorrect: Simply selecting date without referencing the payload array would not work because date is not a top-level column in the schema. It's nested inside the payload array.
How it works: When you use payload.date on an array of structs, Spark SQL extracts the date field from each struct in the array. Since there's presumably only one struct per transaction in this scenario, it returns the single date value.
Example:
If payload contains [{"customer_id": "123", "date": "2023-01-01", "store_id": "store1"}], then:
payload.date returns ["2023-01-01"] (an array with one timestamp)Alternative approach: If there were multiple structs in the array and you wanted the first date, you could use payload[0].date or element_at(payload, 1).date.