
Databricks Certified Data Engineer - Professional
An hourly batch job ingests data files from a cloud object storage container, with each batch containing all records generated by the source system within a given hour. The job is delayed sufficiently to account for late-arriving data. The schema includes: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT, where user_id is the unique key.
All new records are loaded into the account_history table, which retains the complete history in the same schema. The account_current table is a Type 1 table storing only the latest record per user_id.
Given millions of user accounts and tens of thousands of hourly records, what is the most efficient method to update the account_current table during each batch job?
Explanation:
The question asks for the most efficient way to keep the Type 1 account_current table in sync as each hourly batch arrives. Option C is correct because it restricts processing to the most recent hour's records (so only new data is touched), groups by user_id and keeps the row with the maximum last_updated (so exactly one, latest record survives per user), and uses a MERGE operation to update or insert those rows into account_current. Because only the small hourly batch is processed rather than the full history, this approach scales well to millions of accounts. Options A, B, D, and E are incorrect: A applies streaming semantics to a batch job, B inefficiently scans the entire account_history table on every run, D's versioning approach does not reliably produce correct user-level updates, and E deduplicates on the wrong field (username instead of user_id).
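The answer options themselves are not reproduced here, but the approach the explanation attributes to option C can be sketched in PySpark with Delta Lake. This is a minimal illustration, not the exam's exact code: the source path and the hourly_batch DataFrame name are hypothetical, and it uses a window function for per-user deduplication rather than a literal GROUP BY with max(last_updated), which achieves the same result.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical source: the most recent hour's records only,
# so processing is limited to the new batch, not the full history.
hourly_batch = spark.read.format("delta").load("/mnt/raw/accounts/latest_hour")

# Keep exactly one row per user_id: the one with the greatest last_updated.
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())
latest = (
    hourly_batch
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# MERGE the deduplicated batch into the Type 1 account_current table:
# update users that already exist, insert users seen for the first time.
target = DeltaTable.forName(spark, "account_current")
(
    target.alias("t")
    .merge(latest.alias("s"), "t.user_id = s.user_id")
    .whenMatchedUpdateAll(condition="s.last_updated > t.last_updated")
    .whenNotMatchedInsertAll()
    .execute()
)
```

The guard condition on whenMatchedUpdateAll is a defensive touch: it prevents a late-arriving, older record from overwriting a newer value already in account_current.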