Databricks Certified Data Engineer - Professional

An hourly batch job ingests data files from a cloud object storage container, with each batch containing all records generated by the source system within a given hour. The job is delayed sufficiently to account for late-arriving data. The schema includes: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT, where user_id is the unique key.

All new records are loaded into the account_history table, which retains the complete history in the same schema. The account_current table is a Type 1 table storing only the latest record per user_id.

Given millions of user accounts and tens of thousands of hourly records, what is the most efficient method to update the account_current table during each batch job?
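For reference, here is a minimal PySpark sketch of the two tables described above, using the schema from the question. The CREATE TABLE form and the USING DELTA clause are assumptions, since the question shows no DDL; on Databricks these would typically be managed Delta tables.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Append-only history table: every record from every hourly batch is retained.
spark.sql("""
    CREATE TABLE IF NOT EXISTS account_history (
        user_id BIGINT, username STRING, user_utc STRING,
        user_region STRING, last_login BIGINT, auto_pay BOOLEAN,
        last_updated BIGINT
    ) USING DELTA
""")

# Type 1 table: exactly one row per user_id, holding only the latest record.
spark.sql("""
    CREATE TABLE IF NOT EXISTS account_current (
        user_id BIGINT, username STRING, user_utc STRING,
        user_region STRING, last_login BIGINT, auto_pay BOOLEAN,
        last_updated BIGINT
    ) USING DELTA
""")
```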





Explanation:

The task is to keep the Type 1 account_current table in sync with the latest record per user while processing as little data as possible. Option C is correct because it filters account_history down to only the most recent hour of data (so each run touches just the new batch), groups by user_id and keeps the row with the max last_updated value (guaranteeing exactly one, latest record per user), and then uses a MERGE statement to update existing rows and insert new ones in account_current. With millions of user accounts but only tens of thousands of new records per hour, restricting the merge source to the new batch is what makes this approach efficient.

The remaining options are incorrect: A applies Structured Streaming to what is a scheduled batch job; B scans the entire account_history table on every run; D relies on table versioning, which tracks changes at the table level and may not correctly isolate per-user updates; and E deduplicates on username instead of the unique key user_id.
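As a concrete illustration of the pattern in option C, here is a minimal PySpark sketch. The batch_start and batch_end epoch values bounding the hour being processed are hypothetical, and a window function stands in for the GROUP BY on max(last_updated), since both select the single latest row per user_id.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Hypothetical epoch-second bounds for the hour being processed.
batch_start, batch_end = 1700000000, 1700003600

# Step 1: restrict account_history to only the most recent hour's records.
new_records = spark.table("account_history").filter(
    (F.col("last_updated") >= batch_start) & (F.col("last_updated") < batch_end)
)

# Step 2: keep a single row per user_id -- the one with the max last_updated.
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())
latest = (
    new_records
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Step 3: Type 1 upsert -- overwrite the existing row for a matched user_id,
# insert a new row otherwise.
(
    DeltaTable.forName(spark, "account_current")
    .alias("t")
    .merge(latest.alias("s"), "t.user_id = s.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

An optional refinement is to pass a condition such as "s.last_updated > t.last_updated" to the matched clause, so a late-arriving record can never overwrite a newer one already in account_current.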