An hourly batch job ingests data files from a cloud object storage container, with each batch containing all records generated by the source system within a given hour. The job is delayed sufficiently to account for late-arriving data. The schema includes: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT, where user_id is the unique key.
All new records are loaded into the account_history table, which retains the complete history in the same schema. The account_current table is a Type 1 table storing only the latest record per user_id.
Given millions of user accounts and tens of thousands of hourly records, what is the most efficient method to update the account_current table during each batch job?
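To make the Type 1 semantics in the scenario concrete, here is a minimal plain-Python sketch of "keep only the latest record per user_id". It is an illustration under stated assumptions, not the reference answer to the question: the function names `latest_per_user` and `upsert_current` are hypothetical, the in-memory dict stands in for the account_current table, and last_updated is assumed to be the ordering column.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def latest_per_user(batch: Iterable[Record]) -> Dict[int, Record]:
    """Reduce an hourly batch to one record per user_id (the newest one)."""
    latest: Dict[int, Record] = {}
    for rec in batch:
        uid = rec["user_id"]
        # Keep the record with the greatest last_updated for each key.
        if uid not in latest or rec["last_updated"] > latest[uid]["last_updated"]:
            latest[uid] = rec
    return latest


def upsert_current(current: Dict[int, Record], batch: Iterable[Record]) -> Dict[int, Record]:
    """Apply a batch to the 'current' table: overwrite existing keys, insert new ones."""
    for uid, rec in latest_per_user(batch).items():
        existing = current.get(uid)
        # Type 1: no history is kept here; the newer record simply replaces the old.
        if existing is None or rec["last_updated"] >= existing["last_updated"]:
            current[uid] = rec
    return current


# Hypothetical sample data: a dict keyed by user_id stands in for account_current.
current = {1: {"user_id": 1, "username": "a", "last_updated": 100}}
batch = [
    {"user_id": 1, "username": "a2", "last_updated": 150},
    {"user_id": 1, "username": "a1", "last_updated": 120},
    {"user_id": 2, "username": "b", "last_updated": 130},
]
upsert_current(current, batch)
```

Note that the sketch only touches the keys that appear in the batch, so the work scales with the tens of thousands of hourly records rather than with the millions of accounts already stored.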