
Answer-first summary for fast verification
Answer: Filter the `account_history` table for records within the most recent hourly window, deduplicate using `MAX(last_login)` per `user_id`, and execute a `MERGE` statement to upsert these changes into the `account_current` table.
The most efficient approach is **Option B**. Since the source system guarantees no late-arriving data, we can significantly reduce compute costs by filtering the `account_history` table for only the latest hourly window instead of scanning millions of rows.

* **Efficiency**: By filtering for the current hour, you only process tens of thousands of rows rather than the entire history (millions of rows).
* **Deduplication**: Using an aggregation like `MAX(last_login)` within that filtered slice ensures that if a user has multiple updates within the hour, only the final state is merged into the target.
* **UPSERT semantics**: Delta Lake's `MERGE` command performs updates and inserts in a single ACID-compliant operation.

**Why the other options are incorrect:**

* **Option A**: Scanning the entire history every hour is highly inefficient and increasingly expensive as the dataset grows.
* **Option C**: While Auto Loader is useful for ingestion, it adds unnecessary architectural complexity here, since the files are already landed and processed by a batch job.
* **Option D**: Merging on `username` is incorrect because the requirement is to update by `user_id`. Merge keys should be immutable; human-entered strings like usernames can change or collide.
* **Option E**: Delta version history or Change Data Feed (CDF) would be over-engineering for this scenario; a simple timestamp filter on the existing batch load is more direct and easier to maintain.
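In production this would be a Spark SQL `MERGE INTO account_current USING ... ON target.user_id = source.user_id` against Delta tables. As a minimal, self-contained sketch of the same logic, the snippet below simulates the Option B pipeline in plain Python: filter `account_history` to the hourly window, keep the row with `MAX(last_login)` per `user_id`, then upsert into an in-memory `account_current`. The dict-based tables and the `hourly_upsert` helper are hypothetical stand-ins, not the Databricks API.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-ins for the Delta tables.
account_history = [
    {"user_id": 1, "username": "alice", "last_login": datetime(2024, 1, 1, 9, 5)},
    {"user_id": 1, "username": "alice", "last_login": datetime(2024, 1, 1, 9, 50)},
    {"user_id": 2, "username": "bob",   "last_login": datetime(2024, 1, 1, 9, 30)},
    {"user_id": 3, "username": "carol", "last_login": datetime(2024, 1, 1, 8, 10)},  # previous hour
]
account_current = {
    2: {"username": "bob", "last_login": datetime(2023, 12, 31, 9, 0)},
}

def hourly_upsert(history, current, window_start, window_end):
    """Filter to the latest hourly window, keep MAX(last_login) per user_id,
    then upsert (update-or-insert) the result into the current snapshot."""
    latest = {}
    for row in history:
        if not (window_start <= row["last_login"] < window_end):
            continue  # outside the window; the source guarantees no late data
        prev = latest.get(row["user_id"])
        if prev is None or row["last_login"] > prev["last_login"]:
            latest[row["user_id"]] = row  # dedupe: keep the final state per user
    for uid, row in latest.items():
        # MERGE semantics: update the matched key or insert a new one.
        current[uid] = {"username": row["username"], "last_login": row["last_login"]}
    return current

start = datetime(2024, 1, 1, 9, 0)
hourly_upsert(account_history, account_current, start, start + timedelta(hours=1))
```

After the run, user 1 holds only the 9:50 state (the two in-window rows were deduplicated), user 2's stale row is updated, and user 3 is untouched because their record falls outside the hourly window.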
Author: LeetQuiz Editorial Team
A data engineering team manages an hourly batch job that ingests source system records into a Lakehouse. Given that the ingestion process ensures no data arrives late, and the target table contains millions of accounts but only receives tens of thousands of updates per hour, which approach is the most efficient for updating the `account_current` table with the latest value for each unique `user_id`?
**A.** Perform a full overwrite of the `account_current` table during each batch by querying the entire `account_history` table, grouping by `user_id`, and selecting the maximum `last_updated` value.

**B.** Filter the `account_history` table for records within the most recent hourly window, deduplicate using `MAX(last_login)` per `user_id`, and execute a `MERGE` statement to upsert these changes into the `account_current` table.

**C.** Utilize Auto Loader to monitor the `account_history` directory and configure a Structured Streaming trigger to process newly detected files into the `account_current` table using a batch update.

**D.** Filter the `account_history` table for the most recent hour's records, ensure deduplication based on `username`, and then execute a `MERGE` statement to update the `account_current` table.

**E.** Use Delta Lake version history to identify the difference between the two most recent versions of the `account_history` table and apply those specific changes to the `account_current` table.