
Answer-first summary for fast verification
Answer: B and C. (B) Configure a Structured Streaming read against the reviews_raw table using the trigger-once execution mode to process new records as a batch job. (C) Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
The goal is to minimize compute costs by processing only the data appended to `reviews_raw` each night. Options B and C achieve this efficiently (both are sketched in code below):

- **B**: Structured Streaming with trigger-once execution leverages Delta Lake's transaction log to incrementally process only the records added since the last run, avoiding full table scans.
- **C**: Comparing the latest version of the Delta table against the version one prior accesses the newly appended data directly, ensuring only the delta is processed.

The other options are less efficient: A and D require scanning large portions of the table, while E reprocesses all data. Thus, B and C are optimal for cost reduction.
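To make the two recommended approaches concrete, here are minimal PySpark sketches. They assume a Databricks-style environment where `reviews_raw` is a Delta table; the target table name (`reviews_validated`) and the checkpoint path are illustrative, not given in the question.

The option B sketch reads `reviews_raw` as a stream and drains all available new records in a single batch-style run:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("propagate-reviews").getOrCreate()

# Incremental read: the streaming checkpoint plus the Delta transaction log
# track which records were already processed, so only the fresh batch runs.
query = (
    spark.readStream
        .table("reviews_raw")
        .writeStream
        .option("checkpointLocation", "/checkpoints/reviews_validated")  # illustrative path
        .trigger(availableNow=True)  # use .trigger(once=True) on older Spark runtimes
        .toTable("reviews_validated")  # hypothetical downstream table
)
query.awaitTermination()  # run to completion like a batch job, then stop
```

The option C sketch diffs the latest table version against the one before it; because the nightly ingestion job is append-only, that difference is exactly the newly added records:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is predefined

# Find the most recent version of the table from its Delta history.
latest = DeltaTable.forName(spark, "reviews_raw").history(1).collect()[0]["version"]

current_df = spark.sql(f"SELECT * FROM reviews_raw VERSION AS OF {latest}")
prior_df = spark.sql(f"SELECT * FROM reviews_raw VERSION AS OF {latest - 1}")

# exceptAll keeps duplicate rows intact, so nothing is silently dropped here;
# deduplication happens in the downstream validation step.
(current_df.exceptAll(prior_df)
    .write.mode("append")
    .saveAsTable("reviews_validated"))  # hypothetical downstream table
```

Note the operational trade-off: the checkpoint in the B sketch makes reruns idempotent, while the C sketch relies on exactly one new table version being committed between runs.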
Author: LeetQuiz Editorial Team
A nightly batch job ingests data files from a cloud object storage container with a nested directory structure (YYYY/MM/DD), where each date's data contains records processed by the source system on that date (some records may be delayed due to moderator approval). Each record represents a user review with the schema:
user_id STRING,
review_id BIGINT,
product_id BIGINT,
review_timestamp TIMESTAMP,
review_text STRING
The ingestion job appends the previous day's data to a target table reviews_raw (same schema as source). The next pipeline step performs a batch write to propagate only new records from reviews_raw to a deduplicated, validated, and enriched table.
Which solution minimizes compute costs for propagating this batch of data?
A
Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.
B
Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.
C
Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
D
Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
E
Reprocess all records in reviews_raw and overwrite the next table in the pipeline.