
A nightly batch job ingests data files from a cloud object storage container with a nested YYYY/MM/DD directory structure. Each date's directory holds the records the source system processed on that date; some records arrive late because they await moderator approval. Each record represents a user review with the schema:
user_id STRING,
review_id BIGINT,
product_id BIGINT,
review_timestamp TIMESTAMP,
review_text STRING
The ingestion job appends the previous day's data to a target table reviews_raw (same schema as source). The next pipeline step performs a batch write to propagate only new records from reviews_raw to a deduplicated, validated, and enriched table.
Which solution minimizes compute costs for propagating this batch of data?
A
Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key (user_id, review_id, product_id, review_timestamp).
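For concreteness, a minimal PySpark sketch of this merge, assuming a hypothetical downstream table named reviews_clean and the ambient spark session a Databricks notebook provides:

    from delta.tables import DeltaTable

    # "reviews_clean" is an assumed name for the deduplicated downstream table.
    target = DeltaTable.forName(spark, "reviews_clean")
    source = spark.read.table("reviews_raw")

    (target.alias("t")
        .merge(
            source.alias("s"),
            "t.user_id = s.user_id AND t.review_id = s.review_id "
            "AND t.product_id = s.product_id AND t.review_timestamp = s.review_timestamp")
        .whenNotMatchedInsertAll()  # insert unmatched rows only; never update
        .execute())

Note that the merge joins every row of reviews_raw against the full target table on each run, which is where this option spends compute.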
B
Configure a Structured Streaming read against the reviews_raw table using the trigger-once execution mode to process new records as a batch job.
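A sketch of this approach, assuming a checkpoint path and downstream table name of our choosing; the checkpoint lets the stream read only records added since the last run:

    # The checkpoint tracks which Delta table versions have been processed,
    # so each nightly run reads only newly appended records.
    (spark.readStream
        .table("reviews_raw")
        .writeStream
        .option("checkpointLocation", "/checkpoints/reviews_clean")  # assumed path
        .trigger(once=True)  # run once as a batch, then stop
        .toTable("reviews_clean"))  # assumed downstream table name

On Spark 3.3+ and recent Databricks runtimes, trigger(availableNow=True) is the preferred replacement for once=True.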
C
Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
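What this amounts to, sketched with Delta time travel; the version numbers are illustrative, and reviews_clean is again an assumed target name:

    # Diff the two most recent table versions via time travel.
    current = spark.sql("SELECT * FROM reviews_raw VERSION AS OF 42")
    previous = spark.sql("SELECT * FROM reviews_raw VERSION AS OF 41")

    # exceptAll preserves duplicate rows, unlike except/subtract.
    new_rows = current.exceptAll(previous)
    new_rows.write.mode("append").saveAsTable("reviews_clean")

Both versions must be scanned in full to compute the difference, and the result is fragile if any other transaction commits to reviews_raw between ingestion runs.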
D
Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
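As a sketch, with the same assumed target table:

    # Append only rows whose review_timestamp is within the last 48 hours.
    recent = spark.read.table("reviews_raw").filter(
        "review_timestamp >= current_timestamp() - INTERVAL 48 HOURS")

    recent.write.mode("append").saveAsTable("reviews_clean")

Because late-approved records can carry a review_timestamp older than 48 hours, this filter can both miss late rows and re-append rows it already propagated the night before.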
E
Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
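And the full-reprocess variant, sketched the same way:

    # Rebuild the downstream table from scratch on every run.
    (spark.read.table("reviews_raw")
        .write.mode("overwrite")
        .saveAsTable("reviews_clean"))

Here compute grows with the full history of reviews_raw rather than with the nightly increment.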