A nightly batch job ingests data files from a cloud object storage container with a nested directory structure (YYYY/MM/DD). Each date's directory contains records processed by the source system on that date; some records may arrive late because of moderator-approval delays. Each record represents a user review with the schema:
user_id STRING,
review_id BIGINT,
product_id BIGINT,
review_timestamp TIMESTAMP,
review_text STRING
The ingestion job appends the previous day's data to a target table reviews_raw (same schema as the source). The next pipeline step performs a batch write that propagates only new records from reviews_raw to a deduplicated, validated, and enriched table.
Which solution minimizes compute costs for propagating this batch of data?
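To make the "propagate only new records" requirement concrete, here is a minimal, framework-agnostic sketch in plain Python. It assumes a set of already-propagated review_ids is persisted between runs (analogous to a streaming checkpoint), so each batch run touches only records it has not seen before instead of rescanning and rewriting the whole table. The function and variable names are illustrative, not part of the question; in a real Spark pipeline this role is typically played by an incremental read with checkpointed state rather than an in-memory set.

```python
def propagate_new_records(reviews_raw, target, seen_ids):
    """Append only records whose review_id has not been propagated yet.

    reviews_raw: list of dicts, each with a "review_id" key (source table)
    target:      list of dicts (the deduplicated downstream table)
    seen_ids:    set of review_ids already propagated (persisted checkpoint
                 state in a real pipeline; hypothetical here)
    Returns the number of records propagated in this batch.
    """
    new_records = [r for r in reviews_raw if r["review_id"] not in seen_ids]
    for record in new_records:
        # Validation and enrichment of the record would happen here.
        target.append(record)
        seen_ids.add(record["review_id"])
    return len(new_records)
```

Because only the delta is read and written, the per-run cost scales with the size of the new batch rather than with the full history of reviews_raw, which is the property the question asks you to optimize for.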