Ultimate access to all questions.
The data engineering team aims to construct a pipeline that processes customer data via a Change Data Capture (CDC) feed from a source system. This CDC feed includes both the data records and metadata, indicating actions like insertions, updates, or deletions, alongside a timestamp column (update_time
) that orders these changes. Each record is uniquely identified by a customer_id
. Given that a single batch may contain multiple changes for the same customer with different update_time
values, the team's goal is to store only the most recent information per customer in a target Delta Lake table. Which solution best fulfills these requirements?