Ultimate access to all questions.
An upstream system generates change data capture (CDC) logs that are stored in a cloud object storage directory. Each log entry specifies the change type (insert, update, or delete) along with the post-change values for all fields. The source table has a primary key field named pk_id
.
The data governance team requires a complete historical record of all valid values from the source system for auditing. For analytical purposes, only the latest value for each record needs to be retained. The Databricks job ingests these records hourly, but individual records may have undergone multiple changes within that hour.
What solution fulfills these requirements?