Ultimate access to all questions.
An upstream system generates change data capture (CDC) logs that are stored in a cloud object storage directory. Each log entry specifies the change type (insert, update, or delete) along with the post-change field values. The source table has a primary key field named pk_id
.
For auditing, the data governance team requires a complete history of all valid values from the source system. For analytics, only the latest value for each record must be retained. The Databricks job ingests these records hourly, but individual records may have undergone multiple changes within that hour.
Which solution fulfills these requirements?