
An upstream system generates change data capture (CDC) logs that are stored in a cloud object storage directory. Each log entry specifies the change type (insert, update, or delete) along with the post-change field values. The source table has a primary key field named pk_id.
For auditing, the data governance team requires a complete history of all valid values from the source system. For analytics, only the latest value for each record must be retained. The Databricks job ingests these records hourly, but individual records may have undergone multiple changes within that hour.
Which solution fulfills these requirements?
A
Iterate through an ordered set of changes to the table, applying each change (insert, update, or delete) in turn, along with its timestamp and post-change values, to recreate the current state of the table.
B
Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.
C
Deduplicate records in each batch by pk_id and overwrite the target table.
D
Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
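The two requirements in the stem, a complete audit history plus a latest-value view per pk_id, can be sketched independently of any Delta Lake machinery as an append-only log plus a replay in timestamp order. This is a minimal in-memory Python sketch, not a production pattern; `ChangeRecord`, its single `value` field, and the integer timestamps are hypothetical simplifications:

```python
from dataclasses import dataclass

# Hypothetical CDC log entry: change type plus post-change values,
# keyed by pk_id; ts orders changes within the hourly batch.
@dataclass
class ChangeRecord:
    pk_id: int
    change_type: str   # "insert", "update", or "delete"
    ts: int            # ordering within the batch
    value: str         # post-change field value (simplified to one field)

def apply_batch(history, current, batch):
    """Append every record to the full history (audit requirement),
    then replay the batch in timestamp order so only the latest
    valid state per pk_id survives (analytics requirement)."""
    history.extend(batch)                      # complete audit trail
    for rec in sorted(batch, key=lambda r: r.ts):
        if rec.change_type == "delete":
            current.pop(rec.pk_id, None)       # remove deleted records
        else:                                  # insert or update
            current[rec.pk_id] = rec.value
    return history, current

history, current = [], {}
batch = [
    ChangeRecord(1, "insert", 1, "a"),
    ChangeRecord(1, "update", 2, "b"),  # same record changed twice in one hour
    ChangeRecord(2, "insert", 3, "c"),
    ChangeRecord(2, "delete", 4, "c"),
]
history, current = apply_batch(history, current, batch)
print(len(history))  # 4: every change retained for auditing
print(current)       # {1: 'b'}: only the latest value per pk_id
```

Note how a batch-level deduplicate-and-overwrite (option C) would discard the intermediate update to pk_id 1 from the history, which is why the audit and analytics requirements call for two distinct representations of the data.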