Databricks Certified Data Engineer - Professional

Get started today

Ultimate access to all questions.

An upstream system generates change data capture (CDC) logs that are stored in a cloud object storage directory. Each log entry specifies the change type (insert, update, or delete) along with the post-change field values. The source table has a primary key field named `pk_id`.

For auditing, the data governance team requires a complete history of all valid values from the source system. For analytics, only the latest value for each record must be retained. The Databricks job ingests these records hourly, but individual records may have undergone multiple changes within that hour.

Which solution fulfills these requirements?

Exam-Like

Iterate through an ordered set of changes to the table, applying each in turn to create the current state of the table, (insert, update, delete), timestamp of change, and the values.

11.8%

Use merge into to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.

Comments

Loading comments...