
Answer-first summary for fast verification
Answer: Ingest all incoming CDC logs into an append-only bronze table to maintain a full history, then use `MERGE INTO` to upsert the most recent version of each `pk_id` into a silver table to represent the current state.
To satisfy both the requirement for a historical audit trail and a 'current state' analytical table, the **Medallion Architecture** pattern is ideal. * **Bronze Layer (History):** By ingesting raw CDC logs into an append-only Bronze table, every change is preserved. This acts as an immutable ledger for audit and replay purposes. * **Silver Layer (Current State):** Using `MERGE INTO` to upsert records from the Bronze table into the Silver table allows you to collapse multiple changes for the same `pk_id` into a single 'latest' row. This provides an efficient, authoritative view for analytics. **Why other options are incorrect:** * **Merging into Bronze:** Overwriting or updating records in the Bronze layer destroys the raw audit trail required by governance. * **Delta Lake CDF:** CDF is used to track changes *made within* Delta tables for downstream consumption; it does not handle the ingestion or transformation of external CDC logs automatically. * **Table Versioning:** Delta versioning is coarse-grained (per transaction/write) and is not intended to serve as a high-frequency row-level SCD (Slowly Changing Dimension) audit log. * **Separate tables per ID:** This approach is not scalable, creates significant management overhead, and results in poor query performance.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
An upstream system provides Change Data Capture (CDC) logs representing INSERT, UPDATE, and DELETE operations for a source table with a primary key (pk_id). The data governance team requires a solution that preserves a full historical audit trail while also maintaining a table containing only the most recent records for analytical queries. Data is ingested hourly, and individual records may undergo multiple changes within a single ingestion window.
Which solution effectively meets these requirements while following Databricks Medallion Architecture best practices?
A
Ingest all incoming CDC logs into an append-only bronze table to maintain a full history, then use MERGE INTO to upsert the most recent version of each pk_id into a silver table to represent the current state.
B
Utilize MERGE INTO directly on a bronze table to handle incoming CDC operations for each pk_id, then propagate these changes downstream to minimize storage requirements.
C
Implement Delta Lake Change Data Feed (CDF) to automatically ingest and process external CDC logs directly from the source system into the Lakehouse.
D
Sequence the incoming changes chronologically and apply them to a target table, relying on Delta Lake’s internal table versioning and time travel to serve as the required audit log.
E
Create separate history tables for every unique pk_id and resolve the current state by performing a UNION operation of the most recent entries during query execution.