
Answer-first summary for fast verification
Answer: Ingesting all raw data and metadata into a **Bronze** Delta table to create a permanent, replayable history of the data state.
The most effective way to prevent this scenario in a Databricks environment is to implement a **Bronze table** as part of a Medallion Architecture.

* **Why A is correct:** By capturing the raw, unprocessed data (including all fields and metadata) directly from the source into a Delta table, you create a permanent, timestamped history. Because Delta Lake uses immutable Parquet files and a transaction log, the data remains available long after the source system's (Kafka's) retention period expires, so engineers can re-process or 'replay' it to recover missing fields at any time.
* **Why B is incorrect:** Schema evolution allows new columns to be added to a table's schema during ingestion, but it cannot retroactively retrieve data that was never written to the table in the first place.
* **Why C is incorrect:** Delta Lake stores exactly what the Spark DataFrame provides. It does not automatically discover and ingest fields unless the ingestion logic is specifically configured to capture the raw payload (e.g., as a JSON blob).
* **Why D is incorrect:** Structured Streaming checkpoints and the Delta transaction log track the state of the current pipeline and table transactions; they do not archive data from the source producer once it has been purged from Kafka.
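The Bronze-replay pattern behind answer A can be sketched in plain Python, with no Spark cluster required. This is a minimal illustration, not a real pipeline: `ingest_bronze`, `build_silver`, and the in-memory `bronze` list are hypothetical stand-ins for a Kafka-to-Delta Structured Streaming job. The key point it demonstrates is that because Bronze lands the untouched payload, a field the original Silver logic ignored can still be recovered later by replaying Bronze:

```python
import json

# Simplified sketch of the Bronze-replay idea. In a real Databricks
# pipeline, `bronze` would be an append-only Bronze Delta table fed by
# Structured Streaming from Kafka; these names are illustrative only.

bronze = []  # stands in for the Bronze Delta table

def ingest_bronze(raw_record: bytes, kafka_offset: int) -> None:
    """Land the untouched payload plus source metadata; parse nothing."""
    bronze.append({"offset": kafka_offset, "value": raw_record.decode("utf-8")})

def build_silver(fields):
    """(Re)build a curated view by re-parsing the stored raw payloads."""
    return [
        {f: json.loads(row["value"]).get(f) for f in fields}
        for row in bronze
    ]

# Day 1: the Silver logic only extracts `user_id`; `country` is ignored.
ingest_bronze(b'{"user_id": 1, "country": "DE"}', kafka_offset=100)
silver_v1 = build_silver(["user_id"])

# Months later, long after Kafka's 7-day retention has purged the record:
# replay Bronze to recover the omitted `country` field.
silver_v2 = build_silver(["user_id", "country"])
```

Had only the parsed Silver rows been stored, `country` would be unrecoverable once Kafka expired the source records; storing the raw payload in Bronze is what makes the replay possible.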
Author: LeetQuiz Editorial Team
A data engineer discovers that a critical field from a Kafka source was omitted during the ingestion process into a Delta Lake pipeline. While the field was present in the Kafka source, it is now missing from downstream storage. Kafka has a seven-day retention period, but the pipeline has been running for three months.
How can Delta Lake be used to prevent this type of permanent data loss in the future?
A
Ingesting all raw data and metadata into a Bronze Delta table to create a permanent, replayable history of the data state.
B
Utilizing Delta Lake schema evolution to retroactively compute values for newly added fields based on original source properties.
C
Configuring Delta Lake to automatically include every field from the source data in the ingestion layer by default.
D
Relying on the Delta log and Structured Streaming checkpoints, which maintain a complete history of the Kafka producer’s records.