
Databricks Certified Data Engineer - Professional
A data pipeline uses Structured Streaming to ingest data from Apache Kafka into Delta Lake, storing data in a bronze table that includes Kafka's timestamp, key, and value. After three months, the data engineering team observes intermittent latency issues during peak hours.
A senior data engineer modifies the Delta table's schema and ingestion logic to include the current timestamp (recorded by Spark), Kafka topic, and partition. The team intends to use these additional metadata fields to troubleshoot the transient delays.
What limitation will the team encounter when diagnosing this issue?
Explanation:
The team added new metadata fields (current timestamp, Kafka topic, and partition) to the Delta table's schema. However, Structured Streaming processes data incrementally: the new fields are computed only for records ingested after the schema change. Records ingested before the update will not have these fields populated (they remain NULL under Delta schema evolution), so they cannot be used to diagnose the past latency issues. This aligns with option A. Options B, C, and D are incorrect: Spark can capture Kafka metadata (B), Delta Lake allows schema evolution without requiring default values (C), and schema updates do not invalidate the transaction log (D).
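For context, a minimal sketch of the revised ingestion logic described in the scenario is shown below, assuming a Databricks/PySpark environment. The broker address, topic name, table path, and checkpoint path are placeholders, not values from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, col

spark = SparkSession.builder.getOrCreate()

# Read the Kafka stream; broker and topic are hypothetical placeholders.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# The Kafka source exposes key, value, timestamp, topic, and partition columns.
# The added metadata (ingest_time, topic, partition) is computed only for
# records processed after this version of the job is deployed.
bronze = raw.select(
    col("key"),
    col("value"),
    col("timestamp").alias("kafka_timestamp"),
    col("topic"),
    col("partition"),
    current_timestamp().alias("ingest_time"),
)

# Write to the bronze Delta table; mergeSchema lets Delta evolve the table
# schema to include the new columns. Rows written before the change keep
# NULLs for the added fields.
query = (
    bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")
    .option("mergeSchema", "true")
    .outputMode("append")
    .start("/mnt/bronze/events")
)
```

This illustrates why the limitation in option A arises: the new columns are populated by the streaming query going forward, so the three months of already-ingested bronze data carry no values for them.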