Databricks Certified Data Engineer - Professional

A junior data engineer wants to use Delta Lake's Change Data Feed feature to build a Type 1 table that captures all historical valid values for every row in a bronze table (created with delta.enableChangeDataFeed = true). They intend to run the following code as a daily job:

from pyspark.sql.functions import col
(spark.read.format("delta")
 .option("readChangeFeed", "true")
 .option("startingVersion", 0)  # always reads the change feed from the very first version
 .table("bronze")
 .filter(col("_change_type").isin(["update_postimage", "insert"]))
 .write
 .mode("append")  # appends without checking for rows written by earlier runs
 .saveAsTable("bronze_history_type1"))

Which statement describes the outcome and behavior of executing this code repeatedly?




Explanation:

Because startingVersion is fixed at 0, every run reads the entire change data feed from the beginning of the bronze table's history. All historical insert and update_postimage records are therefore appended to the target table on each execution. Since the job never records the last version it processed, each run reprocesses the full history and duplicate rows accumulate in the target table.
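
For reference, one common way to avoid reprocessing the full history is to read the change feed incrementally with Structured Streaming, which records the last processed version in its checkpoint. The following is a minimal sketch assuming a runtime that supports Delta CDF streaming reads and the availableNow trigger; the checkpoint path is illustrative, not part of the question:

from pyspark.sql.functions import col

(spark.readStream.format("delta")
 .option("readChangeFeed", "true")
 .option("startingVersion", 0)  # honored only on the first run; afterwards the checkpoint tracks progress
 .table("bronze")
 .filter(col("_change_type").isin(["update_postimage", "insert"]))
 .writeStream
 .option("checkpointLocation", "/tmp/checkpoints/bronze_history_type1")  # illustrative path
 .trigger(availableNow=True)  # process all new changes, then stop, so the job still fits a daily schedule
 .toTable("bronze_history_type1"))

Run as a daily job, each execution appends only the changes committed since the previous run, so the target table no longer accumulates duplicates.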