
Answer-first summary for fast verification
Answer: Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
The code reads the entire change data feed starting from version 0 each time it runs, due to `startingVersion=0`. This causes all historical insert and update_postimage records to be appended to the target table on every execution. Since the job does not track the last processed version, duplicates accumulate in the target table as each run reprocesses the entire history.
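The duplication is easy to see with a minimal pure-Python sketch (hypothetical stand-ins, no Spark or Delta required) that models the change data feed as a list of versioned change records and the daily job as a read-from-version-0 append:

```python
# Minimal sketch (assumption: plain-Python stand-ins, not real Delta/Spark APIs)
# modeling why re-reading the change feed from version 0 duplicates rows.

# The bronze table's change data feed: every change ever recorded, by version.
change_feed = [
    {"id": 1, "value": "a", "_change_type": "insert",           "_commit_version": 1},
    {"id": 2, "value": "b", "_change_type": "insert",           "_commit_version": 1},
    {"id": 1, "value": "a", "_change_type": "update_preimage",  "_commit_version": 2},
    {"id": 1, "value": "c", "_change_type": "update_postimage", "_commit_version": 2},
]

target = []  # bronze_history_type1, an append-only table

def run_daily_job():
    """Mimics the job: read the ENTIRE feed from version 0, filter, append."""
    batch = [r for r in change_feed
             if r["_change_type"] in ("insert", "update_postimage")]
    target.extend(batch)  # mode("append") never deduplicates

run_daily_job()
print(len(target))  # 3 rows: two inserts plus one update_postimage
run_daily_job()
print(len(target))  # 6 rows: the same history appended a second time
```

Because nothing records which versions were already consumed, the second run re-reads versions 1 and 2 and appends identical copies of all three qualifying records.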
Author: LeetQuiz Editorial Team
A junior data engineer wants to use Delta Lake's Change Data Feed feature to build a Type 1 table that captures all historical valid values for every row in a bronze table (created with delta.enableChangeDataFeed = true). They intend to run the following code as a daily job:
from pyspark.sql.functions import col

(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin(["update_postimage", "insert"]))
    .write
    .mode("append")
    .table("bronze_history_type1"))
What describes the outcome and behavior of executing this query repeatedly?
A. Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
B. Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
C. Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
D. Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
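For contrast, the behavior option C describes requires the job to remember the last version it processed. In Spark this bookkeeping is typically delegated to Structured Streaming (`spark.readStream` with `option("readChangeFeed", "true")` and a `checkpointLocation`), which persists the read position between runs. A minimal pure-Python sketch of the same idea (hypothetical stand-ins, not real Spark APIs):

```python
# Sketch of the version-tracking the broken job lacks (hypothetical,
# plain-Python stand-ins; in Spark a streaming checkpoint provides
# this bookkeeping automatically).

change_feed = [
    {"id": 1, "value": "a", "_change_type": "insert",           "_commit_version": 1},
    {"id": 1, "value": "c", "_change_type": "update_postimage", "_commit_version": 2},
]

target = []
last_processed_version = 0  # persisted between runs (the "checkpoint")

def run_incremental_job():
    global last_processed_version
    # Only read changes committed AFTER the last processed version.
    batch = [r for r in change_feed
             if r["_commit_version"] > last_processed_version
             and r["_change_type"] in ("insert", "update_postimage")]
    target.extend(batch)
    if batch:
        last_processed_version = max(r["_commit_version"] for r in batch)

run_incremental_job()
print(len(target))  # 2: both changes processed once
run_incremental_job()
print(len(target))  # still 2: nothing new, so no duplicates
```

A batch run with `.trigger(availableNow=True)` on the streaming query would give the same run-once-a-day shape while the checkpoint prevents reprocessing, which is why option C describes the intended, but not the actual, behavior of the code shown.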