
A junior data engineer wants to use Delta Lake's Change Data Feed feature to build a Type 1 table that captures every value that has ever been valid for each row in a bronze table (created with the table property delta.enableChangeDataFeed = true). They plan to run the following code as a daily job:
from pyspark.sql.functions import col

(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin(["update_postimage", "insert"]))
    .write
    .mode("append")
    .saveAsTable("bronze_history_type1"))
Which statement describes the outcome of executing this query repeatedly?
A
Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
B
Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
C
Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
D
Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
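For reference, the job above is a batch read that always starts at version 0, so every run re-reads the entire change feed and appends it again. A pattern that processes only new changes on each run would typically use Structured Streaming, where a checkpoint records which change-feed versions have already been consumed. The sketch below is illustrative only, not part of the question; the checkpoint path is hypothetical, and the availableNow trigger assumes Spark 3.3+ or a recent Databricks runtime.

from pyspark.sql.functions import col

# Hypothetical checkpoint location; the checkpoint tracks which change-feed
# versions have already been processed, so each run appends only new changes.
checkpoint_path = "/tmp/checkpoints/bronze_history_type1"

(spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin(["update_postimage", "insert"]))
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(availableNow=True)  # process all pending changes, then stop
    .toTable("bronze_history_type1"))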