
Explanation:
The correct answer is D. The provided code causes significant data duplication due to how Delta Lake batch reads and write modes interact:
Read side: With .option("startingVersion", 0) in a batch read, the job reads the Change Data Feed (CDF) from the table's very first commit every time it executes. A batch read keeps no record of what has already been processed.
Write side: .write.mode("append") simply adds the resulting DataFrame to the target table. It performs no deduplication and no check for existing keys.
Result: On Day 1, the job writes all history. On Day 2, it reads all history again (including everything from Day 1) and appends it a second time. After N runs, bronze_history holds N copies of the earliest changes, so redundant records pile up with every execution.
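The Day 1 / Day 2 behavior can be mimicked with a small plain-Python model that tracks only row counts. This is a sketch, not Spark code: the function names and the daily change counts are made up for illustration.

```python
def rows_with_full_reread(daily_new_changes):
    """Row count of the target after each run when every run
    re-reads the whole change feed from version 0 and appends it."""
    target_rows = 0
    feed_rows = 0
    counts = []
    for new_changes in daily_new_changes:
        feed_rows += new_changes   # the feed from version 0 now includes today's commits
        target_rows += feed_rows   # append mode adds the entire feed again
        counts.append(target_rows)
    return counts


def rows_with_tracked_progress(daily_new_changes):
    """Row count of the target after each run when only
    not-yet-processed changes are appended."""
    target_rows = 0
    counts = []
    for new_changes in daily_new_changes:
        target_rows += new_changes  # only the new changes are written
        counts.append(target_rows)
    return counts


# Hypothetical workload: 100 change rows before the first run,
# then 20 and 30 new change rows on the following days.
daily = [100, 20, 30]
print(rows_with_full_reread(daily))       # [100, 220, 370]
print(rows_with_tracked_progress(daily))  # [100, 120, 150]
```

In this toy workload the target holds 370 rows after three runs, although only 150 distinct change rows exist.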
To fix this: The engineer should use Structured Streaming with a checkpoint to automatically track the last processed version, or utilize a MERGE statement to ensure idempotency.
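A minimal sketch of the Structured Streaming option, assuming Spark 3.3 or later with Delta Lake and an existing SparkSession passed in as spark. The function name and the checkpoint path are hypothetical. The startingVersion option is honored only when the stream first starts; later runs resume from the checkpoint.

```python
def append_new_changes(spark, checkpoint_path):
    """Append only change rows that earlier runs have not processed.

    Progress is stored in the checkpoint at checkpoint_path
    (a hypothetical location chosen by the caller).
    """
    changes = (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 0)  # applies to the first run only
        .table("bronze")
        .filter("_change_type IN ('insert', 'update_postimage')")
    )
    query = (
        changes.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .toTable("bronze_history")
    )
    query.awaitTermination()
    return query
```

Scheduled daily, each call such as append_new_changes(spark, "/tmp/checkpoints/bronze_history") would pick up where the previous one stopped. This keeps the append-only history; collapsing it to one row per key would still call for a MERGE on the table's key column.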
A data engineer is attempting to construct a Type 1 historical table by capturing all changes from a bronze Delta table where delta.enableChangeDataFeed is set to true. They implement the following PySpark code as a daily scheduled task:
from pyspark.sql.functions import col

(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin(("update_postimage", "insert")))
    .write.mode("append")
    .table("bronze_history")
)
How will repeatedly running this query impact the target table (bronze_history) over time?
A. Only records inserted or updated since the last execution will be appended, successfully achieving the intended incremental result.
B. The target table will be entirely overwritten with the full history on each run, resulting in a clean but non-cumulative state.
C. Each execution will merge updates into the target table, overwriting prior values with matching keys to maintain the Type 1 structure.
D. Every run will append the entire history of inserts/updates from version 0, leading to massive data duplication as the same records are added repeatedly.
E. Each execution will calculate differences between the current version and the previous version, creating a delta-based historical log.