
### Answer

**D** — Every run will append the entire history of inserts/updates from version 0, leading to massive data duplication as the same records are added repeatedly.
### Explanation

The correct answer is **D**. The provided code causes significant data duplication because of how Delta Lake batch reads and write modes interact:

1. **Static starting version**: `.option("startingVersion", 0)` in a batch read instructs the job to read the Change Data Feed (CDF) from the table's very first commit on every execution. A batch read keeps no record of what has already been processed.
2. **Append mode**: `.write.mode("append")` simply adds the resulting DataFrame to the target table. It performs no deduplication and no check for existing keys.

**Result**: On day 1 the job writes the full history. On day 2 it reads the full history again, including everything already written on day 1, and appends it a second time. The `bronze_history` table therefore accumulates the same records over and over, growing rapidly with redundant rows on every run.

**To fix this**: Use **Structured Streaming** with a checkpoint so the last processed version is tracked automatically, or use a **MERGE** statement to make the write idempotent.
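The failure mode can be illustrated without a Spark cluster. The sketch below is plain Python, with a hypothetical in-memory `cdf` list standing in for the table's Change Data Feed; it contrasts a batch job that always re-reads from version 0 against a reader that tracks the last processed version, the way a streaming checkpoint does:

```python
# Toy model: `cdf` stands in for the Change Data Feed, one entry per commit.
cdf = []                 # list of (version, record) change events
target_naive = []        # bronze_history written by the buggy batch job
target_checkpointed = [] # bronze_history written with a tracked version
checkpoint = 0           # last version already processed

def run_naive_batch():
    # Re-reads the entire feed from version 0 and appends everything.
    target_naive.extend(rec for _, rec in cdf)

def run_checkpointed_batch():
    global checkpoint
    # Reads only commits after the checkpoint, then advances it.
    new = [(v, rec) for v, rec in cdf if v > checkpoint]
    target_checkpointed.extend(rec for _, rec in new)
    if new:
        checkpoint = max(v for v, _ in new)

# Day 1: two inserts land in bronze, then both jobs run.
cdf += [(1, "a"), (2, "b")]
run_naive_batch()
run_checkpointed_batch()

# Day 2: one more insert, then both jobs run again.
cdf += [(3, "c")]
run_naive_batch()
run_checkpointed_batch()

print(target_naive)         # ['a', 'b', 'a', 'b', 'c']  ("a" and "b" duplicated)
print(target_checkpointed)  # ['a', 'b', 'c']            (each record once)
```

In the real fix, Structured Streaming's `checkpointLocation` plays the role of the `checkpoint` variable: a `spark.readStream` with `readChangeFeed` records the last consumed version in the checkpoint directory, so each trigger processes only new commits.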
Author: LeetQuiz Editorial Team
A data engineer is attempting to construct a Type 1 historical table by capturing all changes from a bronze Delta table where `delta.enableChangeDataFeed` is set to `true`. They implement the following PySpark code as a daily scheduled job:
```python
from pyspark.sql.functions import col

(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin(("update_postimage", "insert")))
    .write.mode("append")
    .saveAsTable("bronze_history")
)
```
How will repeatedly running this query impact the target table (bronze_history) over time?
**A.** Only records inserted or updated since the last execution will be appended, successfully achieving the intended incremental result.

**B.** The target table will be entirely overwritten with the full history on each run, resulting in a clean but non-cumulative state.

**C.** Each execution will merge updates into the target table, overwriting prior values with matching keys to maintain the Type 1 structure.

**D.** Every run will append the entire history of inserts/updates from version 0, leading to massive data duplication as the same records are added repeatedly.

**E.** Each execution will calculate differences between the current version and the previous version, creating a delta-based historical log.