
A junior data engineer is using the following code to de-duplicate raw streaming data and insert it into a target Delta table.
spark.readStream.table("orders_raw") \
    .dropDuplicates(["order_id", "order_timestamp"]) \
    .writeStream \
    .option("checkpointLocation", "dbfs:/checkpoints") \
    .toTable("orders_unique")
However, a senior data engineer points out that this approach may not be sufficient to ensure distinct records in the target table, especially with late-arriving duplicates. What could explain the senior engineer's concern?

A. Implementing a ranking function is necessary to process only the most recent records.
B. Watermarking is required to track state information within a specific time window, accommodating delayed records.
C. The solution lacks a window function to apply deduplication across non-overlapping intervals.
D. New records must be deduplicated against existing data in the target table to prevent duplicates.
E. Additional information is necessary to accurately address the issue.
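
For context, a minimal sketch of the watermarked variant that option B alludes to might look like the following. The 30-minute lateness threshold is an illustrative assumption, and order_timestamp is assumed to be an event-time column of timestamp type:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare a watermark on the event-time column so Spark can bound the
# deduplication state while still admitting records that arrive up to
# 30 minutes late (threshold chosen here purely for illustration).
deduped = (
    spark.readStream.table("orders_raw")
    .withWatermark("order_timestamp", "30 minutes")
    .dropDuplicates(["order_id", "order_timestamp"])
)

# Write the de-duplicated stream to the target Delta table.
(
    deduped.writeStream
    .option("checkpointLocation", "dbfs:/checkpoints")
    .toTable("orders_unique")
)

With the watermark in place, Structured Streaming can discard deduplication state older than the threshold instead of letting it grow without bound, while duplicates that arrive within the allowed lateness are still caught.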