
Answer-first summary for fast verification
Answer: Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
The question tests how Structured Streaming on Databricks combines a watermark with deduplication. The job applies a 2-hour watermark on `time` and calls `dropDuplicates` on the composite key (`customer_id`, `order_id`). The watermark lets the query tolerate data that arrives up to 2 hours late, but it does not guarantee deduplication when copies of the same order are enqueued more than 2 hours apart and land in separate job runs. The snippet configures no checkpoint location, so this explanation treats each run as independent: state, including the keys already seen, is not persisted between runs. Duplicates that arrive in different runs may therefore go undetected and end up in the orders table, which is what option A describes. The other options misread either the role of the watermark or how `dropDuplicates` behaves across job runs.
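The reasoning above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark code: `run_once`, the sample records, and the timestamps are hypothetical, and it assumes, as the explanation argues, that no deduplication state carries over from one run to the next.

```python
from datetime import datetime, timedelta


def run_once(batch, table):
    """Model one trigger-once run: deduplicate within the batch, then append.

    Assumption: `seen` starts empty on every run, i.e. no deduplication
    state survives from the previous run.
    """
    seen = set()
    for record in batch:
        key = (record["customer_id"], record["order_id"])
        if key not in seen:
            seen.add(key)
            table.append(record)


t0 = datetime(2024, 1, 1, 0, 0)  # hypothetical enqueue time

first_run = [
    {"customer_id": 1, "order_id": 100, "time": t0},
    # Duplicate inside the same run: dropped by the in-run key set.
    {"customer_id": 1, "order_id": 100, "time": t0 + timedelta(minutes=5)},
]
later_run = [
    # Same order enqueued again 3 hours later, picked up by a later run.
    {"customer_id": 1, "order_id": 100, "time": t0 + timedelta(hours=3)},
]

orders = []
run_once(first_run, orders)
run_once(later_run, orders)

keys = [(r["customer_id"], r["order_id"]) for r in orders]
print(len(keys), len(set(keys)))  # rows written vs. distinct composite keys
```

Under these assumptions the duplicate inside the first run is dropped, while the copy that shows up in the later run is appended again, so the modeled table holds two rows for one composite key.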
Author: LeetQuiz Editorial Team
A task orchestrator is configured to execute two hourly tasks. First, an external system writes Parquet data to a mounted directory at /mnt/raw_orders/. Following this data write, a Databricks job runs the following code:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders"))
Given that customer_id and order_id form a composite key to uniquely identify orders, and the time field represents when the record was queued in the source system, which statement is true if the upstream system occasionally enqueues duplicate entries for the same order hours apart?
A. Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
B. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
C. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
D. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.