A task orchestrator is configured to execute two hourly tasks. First, an external system writes Parquet data to a mounted directory at /mnt/raw_orders/. Following this data write, a Databricks job runs the following code:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders"))
Given that customer_id and order_id form a composite key that uniquely identifies an order, and the time field records when the record was queued in the source system, which statement is true if the upstream system occasionally enqueues duplicate entries for the same order hours apart?
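To reason about the question, it helps to model what a watermarked dropDuplicates does: Spark keeps per-key state for deduplication, and evicts a key's state once the watermark (roughly the maximum event time seen, minus the delay) passes that key's event time. The sketch below is a simplified, single-threaded simulation of that mechanism, not Spark's actual implementation (real Structured Streaming advances the watermark between micro-batches, not per record); the helper name dedup_stream and the record layout are hypothetical.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(hours=2)  # mirrors withWatermark("time", "2 hours")

def dedup_stream(records):
    """Simulate dropDuplicates(["customer_id", "order_id"]) under a watermark.

    records: iterable of (customer_id, order_id, event_time) in arrival order.
    A record is dropped only while its key is still held in state; once the
    watermark passes a key's event time, the state is evicted and a late
    duplicate of that order is emitted again.
    """
    state = {}            # (customer_id, order_id) -> event_time of kept record
    max_event_time = None
    emitted = []
    for cust, order, t in records:
        max_event_time = t if max_event_time is None else max(max_event_time, t)
        watermark = max_event_time - WATERMARK_DELAY
        # Evict keys whose retained event time has fallen behind the watermark.
        state = {k: v for k, v in state.items() if v >= watermark}
        key = (cust, order)
        if key not in state:
            emitted.append((cust, order, t))
            state[key] = t
    return emitted
```

Under this model, a duplicate enqueued within 2 hours of the original is dropped, but a duplicate enqueued more than 2 hours later arrives after the original's state has been evicted and is written to the orders table a second time.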