
A task orchestrator is configured to run two tasks each hour. First, an external system writes Parquet data to a mounted directory at /mnt/raw_orders/. After this write completes, a Databricks job runs the following code:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders"))
Given that customer_id and order_id form a composite key to uniquely identify orders, and the time field represents when the record was queued in the source system, which statement is true if the upstream system occasionally enqueues duplicate entries for the same order hours apart?
A. Duplicate records enqueued more than 2 hours apart may be retained, and the orders table may contain duplicate records with the same customer_id and order_id.
B. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
C. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
D. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.