Databricks Certified Data Engineer - Professional Quiz

A task orchestrator is configured to execute two hourly tasks. First, an external system writes Parquet data to a mounted directory at `/mnt/raw_orders/`. Following this data write, a Databricks job runs the following code:

(spark.readStream
  .format("parquet")
  .load("/mnt/raw_orders/")
  .withWatermark("time", "2 hours")
  .dropDuplicates(["customer_id", "order_id"])
  .writeStream
  .trigger(once=True)
  .table("orders"))

Given that `customer_id` and `order_id` form a composite key to uniquely identify orders, and the `time` field represents when the record was queued in the source system, which statement is true if the upstream system occasionally enqueues duplicate entries for the same order hours apart?
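
To reason about the scenario, it can help to model watermark-bounded deduplication in plain Python. The sketch below is a simplified toy model, not Spark's implementation: the function name `dedup_with_watermark`, the sample records, and the eviction rule (a key is forgotten once the watermark passes the time of its first record) are all assumptions made for illustration. Whether Spark evicts state for this exact query may depend on the Spark version and on whether the event-time column is part of the deduplication key, so treat this only as an illustration of the general idea.

```python
from datetime import datetime, timedelta


def dedup_with_watermark(batches, delay):
    """Toy model of watermark-bounded deduplication; NOT Spark's implementation.

    batches: list of micro-batches, each a list of
             (customer_id, order_id, event_time) tuples.
    delay:   timedelta used as the watermark delay (e.g. 2 hours).
    """
    seen = {}         # (customer_id, order_id) -> event time of first record
    watermark = None  # no watermark until the first batch has been processed
    emitted = []

    for batch in batches:
        for customer_id, order_id, event_time in batch:
            # Assumption: records older than the watermark are dropped as late.
            if watermark is not None and event_time < watermark:
                continue
            key = (customer_id, order_id)
            if key not in seen:
                seen[key] = event_time
                emitted.append((customer_id, order_id, event_time))

        # Advance the watermark to (max event time in this batch) - delay,
        # never letting it move backwards.
        times = [t for _, _, t in batch]
        if times:
            candidate = max(times) - delay
            watermark = candidate if watermark is None else max(watermark, candidate)
            # Assumption: keys first seen before the watermark are evicted.
            seen = {k: t for k, t in seen.items() if t >= watermark}

    return emitted


# Hypothetical hourly runs: one micro-batch per trigger.
day = datetime(2024, 1, 1)
batches = [
    [("c1", "o1", day + timedelta(hours=9))],
    [("c1", "o1", day + timedelta(hours=10))],              # duplicate, 1 hour later
    [("c2", "o2", day + timedelta(hours=12, minutes=30))],  # advances the watermark
    [("c1", "o1", day + timedelta(hours=13))],              # duplicate, 4 hours later
]
result = dedup_with_watermark(batches, timedelta(hours=2))
keys = [(c, o) for c, o, _ in result]
```

In this toy run the 10:00 duplicate is suppressed because the key is still in state, while the 13:00 duplicate is emitted again because the key was evicted once the watermark moved past 09:00.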
