An upstream system writes Parquet data in hourly batches to date-named directories. A nightly batch job processes the previous day's data (specified by the date variable) using this code:
(spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")
    .dropDuplicates(["customer_id", "order_id"])
    .write
    .mode("append")
    .saveAsTable("orders"))
Given that customer_id and order_id form a composite key for unique order identification, and the upstream system sometimes generates duplicate entries for the same order hours apart, which statement is accurate?
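
Note on the behavior the question is probing: dropDuplicates operates only on the DataFrame being processed, i.e., within a single day's directory, and cannot see rows already appended to orders by earlier runs. Below is a minimal sketch of a cross-batch-safe alternative, assuming orders is a Delta table and the delta-spark package is available (neither is stated in the question); MERGE skips rows whose composite key already exists in the table, making the nightly write idempotent.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Deduplicate within the day's batch, as the original code does.
daily = (spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")  # date is supplied by the job, as in the question
    .dropDuplicates(["customer_id", "order_id"]))

# Insert only orders whose composite key is not already in the table,
# so duplicates arriving on a later date (or on a rerun) are skipped.
(DeltaTable.forName(spark, "orders").alias("t")
    .merge(daily.alias("s"),
           "t.customer_id = s.customer_id AND t.order_id = s.order_id")
    .whenNotMatchedInsertAll()
    .execute())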