An upstream system writes Parquet data in hourly batches to date-named directories. A nightly batch job processes the previous day's data (specified by the date variable) using this code:
(spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")
    .dropDuplicates(["customer_id", "order_id"])
    .write
    .mode("append")
    .saveAsTable("orders"))
Given that customer_id and order_id form a composite key for unique order identification, and the upstream system sometimes generates duplicate entries for the same order hours apart, which statement is accurate?
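
Note on the behavior the question is probing: dropDuplicates operates only on the DataFrame being processed, i.e., within a single day's directory, and cannot see rows already appended to orders by earlier runs. Below is a minimal sketch of a cross-batch-safe alternative, assuming orders is a Delta table and the delta-spark package is available (neither is stated in the question); MERGE skips rows whose composite key already exists in the table, making the nightly write idempotent.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Deduplicate within the day's batch, as the original code does.
daily = (spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")  # date is supplied by the job, as in the question
    .dropDuplicates(["customer_id", "order_id"]))

# Insert only orders whose composite key is not already in the table,
# so duplicates arriving on a later date (or on a rerun) are skipped.
(DeltaTable.forName(spark, "orders").alias("t")
    .merge(daily.alias("s"),
           "t.customer_id = s.customer_id AND t.order_id = s.order_id")
    .whenNotMatchedInsertAll()
    .execute())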