
An upstream system writes Parquet data in hourly batches to date-named directories. A nightly batch job processes the previous day's data (specified by the date variable) using this code:
(spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")
    .dropDuplicates(["customer_id", "order_id"])
    .write
    .mode("append")
    .saveAsTable("orders"))
Given that customer_id and order_id together form a composite key that uniquely identifies an order, and the upstream system sometimes generates duplicate entries for the same order hours apart, which statement is accurate?
A
Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
B
Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
C
Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
D
Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
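Note: dropDuplicates operates only on the DataFrame it is called on, and an append-mode write never inspects rows already in the target table. A minimal sketch of a cross-batch-safe alternative using a Delta Lake MERGE (this assumes orders is a Delta table; the table and path names mirror the question and are otherwise illustrative):

from delta.tables import DeltaTable

# Deduplicate within the current batch only; rows already present in
# the orders table are invisible to dropDuplicates.
batch = (spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")
    .dropDuplicates(["customer_id", "order_id"]))

# MERGE compares the composite key against existing rows, so an order
# that already arrived in an earlier batch is not inserted again.
(DeltaTable.forName(spark, "orders")
    .alias("t")
    .merge(batch.alias("s"),
           "t.customer_id = s.customer_id AND t.order_id = s.order_id")
    .whenNotMatchedInsertAll()
    .execute())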