
Answer-first summary for fast verification
Answer: Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
The snippet applies `dropDuplicates` to the composite key (customer_id, order_id) on the data read from the directory for the given date, so duplicates within that day's batch (all hourly files for the date) are removed before the write to the 'orders' table. Because the write mode is 'append', the operation does not compare the incoming rows with records already in the 'orders' table. Each batch is therefore free of duplicates internally, but the same order can still appear more than once in the table if an earlier batch already wrote it. Hence option B: each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
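The behavior can be illustrated with a small plain-Python model that needs no Spark. This is only a sketch: the helper names `drop_duplicates` and `append`, the sample rows, and the `hour` column are all hypothetical. It also assumes the same order reappears in the next day's directory, and it keeps the first row per key, whereas Spark does not promise which duplicate survives.

```python
# Plain-Python model of the nightly job (illustrative only, not Spark code).
# Rows are dicts; the list `orders` stands in for the target table.

def drop_duplicates(rows, keys):
    """Keep one row per key tuple within a single batch.

    Keeping the first row seen is a simplification; Spark's dropDuplicates
    does not guarantee which of the duplicate rows is retained.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def append(table, rows):
    """Model append mode: add rows without looking at what the table holds."""
    table.extend(rows)


KEYS = ["customer_id", "order_id"]
orders = []

# Night 1: order (1, 100) was emitted twice, hours apart, on the same date.
day1 = [
    {"customer_id": 1, "order_id": 100, "hour": 9},
    {"customer_id": 1, "order_id": 100, "hour": 14},
    {"customer_id": 2, "order_id": 200, "hour": 11},
]
append(orders, drop_duplicates(day1, KEYS))

# Night 2 (assumed scenario): the same order appears again in the next batch.
day2 = [
    {"customer_id": 1, "order_id": 100, "hour": 1},
    {"customer_id": 3, "order_id": 300, "hour": 2},
]
append(orders, drop_duplicates(day2, KEYS))

# Count how many times each composite key now appears in the table.
key_counts = {}
for row in orders:
    key = (row["customer_id"], row["order_id"])
    key_counts[key] = key_counts.get(key, 0) + 1

print(key_counts)
```

In this model each appended batch is unique on the key, yet the key (1, 100) ends up in the table twice, which is the situation option B describes. Avoiding that would require comparing the incoming rows with the existing table before writing, which the code in the question does not do.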
Author: LeetQuiz Editorial Team
An upstream system writes Parquet data in hourly batches to date-named directories. A nightly batch job processes the previous day's data (specified by the date variable) using this code:
(spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")
    .dropDuplicates(["customer_id", "order_id"])
    .write
    .mode("append")
    .saveAsTable("orders"))
Given that customer_id and order_id form a composite key for unique order identification, and the upstream system sometimes generates duplicate entries for the same order hours apart, which statement is accurate?
A. Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
C. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
D. Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.