
Answer-first summary for fast verification
Answer (Option C): Each batch write will contain unique records, but the target table may still accumulate duplicates if a record was already written in a previous run.
The `dropDuplicates` transformation is applied to the **incoming DataFrame only** (note that it is a wide transformation requiring a shuffle, not a narrow one, but its scope is still limited to that DataFrame). It removes duplicate `(customer_id, order_id)` rows within that specific batch. When writing with `.mode("append")`, Spark (and Delta Lake) simply adds the new records to the existing table without performing any lookup or comparison against data already stored in the target. Therefore, while each individual batch is deduplicated before the write, duplicates can still accumulate in the `orders` table across runs. To achieve global deduplication over the entire table, a `MERGE` (upsert) operation is needed to check for existing records before inserting.
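The failure mode above can be sketched without Spark at all. The toy Python model below (the `(customer_id, order_id)` tuples are hypothetical sample keys) deduplicates each batch in isolation and then appends, mirroring `dropDuplicates` followed by `.mode("append")`:

```python
# Toy model (plain Python, no Spark) of why per-batch deduplication
# does not prevent duplicates across appended batches.

def drop_duplicates(batch):
    """Deduplicate within a single batch, like dropDuplicates
    applied to the incoming DataFrame only."""
    seen, out = set(), []
    for row in batch:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

table = []  # the target "orders" table

# Night 1: the batch contains an internal duplicate of (1, 100)
batch1 = [(1, 100), (1, 100), (2, 200)]
table.extend(drop_duplicates(batch1))  # append mode: no check against table

# Night 2: upstream re-sends (1, 100) in a later batch
batch2 = [(1, 100), (3, 300)]
table.extend(drop_duplicates(batch2))

print(table)
# → [(1, 100), (2, 200), (1, 100), (3, 300)]
# (1, 100) now appears twice, even though each batch was
# internally unique when written.
```

With Delta Lake, replacing the append with a `MERGE INTO orders ... WHEN NOT MATCHED THEN INSERT` keyed on `(customer_id, order_id)` would perform the missing lookup against existing rows and keep the table globally unique.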
Author: LeetQuiz Editorial Team
A nightly Spark batch job ingests Parquet data from an upstream source located at `/mnt/raw_orders/{{date}}`. The job applies `dropDuplicates(["customer_id", "order_id"])` to the incoming DataFrame before writing to the target table `orders` using the `append` mode. If the upstream system occasionally generates duplicate order entries across different batches, how will duplicate records be handled in the target table?
A. Existing records in the target table with matching keys will be overwritten by the incoming data.
B. The write job will deduplicate the union of the new data and the existing table data, ensuring the final table remains unique.
C. Each batch write will contain unique records, but the target table may still accumulate duplicates if a record was already written in a previous run.
D. The operation will fail with a constraint violation error if matching keys are detected in the target table.
E. The append operation will automatically filter out any incoming records that already exist in the target table.