
A nightly Spark batch job ingests Parquet data from an upstream source located at /mnt/raw_orders/{{date}}. The job applies dropDuplicates(["customer_id", "order_id"]) to the incoming DataFrame before writing to the target table orders in append mode. If the upstream system occasionally generates duplicate order entries across different batches, how will duplicate records be handled in the target table?
A. Existing records in the target table with matching keys will be overwritten by the incoming data.
B. The write job will deduplicate the union of the new data and the existing table data, ensuring the final table remains unique.
C. Each batch write will contain unique records, but the target table may still accumulate duplicates if a record was already written in a previous run.
D. The operation will fail with a constraint violation error if matching keys are detected in the target table.
E. The append operation will automatically filter out any incoming records that already exist in the target table.
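
For reference, here is a minimal PySpark sketch of the job described in the question, assuming a standard SparkSession; the path, key columns, and table name come from the question, while the batch_date value is a hypothetical stand-in for the {{date}} placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly_orders_ingest").getOrCreate()

batch_date = "2024-01-01"  # hypothetical value substituted for {{date}}

# Read the incoming batch of raw orders.
incoming = spark.read.parquet(f"/mnt/raw_orders/{batch_date}")

# dropDuplicates deduplicates rows only within this DataFrame, i.e. within
# the current batch; it never consults the existing contents of the target table.
deduped = incoming.dropDuplicates(["customer_id", "order_id"])

# Append mode adds the batch to the table as-is, without comparing keys
# against rows written by earlier runs.
deduped.write.mode("append").saveAsTable("orders")
```

Note that in this sketch both the deduplication and the write operate solely on the incoming batch; neither step reads the existing orders table.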