
Answer-first summary for fast verification
Answer: De-duplicate records within each batch, and then merge the result into the target table using insert-only merge
To correctly perform streaming deduplication, use `dropDuplicates()` to eliminate duplicate records within each new micro-batch. Additionally, you must ensure that records being inserted are not already present in the target table; an insert-only merge (a merge whose only clause is `WHEN NOT MATCHED THEN INSERT`) achieves this. For more details, refer to the Spark and Databricks documentation on `dropDuplicates()` and merge operations.
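The two-step pattern can be modeled in plain Python. This is a toy sketch, not the Spark or Delta Lake API: the dictionary `target` stands in for the target table, and the function names `dedupe_batch` and `insert_only_merge` are illustrative.

```python
# Toy model of streaming deduplication: each micro-batch is first
# de-duplicated on its key, then merged into the target with
# insert-only semantics (existing keys are never updated or re-inserted).

def dedupe_batch(batch, key="id"):
    """Keep the first record seen for each key within one micro-batch."""
    seen = {}
    for record in batch:
        seen.setdefault(record[key], record)
    return list(seen.values())

def insert_only_merge(target, batch, key="id"):
    """Insert records whose key is absent from the target; skip the rest."""
    for record in batch:
        if record[key] not in target:
            target[record[key]] = record

target = {}
batches = [
    [{"id": 1, "v": "a"}, {"id": 1, "v": "a-dup"}, {"id": 2, "v": "b"}],
    [{"id": 2, "v": "b-late-dup"}, {"id": 3, "v": "c"}],
]
for batch in batches:
    insert_only_merge(target, dedupe_batch(batch))

print(sorted(target))  # → [1, 2, 3]
```

Note that the duplicate of `id` 2 arriving in the second batch is silently skipped by the insert-only merge, which is exactly why a plain append (option C) is insufficient: `dropDuplicates()` alone cannot see duplicates that span micro-batches.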
Author: LeetQuiz Editorial Team
Which method correctly performs streaming deduplication?
A
De-duplicate records within each batch, rank the result, and then insert only records having rank = 1 into the target table
B
De-duplicate records in all batches with watermarking, and then overwrite the target table by the result
C
De-duplicate records within each batch, and then append the result into the target table
D
De-duplicate records within each batch, and then merge the result into the target table using insert-only merge
E
None of the above approaches correctly performs streaming deduplication