
Answer-first summary for fast verification
Answer: Use watermarking to limit the state store size and the `dropDuplicates` method to remove duplicate records.
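A watermark on the event-time column tells Spark how late a record is allowed to arrive, so deduplication state older than the watermark can be discarded instead of growing without bound. The `dropDuplicates` method then removes records that repeat the chosen key columns. Include the event-time column among those keys so that the watermark can expire old state. Without a watermark, Spark has to keep every key it has seen in the state store for the life of the query.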
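The two steps in the explanation can be sketched as a small helper. This is a minimal illustration, not a definitive implementation: the function name `deduplicate_stream`, the column names `event_id` and `event_time`, and the 10-minute delay are all assumptions chosen for the example. Only `withWatermark` and `dropDuplicates` are actual DataFrame methods.

```python
def deduplicate_stream(events, id_col="event_id", time_col="event_time",
                       delay="10 minutes"):
    """Drop duplicate records from a streaming DataFrame with bounded state.

    `events` is assumed to be a streaming pyspark.sql.DataFrame that has a
    unique-identifier column (`id_col`) and an event-time column (`time_col`).
    These names and the default delay are hypothetical.
    """
    return (
        events
        # Step 1: declare how late data may arrive; state older than
        # (max event time seen - delay) becomes eligible for cleanup.
        .withWatermark(time_col, delay)
        # Step 2: deduplicate on the id plus the event-time column, so the
        # watermark can be used to expire old deduplication state.
        .dropDuplicates([id_col, time_col])
    )
```

A typical call would pass a DataFrame obtained from `spark.readStream` and then start the query with `writeStream` on the result. On Spark 3.5 and later, `dropDuplicatesWithinWatermark` may be a better fit when duplicates of the same record can carry different event times.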
Author: LeetQuiz Editorial Team
Given a requirement to implement necessary logic for deduplication using Spark Structured Streaming, describe the steps you would take to ensure that duplicate records are identified and removed. Include the use of watermarking and the dropDuplicates method.
A. Use the dropDuplicates method without using watermarking.
B. Use watermarking to limit the state store size and the dropDuplicates method to remove duplicate records.
C. Ignore duplicate records and focus only on the current data stream.
D. Use a batch query to handle deduplication.