
In a real-time data processing scenario using Spark Structured Streaming, you are tasked with implementing a deduplication mechanism to ensure that each record in your stream is unique based on a specific key. The solution must efficiently handle late-arriving data and scale to process millions of records per second. Considering the constraints of cost, compliance, and scalability, which of the following approaches is BEST for achieving deduplication in this context? Choose the single best option.
A
Utilize the 'distinct' function to filter out duplicate records, as it is the simplest method to implement.
B
Apply the 'groupBy' function combined with a window operation to aggregate and remove duplicates within a specific time frame, so that the data is processed in discrete batches.
C
Implement the 'groupBy' function with a watermark to manage late data and deduplicate records, providing a balance between latency and resource usage.
D
Use the 'mapWithState' function to maintain state for each unique key, updating it with the latest record and thereby removing duplicates, which is optimal for handling high-volume streams with late-arriving data.
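
For reference, below is a minimal PySpark sketch of watermark-based streaming deduplication of the kind described in option C, expressed here with dropDuplicates rather than groupBy. The column names (record_id, event_time), the input path, the 10-minute watermark, and the console sink are illustrative assumptions, not values taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("streaming-dedup-sketch").getOrCreate()

# Assumed schema: a deduplication key and an event-time column.
schema = StructType([
    StructField("record_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Assumed source: JSON files landing in a directory; any streaming source works.
events = (
    spark.readStream
         .schema(schema)
         .json("/tmp/events")
)

deduped = (
    events
    # The watermark bounds how late a duplicate may arrive, letting Spark
    # discard old deduplication state instead of keeping it forever.
    .withWatermark("event_time", "10 minutes")
    # Deduplicate on the key together with the event-time column so that
    # state can be cleaned up once the watermark passes.
    .dropDuplicates(["record_id", "event_time"])
)

# Assumed sink: console output for illustration only.
query = (
    deduped.writeStream
           .format("console")
           .outputMode("append")
           .start()
)
query.awaitTermination()
```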