
You are using Spark Structured Streaming to process real-time sales data from multiple sources. Each record carries an event timestamp, and network latency causes some records to arrive late. The application must reflect sales figures accurately, including records that arrive up to 2 hours late, and it must minimize resource usage by managing state efficiently. Which of the following approaches BEST meets these requirements? Choose one of the four options.
A
Ignore late data to ensure the processing pipeline is not delayed, as the sales figures are time-sensitive and late data is negligible.
B
Use a fixed delay window to buffer all incoming data for 2 hours before processing, ensuring no data is missed but increasing resource usage.
C
Implement watermarks with the 'withWatermark' function set to 2 hours to handle late data efficiently, allowing the system to manage state and update results continuously.
D
Process all data in batch mode at the end of each day, combining all sales data regardless of when it arrived, to ensure completeness without worrying about late data.
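
For reference when weighing the options, the watermark mechanism named in option C can be modelled in a few lines of plain Python. This is a simplified sketch, not Spark's implementation: the class `WatermarkedSum`, the 1-hour tumbling window, and the use of integer seconds for event time are all assumptions made for illustration. It shows the two effects a 2-hour watermark has on a windowed aggregation: late events are still merged while their window is open, and state for windows older than the watermark is evicted.

```python
from collections import defaultdict

WINDOW_SEC = 3600          # 1-hour tumbling windows (assumed for illustration)
LATENESS_SEC = 2 * 3600    # 2-hour lateness threshold from the scenario


class WatermarkedSum:
    """Toy model of a windowed sum bounded by an event-time watermark (hypothetical helper)."""

    def __init__(self):
        self.state = defaultdict(float)  # window start -> running sales total
        self.max_event_time = 0
        self.watermark = 0
        self.dropped = 0

    def process_batch(self, events):
        """Process one micro-batch of (event_time_sec, amount) pairs.

        Returns the windows finalized (and evicted from state) by this batch.
        """
        for event_time, amount in events:
            window_start = event_time - event_time % WINDOW_SEC
            # Too late: the window this event belongs to is already finalized.
            if window_start + WINDOW_SEC <= self.watermark:
                self.dropped += 1
                continue
            self.state[window_start] += amount
            self.max_event_time = max(self.max_event_time, event_time)

        # The watermark trails the latest event time seen and never moves backwards.
        self.watermark = max(self.watermark, self.max_event_time - LATENESS_SEC)

        # Evict windows that can no longer receive accepted events.
        finalized = {
            w: total for w, total in self.state.items()
            if w + WINDOW_SEC <= self.watermark
        }
        for w in finalized:
            del self.state[w]
        return finalized
```

In Spark itself, the equivalent behaviour is requested by calling `withWatermark` on the streaming DataFrame with the event-time column name and a delay threshold such as `"2 hours"` before the windowed aggregation; the column name depends on the actual schema, which the question does not give.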