
Answer-first summary for fast verification
Answer: Implement watermarks with the 'withWatermark' function set to 2 hours to handle late data efficiently, allowing the system to manage state and update results continuously.
The best way to handle late-arriving sales data in Spark Structured Streaming, while keeping resource usage low and sales figures accurate, is to use watermarks (C). A watermark tracks the progress of event time and tells the engine how late data may arrive (here, 2 hours) and still be included in the results. The 'withWatermark' function is designed for exactly this: records that arrive within the threshold update the affected aggregates, while state for windows older than the watermark can be dropped, which keeps memory bounded. Ignoring late data (A) would produce inaccurate sales figures. Buffering all data for a fixed 2-hour delay (B) holds every record in memory and delays all results, which increases resource usage unnecessarily. Processing in batch mode at the end of each day (D) does not meet the real-time processing requirement.
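To make the trade-off concrete, here is a minimal pure-Python sketch of the idea behind a 2-hour watermark. It is a toy model and not Spark's implementation: the class name `WatermarkedSum`, the one-hour tumbling window, and the rule "watermark = latest event time seen minus the threshold, advanced after each batch" are simplifying assumptions for illustration. In PySpark itself, the corresponding call is `withWatermark` on a streaming DataFrame, given an event-time column name and a delay string such as "2 hours", followed by a windowed aggregation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)      # tumbling window size (assumed for illustration)
THRESHOLD = timedelta(hours=2)   # lateness threshold from the question


def window_start(ts):
    """Floor a timestamp to the start of its one-hour tumbling window."""
    return ts.replace(minute=0, second=0, microsecond=0)


class WatermarkedSum:
    """Toy model of a watermarked windowed sum (hypothetical, not Spark code)."""

    def __init__(self):
        self.state = defaultdict(float)  # open windows: start -> running total
        self.max_event_time = None       # latest event time seen so far
        self.watermark = None            # max_event_time - THRESHOLD
        self.dropped = []                # records rejected as too late

    def process_batch(self, records):
        """Consume (event_time, amount) pairs; return windows finalized now."""
        for ts, amount in records:
            start = window_start(ts)
            # Too late: the record's window already ended at or before the watermark.
            if self.watermark is not None and start + WINDOW <= self.watermark:
                self.dropped.append((ts, amount))
                continue
            self.state[start] += amount
            if self.max_event_time is None or ts > self.max_event_time:
                self.max_event_time = ts

        if self.max_event_time is None:
            return {}  # nothing seen yet, so there is no watermark

        # Advance the watermark after the batch, then evict windows it has passed.
        self.watermark = self.max_event_time - THRESHOLD
        finalized = {
            start: total
            for start, total in self.state.items()
            if start + WINDOW <= self.watermark
        }
        for start in finalized:
            del self.state[start]  # state is freed once a window is final
        return finalized
```

In this model, a sale stamped 10:45 that arrives after a 12:30 event is still added to the 10:00 window, because that window has not yet fallen behind the watermark (10:30). Once an event at 13:05 moves the watermark to 11:05, the 10:00 window is finalized and its state is removed, so a later record stamped 10:50 is rejected. Only the windows within the last 2 hours of event time are held in memory, which is the state-management benefit that option C describes.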
Author: LeetQuiz Editorial Team
In the context of Spark Structured Streaming, consider a scenario where you are processing real-time sales data from multiple sources. The data includes timestamps, and due to network latency, some data arrives late. Your application must accurately reflect sales figures, including late-arriving data, up to 2 hours after the expected time. Additionally, the solution must minimize resource usage by efficiently managing state. Which of the following approaches BEST meets these requirements? Choose the correct option from the four provided.
A
Ignore late data to ensure the processing pipeline is not delayed, as the sales figures are time-sensitive and late data is negligible.
B
Use a fixed delay window to buffer all incoming data for 2 hours before processing, ensuring no data is missed but increasing resource usage.
C
Implement watermarks with the 'withWatermark' function set to 2 hours to handle late data efficiently, allowing the system to manage state and update results continuously.
D
Process all data in batch mode at the end of each day, combining all sales data regardless of when it arrived, to ensure completeness without worrying about late data.