
Answer-first summary for fast verification
Answer: Use the watermark feature to specify a threshold for late data and update window aggregates accordingly.
The watermark feature in Spark Structured Streaming lets you specify how late data may arrive, relative to the maximum event time seen so far, before the engine finalizes the results for a given window and drops its state. Late records that fall within the watermark threshold are still incorporated into their window aggregates, keeping results accurate while bounding the amount of state the engine must retain. By contrast, ignoring late data produces incomplete aggregates; raising the state timeout to an arbitrarily high value leads to unbounded state growth; and manually adjusting the system clock is error-prone and not a supported practice.
Author: LeetQuiz Editorial Team
How can you ensure accurate results in a Spark Structured Streaming job that processes time-windowed aggregates when dealing with late-arriving data?
A
Increase the state timeout duration to an arbitrarily high value to account for all possible late data.
B
Manually adjust the system clock to account for data latency before processing each micro-batch.
C
Use the watermark feature to specify a threshold for late data and update window aggregates accordingly.
D
Ignore late data, focusing only on data arriving within the expected time window.