
Answer-first summary for fast verification
Answer: C. Data arriving within the defined watermark threshold is processed normally, while data arriving outside the threshold is ignored.
In Structured Streaming, a **watermark** manages state and handles late data in stateful operations such as windowed aggregations. The watermark is the maximum event time the engine has seen so far minus the configured threshold. A record whose event time is at or after the watermark can still update state and be included in results. A record whose event time is older than the watermark is considered too late and is dropped (ignored), which lets the engine discard old state and keep its size manageable. Strictly speaking, Spark guarantees only that data within the threshold is never dropped; data later than the threshold may or may not be aggregated, so it should not be relied on.
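The rule above can be sketched in plain Python. This is a simplified model for illustration, not Spark's implementation: the function `apply_watermark`, the helper `at`, and the sample events are all hypothetical, and lateness is checked per record rather than per window. It does assume, as Spark does, that the watermark advances only between micro-batches. In a real pipeline the threshold is declared with `DataFrame.withWatermark(eventTime, delayThreshold)`.

```python
from datetime import datetime, timedelta


def apply_watermark(batches, threshold):
    """Split events into accepted and dropped using a watermark.

    Simplified model (assumption): the watermark is recomputed after each
    micro-batch as (max event time seen so far) - threshold, and a record
    is dropped when its event time is older than the current watermark.
    """
    max_event_time = None  # latest event time seen in any batch so far
    watermark = None       # undefined until the first batch has been processed
    accepted, dropped = [], []

    for batch in batches:
        for event_time, payload in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(payload)   # too late: older than the watermark
            else:
                accepted.append(payload)  # within the threshold

        # Advance the watermark only after the whole batch is processed.
        for event_time, _ in batch:
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
        if max_event_time is not None:
            watermark = max_event_time - threshold

    return accepted, dropped


def at(minute):
    """Hypothetical helper: an event time on 2024-01-01 at 12:<minute>."""
    return datetime(2024, 1, 1, 12, minute)


# Hypothetical sample data with a 10-minute threshold.
batches = [
    [(at(10), "a"), (at(12), "b")],  # max event time 12:12, so watermark becomes 12:02
    [(at(5), "c"), (at(1), "d")],    # 12:05 is within the threshold, 12:01 is not
]
accepted, dropped = apply_watermark(batches, timedelta(minutes=10))
```

With this sample data, `c` arrives out of order but is still accepted because its event time is not older than the 12:02 watermark, while `d` falls behind the watermark and is dropped.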
Author: LeetQuiz Editorial Team
In a Spark Structured Streaming pipeline using Delta Lake, how is late-arriving data handled when a watermark threshold has been defined?
A
Watermarking ensures that all late-arriving data is eventually processed, prioritizing data completeness over processing latency.
B
Late-arriving data is automatically redirected to a separate side-car table for manual reconciliation and auditing.
C
Data arriving within the defined watermark threshold is processed normally, while data arriving outside the threshold is ignored.
D
Delta Lake ignores the watermark and automatically updates the relevant historical partitions to maintain strict ACID consistency.