
Explanation:
In Structured Streaming, a watermark bounds how long the engine keeps state and waits for late data. The watermark is the maximum event time seen so far minus a user-defined delay threshold. A record whose event time falls within that threshold can still update state and be included in results. A record whose event time is older than the watermark is considered too late and is dropped (ignored), which keeps the state size bounded and resource usage predictable.
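The rule above can be sketched in plain Python. This is a simplified model for illustration only, not Spark's implementation: the function name `simulate_watermark`, the batch layout, and the 10-minute threshold are all hypothetical, and the sketch compares each record's event time directly against the watermark, whereas a real windowed aggregation reasons about window boundaries. It assumes the watermark is recomputed after each micro-batch from the maximum event time seen so far.

```python
from datetime import datetime, timedelta


def simulate_watermark(batches, threshold):
    """Simplified model of watermark-based late-data handling.

    batches:   list of micro-batches, each a list of event timestamps
    threshold: timedelta, the allowed lateness (the delay threshold)
    Returns (accepted, dropped) lists of timestamps.
    """
    max_event_time = None  # largest event time seen so far
    watermark = None       # no watermark until the first batch completes
    accepted, dropped = [], []

    for batch in batches:
        for ts in batch:
            # Records older than the current watermark are too late.
            if watermark is not None and ts < watermark:
                dropped.append(ts)
            else:
                accepted.append(ts)

        # Assumption: the watermark advances at the end of each batch
        # and never moves backwards.
        if batch:
            batch_max = max(batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
            watermark = max_event_time - threshold

    return accepted, dropped


def at(hour, minute):
    # Hypothetical helper to build timestamps on one fixed day.
    return datetime(2024, 1, 1, hour, minute)


batches = [
    [at(12, 0), at(12, 10)],   # watermark becomes 12:00 after this batch
    [at(12, 5), at(11, 55)],   # 11:55 is older than 12:00
    [at(12, 30)],              # watermark becomes 12:20 after this batch
    [at(12, 15), at(12, 25)],  # 12:15 is older than 12:20
]
accepted, dropped = simulate_watermark(batches, timedelta(minutes=10))
```

In PySpark the threshold itself is declared with `DataFrame.withWatermark(eventTime, delayThreshold)`, for example `withWatermark("event_time", "10 minutes")`, where the column name `event_time` is an assumption.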
In a Spark Structured Streaming pipeline using Delta Lake, how is late-arriving data handled when a watermark threshold has been defined?
A
Watermarking ensures that all late-arriving data is eventually processed, prioritizing data completeness over processing latency.
B
Late-arriving data is automatically redirected to a separate side-car table for manual reconciliation and auditing.
C
Data arriving within the defined watermark threshold is processed normally, while data arriving outside the threshold is ignored.
D
Delta Lake ignores the watermark and automatically updates the relevant historical partitions to maintain strict ACID consistency.