
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?
A
Checkpointing and Write-ahead Logs
B
Structured Streaming cannot record the offset range of the data being processed in each trigger.
C
Replayable Sources and Idempotent Sinks
D
Write-ahead Logs and Idempotent Sinks
E
Checkpointing and Idempotent Sinks
Explanation:
Correct Answer: A (Checkpointing and Write-ahead Logs)
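Structured Streaming uses two mechanisms to record the offset range of the data processed in each trigger, which is the basis of its fault tolerance. (End-to-end exactly-once semantics additionally rely on replayable sources and idempotent sinks.)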
Checkpointing: The query's progress and any running state are saved to a fault-tolerant checkpoint location, such as a directory on HDFS or cloud object storage. This metadata includes the offset ranges that have been processed.
Write-ahead Logs (WAL): Before a trigger's data is processed, the offset range for that trigger is written to a log in the checkpoint location. In Structured Streaming the log records offsets, not the data itself. If a failure occurs mid-batch, the engine reads the log on restart and reprocesses the same offset range from the source.
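Why other options are incorrect:
B: Incorrect. Recording the offset range for each trigger is exactly what checkpointing and write-ahead logs do.
C: Replayable sources and idempotent sinks are also needed for end-to-end exactly-once guarantees, but neither records offsets. Replayable sources let a logged range be re-read, and idempotent sinks make reprocessing safe.
D and E: Each pairs one correct mechanism with idempotent sinks, which handle duplicate writes rather than progress tracking.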
Key Concepts:
Together, checkpointing and write-ahead logs allow Structured Streaming to restart from the last recorded offsets and reprocess any data that was not successfully written to the sink.
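
The restart-and-reprocess behavior can be illustrated with a small pure-Python toy model. This is a sketch for intuition only, not Spark's implementation: the names ToyOffsetLog, run_batch, and recover are hypothetical, the source is a plain list standing in for a replayable source, and the sink is a dict keyed by batch id standing in for an idempotent sink. The offsets and commits subdirectory names mirror the layout of a Spark checkpoint directory.

```python
import json
import tempfile
from pathlib import Path


class ToyOffsetLog:
    """Toy model of a checkpoint directory with offsets/ and commits/ logs."""

    def __init__(self, checkpoint_dir):
        self.offsets = Path(checkpoint_dir) / "offsets"
        self.commits = Path(checkpoint_dir) / "commits"
        self.offsets.mkdir(parents=True, exist_ok=True)
        self.commits.mkdir(parents=True, exist_ok=True)

    def plan(self, batch_id, start, end):
        # Write-ahead step: persist the offset range BEFORE processing it.
        entry = json.dumps({"start": start, "end": end})
        (self.offsets / str(batch_id)).write_text(entry)

    def commit(self, batch_id):
        # Mark the batch as done only after the sink write has succeeded.
        (self.commits / str(batch_id)).write_text("{}")

    def pending(self):
        # Batches that were planned but never committed must be re-run.
        planned = {int(p.name) for p in self.offsets.iterdir()}
        done = {int(p.name) for p in self.commits.iterdir()}
        return {
            b: json.loads((self.offsets / str(b)).read_text())
            for b in sorted(planned - done)
        }


def run_batch(log, source, sink, batch_id, start, end, fail_before_commit=False):
    log.plan(batch_id, start, end)
    # Idempotent sink: keyed by batch id, so a replay overwrites, never duplicates.
    sink[batch_id] = source[start:end]
    if fail_before_commit:
        raise RuntimeError("simulated crash after sink write, before commit")
    log.commit(batch_id)


def recover(log, source, sink):
    # On restart, replay each uncommitted batch using the logged offset range.
    for batch_id, rng in log.pending().items():
        sink[batch_id] = source[rng["start"]:rng["end"]]
        log.commit(batch_id)


source = list(range(10))  # replayable source: any range can be re-read
sink = {}
checkpoint_dir = tempfile.mkdtemp()

log = ToyOffsetLog(checkpoint_dir)
run_batch(log, source, sink, 0, 0, 4)
try:
    run_batch(log, source, sink, 1, 4, 8, fail_before_commit=True)
except RuntimeError:
    pass  # the "process" died; batch 1 is logged but not committed

# A fresh object over the same directory plays the role of the restarted query.
restarted = ToyOffsetLog(checkpoint_dir)
recover(restarted, source, sink)
```

Because the offset range for batch 1 was logged before processing began, the restarted run knows exactly which range to replay, and the keyed sink ensures the replayed write does not produce duplicates. In a real query, the equivalent setup is to give the streaming writer a checkpointLocation option pointing at fault-tolerant storage.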