
**Answer: A. Checkpointing and Write-ahead Logs**
## Explanation

Structured Streaming uses two key mechanisms to reliably track processing progress and handle failures:

**1. Checkpointing** - Stores the current state of the streaming query, including:
- Progress information (which offsets have been processed)
- Aggregation state (for stateful operations)
- Metadata about the query

**2. Write-ahead Logs (WAL)** - Records the offset ranges being processed in each trigger before the actual processing begins. This ensures:
- Exactly-once processing semantics
- If a failure occurs during processing, the system can replay from the last recorded offset
- No data loss or duplication

**Why the other options are incorrect:**

- **Option B**: Incorrect - Structured Streaming can and does record offset ranges.
- **Option C**: Replayable sources and idempotent sinks are important concepts, but they are not the specific mechanisms for recording offset ranges.
- **Option D**: Write-ahead logs are correct, but idempotent sinks alone don't track offset ranges.
- **Option E**: Checkpointing is correct, but idempotent sinks don't track offset ranges.

The combination of checkpointing and write-ahead logs provides the fault tolerance and exactly-once semantics that Structured Streaming is known for.
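The interplay of the two mechanisms can be sketched in plain Python. This is a conceptual model only, not Spark's actual implementation; the class and function names (`CheckpointLogs`, `run_trigger`, etc.) are hypothetical. The key idea it illustrates is real, though: an offset range is logged *before* a batch runs, a commit is logged *after* it finishes, and on restart any logged-but-uncommitted range is replayed.

```python
import json
import os
import tempfile

class CheckpointLogs:
    """Conceptual sketch (not Spark's code): an offset write-ahead log
    appended before each trigger, and a commit log appended after."""

    def __init__(self, checkpoint_dir):
        self.offsets = os.path.join(checkpoint_dir, "offsets.log")
        self.commits = os.path.join(checkpoint_dir, "commits.log")

    def _append(self, path, obj):
        with open(path, "a") as f:
            f.write(json.dumps(obj) + "\n")

    def _read(self, path):
        if not os.path.exists(path):
            return []
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def plan_batch(self, batch_id, start, end):
        # WAL entry: the planned offset range, written BEFORE processing.
        self._append(self.offsets, {"batch": batch_id, "start": start, "end": end})

    def commit_batch(self, batch_id):
        # Commit entry: written only AFTER the batch completed.
        self._append(self.commits, {"batch": batch_id})

    def next_batch(self):
        """On (re)start: replay the last planned-but-uncommitted range,
        otherwise report where the next batch should begin."""
        planned = self._read(self.offsets)
        committed = {c["batch"] for c in self._read(self.commits)}
        if planned and planned[-1]["batch"] not in committed:
            p = planned[-1]
            return p["batch"], p["start"], p["end"]  # replay after failure
        start = planned[-1]["end"] if planned else 0
        next_id = planned[-1]["batch"] + 1 if planned else 0
        return next_id, start, None


def run_trigger(logs, source, batch_size, fail=False):
    batch_id, start, end = logs.next_batch()
    if end is None:  # a fresh batch, not a replay
        end = min(start + batch_size, len(source))
        logs.plan_batch(batch_id, start, end)
    out = source[start:end]  # "process": works because the source is replayable
    if fail:
        raise RuntimeError("simulated crash before commit")
    logs.commit_batch(batch_id)
    return out


source = list(range(10))  # stands in for a replayable source such as a Kafka topic
with tempfile.TemporaryDirectory() as ckpt:
    logs = CheckpointLogs(ckpt)
    print(run_trigger(logs, source, 4))          # [0, 1, 2, 3]
    try:
        run_trigger(logs, source, 4, fail=True)  # logs range [4, 8) then crashes
    except RuntimeError:
        pass
    # Restart: the uncommitted range [4, 8) is replayed exactly once.
    print(run_trigger(logs, source, 4))          # [4, 5, 6, 7]
```

In a real query, both logs live under the directory you pass via `.option("checkpointLocation", "/path/to/checkpoint")` on `writeStream`; Spark maintains `offsets/` and `commits/` subdirectories there that play the roles sketched above.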
Author: Keng Suppaseth
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which two of the following approaches are used by Spark to record the offset range of the data being processed in each trigger?
A. Checkpointing and Write-ahead Logs
B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
C. Replayable Sources and Idempotent Sinks
D. Write-ahead Logs and Idempotent Sinks
E. Checkpointing and Idempotent Sinks