
Explanation:
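Checkpointing is what makes a Spark Structured Streaming query fault tolerant. Spark periodically persists the query's progress and the state of its stateful operators to a reliable storage system such as HDFS, so that after a failure the query can restart from the checkpoint and restore that state. This is especially important for stateful operations, such as tracking a running count of events by key, where the counts must stay accurate across failures.
The other options do not protect state. Keeping state only in local memory (B) loses it when an executor fails, and disabling write-ahead logs (D) trades recoverability for throughput. Broadcasting join tables (A) and increasing shuffle partitions (E) are performance tuning choices that do nothing for recovery. Thus, Option C is the correct choice: it is the only configuration that ensures fault tolerance for the stateful operation.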
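As a minimal sketch of the checkpointing setup described above, assuming PySpark: the function name `start_running_count`, the built-in `rate` test source, the `value % 5` key, and the HDFS path in the usage comment are all illustrative assumptions, not part of the original question.

```python
def start_running_count(spark, checkpoint_dir):
    """Start a streaming running count by key with checkpointing enabled.

    `spark` is expected to be a pyspark.sql.SparkSession.
    `checkpoint_dir` should point at reliable storage (for example an HDFS path).
    """
    # Illustrative source: the built-in "rate" source emits (timestamp, value) rows.
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 10)
        .load()
        .selectExpr("value % 5 AS key")  # hypothetical key derived for the demo
    )

    # Stateful operation: a running count of events per key.
    counts = events.groupBy("key").count()

    # checkpointLocation is where Spark persists progress and operator state,
    # so a restarted query can resume with its counts intact.
    return (
        counts.writeStream
        .outputMode("update")
        .format("console")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


# Usage (requires a Spark installation; the path below is hypothetical):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("event-counts").getOrCreate()
#   query = start_running_count(spark, "hdfs:///checkpoints/event-counts")
#   query.awaitTermination()
```

Restarting the same query with the same checkpoint directory is what lets it pick up where it left off; pointing it at a new, empty directory would start the counts from scratch.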
In a distributed computing environment, a data engineer is setting up a streaming data pipeline with Apache Spark Structured Streaming. The pipeline includes a stateful operation that maintains a running count of events by key. Which configuration is crucial for ensuring fault tolerance for this stateful operation without giving up performance?
A. Broadcasting join tables to all executors to reduce shuffle during state updates.
B. Using a stateful operation that stores state in local memory for faster access.
C. Configuring checkpointing to HDFS to ensure fault tolerance.
D. Disabling write-ahead logs to increase the throughput of the streaming application.
E. Increasing the number of shuffle partitions to maximize parallelism.