
**Answer-first summary for fast verification**

**Answer:** `trigger(availableNow=True)`
## Explanation

In Apache Spark Structured Streaming, the `trigger()` method controls when the streaming query executes micro-batches. The `availableNow=True` trigger option is specifically designed to process all data available at the start of the query, in as many micro-batches as required, and then terminate the query.

### Why option B is correct

- **`trigger(availableNow=True)`**: processes everything available when the query starts, in one or more micro-batches, then stops the query. It is ideal when you want to process all existing data without running a continuous streaming job.

### Why the other options are incorrect

- **A. `processingTime(1)`**: `processingTime` is not a standalone method on the writer; even expressed as `trigger(processingTime="1 second")`, it would create a continuous streaming job that fires on an interval rather than processing the available data once and stopping.
- **C. `trigger(parallelBatch=True)`**: not a valid trigger option in Spark Structured Streaming.
- **D. `trigger(processingTime="once")`**: invalid syntax; `processingTime` expects a time interval string such as `"1 second"`.
- **E. `trigger(continuous="once")`**: invalid; continuous mode expects a checkpoint interval string such as `"1 second"`.

### Key points

- `availableNow=True` is designed for batch-like processing of all currently available data.
- It splits the work into multiple micro-batches, so a large backlog does not have to fit into a single batch.
- The query stops automatically after processing all the data.
- This differs from the deprecated `trigger(once=True)`, which processed all available data in a single batch.
- It is particularly useful for incremental pipelines that periodically process all new data since the last run.
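To build intuition for the semantics described above, here is a minimal pure-Python sketch (not the Spark API; the function name `run_available_now` and the `max_batch_size` parameter are illustrative only) of how an availableNow-style trigger drains a source in bounded micro-batches and then stops:

```python
# Toy simulation of availableNow trigger semantics (plain Python, not Spark):
# snapshot the data available at query start, process it in bounded
# micro-batches, then terminate. All names here are hypothetical.

def run_available_now(source, max_batch_size):
    """Drain all records present at start in micro-batches, then stop."""
    snapshot = list(source)  # the data available "now", at query start
    batches = []
    for i in range(0, len(snapshot), max_batch_size):
        batches.append(snapshot[i:i + max_batch_size])  # one micro-batch
    return batches  # the "query" ends after the last micro-batch

batches = run_available_now(range(10), max_batch_size=4)
# All 10 records are processed, split across 3 micro-batches: 4 + 4 + 2.
```

The key property mirrored here is that the total volume of work is fixed at start time, while the batch size bounds how much is handled per micro-batch; a continuous `processingTime` trigger, by contrast, would keep polling for new data indefinitely.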
Author: Keng Suppaseth
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:
```python
(spark.readStream
    .table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .______
    .table("new_sales")
)
```
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
- A. `processingTime(1)`
- B. `trigger(availableNow=True)`
- C. `trigger(parallelBatch=True)`
- D. `trigger(processingTime="once")`
- E. `trigger(continuous="once")`