
Answer-first summary for fast verification
Answer: trigger(availableNow=True)
## Explanation

The correct answer is **B. trigger(availableNow=True)**.

### Why this is correct:
1. **availableNow=True** is a trigger option introduced in Spark 3.3.0 that processes all data currently available in the source and then stops the streaming query.
2. It processes the data in multiple micro-batches as needed, which matches the requirement "in as many batches as required."
3. This trigger is specifically designed for processing all currently available data and then stopping, unlike processing-time triggers, which run indefinitely.

### Why the other options are incorrect:
- **A. processingTime(1)**: Not valid syntax on its own; even the valid form `.trigger(processingTime="1 second")` would fire a micro-batch every second indefinitely rather than stop once the available data is processed.
- **C. trigger(parallelBatch=True)**: `parallelBatch` is not a valid trigger option in Spark Structured Streaming.
- **D. trigger(processsingTime="once")**: `processingTime` is misspelled here, and "once" is not a valid processing-time interval. The one-shot trigger is `.trigger(once=True)`, but it processes all available data in a single batch, not "in as many batches as required."
- **E. trigger(continuous="once")**: Continuous triggers take a checkpoint interval (e.g. `continuous="1 second"`), not "once", and a continuous query runs indefinitely.

### Key Concept:
The `availableNow=True` trigger is ideal when you want to process all currently available data from a source (such as a backlog) with streaming semantics but don't want the query to run indefinitely. It's particularly useful for:
- Processing historical data
- Catching up on backlogged data
- One-time data migration tasks
- Batch-like processing with streaming semantics
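The availableNow semantics — drain everything currently available in as many micro-batches as needed, then stop — can be sketched in plain Python. This is a toy model of the batching behavior, not Spark's implementation; `run_available_now` and `max_batch_size` are hypothetical names:

```python
from collections import deque

def run_available_now(source, max_batch_size):
    """Drain everything currently in `source` in micro-batches, then stop.

    Mimics trigger(availableNow=True): the query may take several batches,
    but it ends once the currently available backlog is consumed.
    """
    queue = deque(source)
    batches = []
    while queue:
        # Take up to max_batch_size records per micro-batch.
        size = min(max_batch_size, len(queue))
        batches.append([queue.popleft() for _ in range(size)])
    return batches

batches = run_available_now(range(10), max_batch_size=4)
# 10 records drained in 3 batches: [0..3], [4..7], [8, 9] — then the "query" stops.
```

Contrast this with a processing-time trigger, which would keep polling the source on a fixed interval even after the backlog is empty.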
Author: Keng Suppaseth
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
(spark.readStream
.table("sales")
.withColumn("avg_price", col("sales") / col("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
.______
.table("new_sales")
)
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
A. processingTime(1)
B. trigger(availableNow=True)
C. trigger(parallelBatch=True)
D. trigger(processsingTime="once")
E. trigger(continuous="once")
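With option B, the completed pipeline would look like the following. This is a sketch of the question's code with the blank filled in, not runnable on its own: it assumes a live SparkSession named `spark`, a defined `checkpointPath`, and an existing `sales` table.

```python
from pyspark.sql.functions import col

(spark.readStream
    .table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    # Process the full backlog in as many micro-batches as needed, then stop.
    .trigger(availableNow=True)
    .table("new_sales")
)
```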