
Explanation:
To see why, consider trigger intervals in Structured Streaming. The trigger setting on a streaming write controls when each micro-batch runs. If no trigger is specified, Spark starts the next micro-batch as soon as the previous one finishes. To run exactly one micro-batch that processes all data available at that moment and then stop, use trigger(once=True), which is option D. The other options are not valid: processingTime expects an interval string such as "5 seconds" rather than "once", continuous expects a checkpoint interval and selects continuous processing rather than a single batch, and processingTime is a keyword argument of trigger, not a method that can be chained on its own.
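As a rough illustration of the single-batch trigger described above, the sketch below wraps the write side of the question's pipeline in a small helper. The function name start_single_batch_write and its parameters are hypothetical, introduced only for this example. It assumes the caller passes an already aggregated streaming DataFrame, and it uses toTable, the method name in open-source PySpark 3.1 and later, in place of the .table(...) call shown in the question.

```python
from typing import Any


def start_single_batch_write(aggregated_df: Any, checkpoint_path: str, target_table: str) -> Any:
    """Start a streaming write that runs one micro-batch and then stops.

    Hypothetical helper for illustration: `aggregated_df` is assumed to be a
    streaming DataFrame that has already been grouped and aggregated.
    """
    return (
        aggregated_df.writeStream
        # The checkpoint records progress so a later run resumes from here.
        .option("checkpointLocation", checkpoint_path)
        # "complete" rewrites the full aggregate result on each batch.
        .outputMode("complete")
        # Process everything currently available in one micro-batch, then stop.
        .trigger(once=True)
        # Assumption: open-source PySpark 3.1+ name for writing to a table.
        .toTable(target_table)
    )
```

With a live Spark session, the input could come from something like spark.readStream.table("sales") followed by the groupBy and agg steps from the question. The helper returns the started StreamingQuery, so calling awaitTermination() on it should block until the single batch has finished.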
A data engineer has set up a Structured Streaming job to read from a table, aggregate the data, and then perform a streaming write into a new table. The code block used is as follows:
(spark.table("sales")
    .groupBy("store")
    .agg(sum("sales").alias("sum_sales"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .______
    .table("aggregatedSales")
)
If the goal is to execute only a single micro-batch to process all available data, which line of code should fill in the blank?
A. trigger(continuous="once")
B. processingTime("once")
C. trigger(processingTime="once")
D. trigger(once=True)
E. processingTime(1)