
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:
(spark.readStream
.table("sales")
.withColumn("avg_price", col("sales") / col("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
.______
.table("new_sales")
)
If the data engineer wants the query to process all of the available data in as many batches as required and then stop, which of the following lines of code should the data engineer use to fill in the blank?
A. processingTime(1)
B. trigger(availableNow=True)
C. trigger(parallelBatch=True)
D. trigger(processingTime="once")
E. trigger(continuous="once")
Explanation:
Correct answer: B.
In Apache Spark Structured Streaming, the trigger() method controls when the streaming query executes micro-batches. The availableNow=True trigger option is specifically designed for processing all available data in one or more batches and then terminating the query.
- trigger(availableNow=True): processes all data available at the start of the query in as many micro-batches as required, then stops the query. It is ideal when you want to process all existing data without leaving a continuous streaming job running.
- processingTime(1): not a valid line on its own; a processing-time trigger (e.g., trigger(processingTime="1 second")) creates a continuous streaming job that fires every interval rather than processing the available data once and stopping.
- trigger(parallelBatch=True): not a valid trigger option in Spark Structured Streaming.
- trigger(processingTime="once"): not valid syntax; processingTime expects a time-interval string such as "1 second".
- trigger(continuous="once"): not a valid trigger option; continuous mode requires a checkpoint-interval string such as "1 second".
Note that availableNow=True supersedes the deprecated once=True trigger, which processed all available data in a single batch rather than in multiple batches.