
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
(spark.readStream
    .table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .______
    .table("new_sales")
)
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
A
processingTime(1)
B
trigger(availableNow=True)
C
trigger(parallelBatch=True)
D
trigger(processingTime="once")
E
trigger(continuous="once")
Explanation:
The correct answer is B, trigger(availableNow=True).
This trigger processes all of the data available when the query starts, in as many micro-batches as required, and then stops the query. It is ideal when you want to clear a backlog of source data in a streaming fashion (with checkpointing and exactly-once guarantees) without the query running indefinitely. Unlike trigger(once=True), which consumes all available data in a single batch, availableNow=True respects rate-limit options such as maxFilesPerTrigger, so the work can be split across multiple batches, which is exactly what the question asks for. The other options are incorrect: trigger(processingTime=...) runs the query indefinitely at a fixed interval, while parallelBatch and continuous="once" are not valid trigger arguments.
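A minimal sketch of the completed pipeline, assuming an active SparkSession named spark, an existing sales table, and a checkpointPath variable as in the question. Note that in open-source PySpark the terminal call on a DataStreamWriter is .toTable(), which also starts the query:

```python
from pyspark.sql.functions import col

query = (spark.readStream
    .table("sales")                                        # stream from the source table
    .withColumn("avg_price", col("sales") / col("units"))  # derive the new column
    .writeStream
    .option("checkpointLocation", checkpointPath)          # required for fault tolerance
    .outputMode("complete")
    .trigger(availableNow=True)                            # process all available data, then stop
    .toTable("new_sales"))                                 # starts the streaming write

query.awaitTermination()  # returns once the available data has been processed
```

Because of the availableNow trigger, awaitTermination() returns after the backlog is consumed instead of blocking forever, so the same job can be scheduled to run periodically.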