
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
(spark.readStream
    .table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .______
    .table("new_sales")
)
If the data engineer wants the query to process all of the available data in as many batches as required and then stop, which of the following lines of code should the data engineer use to fill in the blank?
A. processingTime(1)
B. trigger(availableNow=True)
C. trigger(parallelBatch=True)
D. trigger(processingTime="once")
E. trigger(continuous="once")
Explanation:
The correct answer is B, trigger(availableNow=True).

The availableNow=True trigger processes all data that is available when the query starts, in one or more micro-batches, and then stops. It is ideal when you want to work through a backlog of source data in a streaming fashion but do not want the query to run indefinitely. By contrast, trigger(once=True) processes all available data in a single batch, which may not scale for large backlogs; trigger(processingTime=...) expects an interval string such as "5 seconds" and keeps the query running continuously; and parallelBatch and continuous="once" are not valid trigger arguments.
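For reference, a minimal sketch of the completed job with the blank filled in. This assumes a running SparkSession (`spark`), an existing `sales` source table, and a hypothetical writable `checkpointPath`; it mirrors the question's code rather than a production recipe:

```python
from pyspark.sql.functions import col

# Hypothetical checkpoint directory; any durable, writable path works.
checkpointPath = "/tmp/checkpoints/new_sales"

(spark.readStream
    .table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .trigger(availableNow=True)  # process all currently available data, then stop
    .table("new_sales")
)
```

Because the checkpoint records progress, rerunning the same query later picks up only the data that arrived since the previous run, which makes availableNow=True a common pattern for incremental batch-style jobs scheduled on a cadence.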