
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:
(spark.table("sales")
.withColumn("avg_price", col("sales") / col("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
_____
.table("new_sales"))
If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?
A. trigger("5 seconds")
B. trigger()
C. trigger(once="5 seconds")
D. trigger(processingTime="5 seconds")
E. trigger(continuous="5 seconds")
Correct Answer: D

Explanation:
In Apache Spark Structured Streaming, the trigger() method specifies how often the streaming query should process data. There are three main trigger types:

- processingTime: runs a micro-batch at a fixed interval (e.g., every 5 seconds)
- once / availableNow: processes all available data in a single run, then stops
- continuous: an experimental low-latency mode that processes data continuously rather than in micro-batches
For this question, the data engineer wants to execute a micro-batch every 5 seconds. The correct syntax for a processing time trigger is:
trigger(processingTime="5 seconds")
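To see the processingTime trigger in action, here is a minimal self-contained sketch. It uses Spark's built-in rate source and console sink so it runs anywhere; the column name doubled and the app name are illustrative, not from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()

# The rate source generates a timestamped row counter, so no input table is needed
stream = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .withColumn("doubled", col("value") * 2))

query = (stream.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="5 seconds")  # fire one micro-batch every 5 seconds
    .start())

query.awaitTermination(20)  # let a few micro-batches run, then return
query.stop()

Running this prints a new micro-batch to the console roughly every 5 seconds, which is exactly the behavior the question asks for.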
Let's analyze each option:

A. trigger("5 seconds") is invalid: trigger() does not accept a bare interval string as a positional argument.
B. trigger() is invalid: trigger() requires exactly one named argument specifying the trigger type.
C. trigger(once="5 seconds") is invalid: the once trigger takes a boolean (once=True) and processes all available data in a single batch; it does not accept an interval.
D. trigger(processingTime="5 seconds") is correct: it runs a micro-batch at a fixed 5-second interval.
E. trigger(continuous="5 seconds") uses the valid trigger(continuous="...") format, but this would process data continuously (the interval is only the checkpoint interval), not in micro-batches every 5 seconds.

Key Points:
- The processingTime trigger is used for fixed-interval micro-batch processing.
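To make the contrast between the options concrete, here is a short sketch of the three trigger forms applied to a DataStreamWriter. The name writer is a stand-in for any fully configured writer, such as the one in the question:

# Fixed-interval micro-batches: one batch every 5 seconds (the correct choice here)
writer.trigger(processingTime="5 seconds")

# One-shot processing: consume everything currently available, then stop
writer.trigger(availableNow=True)  # trigger(once=True) on older Spark versions

# Experimental continuous processing with a 5-second checkpoint interval
writer.trigger(continuous="5 seconds")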