
## Answer

`trigger(processingTime="5 seconds")`
## Explanation

In Apache Spark Structured Streaming, the `trigger()` method specifies how often the streaming query should process data. There are three main trigger types:

1. **ProcessingTime trigger**: processes data at fixed time intervals (micro-batches)
2. **Once trigger**: processes all available data once and then stops
3. **Continuous trigger**: processes data continuously (experimental feature)

For this question, the data engineer wants to execute a micro-batch every 5 seconds. The correct syntax for a processing-time trigger is:

```python
trigger(processingTime="5 seconds")
```

Let's analyze each option:

- **A. `trigger("5 seconds")`**: incorrect because the trigger method requires naming the trigger type explicitly (e.g. `processingTime=`).
- **B. `trigger()`**: uses the default trigger, which starts a new micro-batch as soon as the previous one completes, not every 5 seconds.
- **C. `trigger(once="5 seconds")`**: incorrect syntax; the once trigger is written `trigger(once=True)` and does not take a time interval.
- **D. `trigger(processingTime="5 seconds")`**: **CORRECT**. This specifies a processing-time trigger that executes a micro-batch every 5 seconds.
- **E. `trigger(continuous="5 seconds")`**: the `trigger(continuous="...")` form is valid syntax, but it enables continuous processing, which handles records continuously (the interval only controls checkpoint frequency) rather than running micro-batches every 5 seconds.

**Key points**:

- The `processingTime` trigger is used for fixed-interval micro-batch processing.
- The interval is specified as a duration string such as `"5 seconds"` or `"1 minute"`.
- This trigger ensures the query processes data in micro-batches at the specified interval, which is exactly what the data engineer needs.
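As a conceptual sketch of what a fixed-interval trigger does, the schedule below computes when micro-batches fire. This is plain Python with no Spark dependency; the function name `micro_batch_fire_times` is hypothetical, chosen only for illustration.

```python
def micro_batch_fire_times(start, interval_s, n):
    """Return the first n scheduled fire times (in seconds) for a
    fixed-interval (processingTime-style) trigger starting at `start`."""
    return [start + i * interval_s for i in range(n)]

# With a 5-second interval, micro-batches are scheduled at t = 0, 5, 10, ...
print(micro_batch_fire_times(0, 5, 4))  # → [0, 5, 10, 15]
```

If a batch takes longer than the interval, Structured Streaming starts the next micro-batch as soon as the previous one finishes rather than queuing overlapping batches.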
Author: Keng Suppaseth
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:
```python
(spark.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointpath)
    .outputMode("complete")
    _____
    .table("new_sales"))
```
If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?
- **A.** `trigger("5 seconds")`
- **B.** `trigger()`
- **C.** `trigger(once="5 seconds")`
- **D.** `trigger(processingTime="5 seconds")`
- **E.** `trigger(continuous="5 seconds")`
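Putting it together, the completed query from the question looks like the sketch below, with the blank filled by the processing-time trigger. The pyspark import is deferred inside a helper function (a hypothetical name, `start_avg_price_query`) so the snippet is only a definition; actually starting the query assumes an active `SparkSession` named `spark` and a `checkpointpath` variable, as in the question.

```python
def start_avg_price_query(spark, checkpointpath):
    """Sketch of the completed streaming write with a 5-second
    processing-time trigger; requires pyspark when called."""
    from pyspark.sql.functions import col  # deferred so the definition runs without pyspark

    return (spark.table("sales")
            .withColumn("avg_price", col("sales") / col("units"))
            .writeStream
            .option("checkpointLocation", checkpointpath)
            .outputMode("complete")
            .trigger(processingTime="5 seconds")  # micro-batch every 5 seconds
            .table("new_sales"))
```

The returned object is the running `StreamingQuery`, which can later be stopped with its `stop()` method.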