
Answer-first summary for fast verification
Answer: .withWatermark("order_timestamp", "30 minutes")
The `pyspark.sql.DataFrame.withWatermark` method defines an event-time watermark on a timestamp column, telling Spark how long to retain aggregation state so that late-arriving records can still be included. With a threshold of "30 minutes" on `order_timestamp`, records arriving up to 30 minutes late are still aggregated, and state older than that is dropped. For more details, refer to the [Apache Spark documentation](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.withWatermark.html).
Author: LeetQuiz Editorial Team
A data engineer is working on a streaming query that processes orders data. The query is missing a crucial part for handling late-arriving data, specifically to maintain the streaming state information for 30 minutes. The query snippet is as follows:
spark.readStream
    .table("orders_cleaned")
    ____________________________
    .groupBy(
        "order_timestamp",
        "author")
    .agg(
        count("order_id").alias("orders_count"),
        avg("quantity").alias("avg_quantity"))
    .writeStream
    .option("checkpointLocation", "dbfs:/path/checkpoint")
    .table("orders_stats")
Which option correctly fills in the blank to meet the requirement of handling late-arriving data by maintaining the streaming state information for 30 minutes?
A
.trigger(processingTime="30 minutes")
B
.awaitTermination("order_timestamp", "30 minutes")
C
.awaitWatermark("order_timestamp", "30 minutes")
D
.withWatermark("order_timestamp", "30 minutes")
E
.window("order_timestamp", "30 minutes")