
Explanation:
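The pyspark.sql.functions.window function buckets rows into one or more time windows based on a timestamp column. Because the query passes only a window duration of 5 minutes and no slide duration, the windows are tumbling: orders are aggregated by order_timestamp (and by author) into non-overlapping five-minute intervals. The pyspark.sql.DataFrame.withWatermark method sets a 10-minute watermark on order_timestamp, so the engine keeps incremental state for each window long enough to accept records that arrive up to 10 minutes late. Option C is therefore the correct answer.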
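The two durations in the explanation above can be illustrated with a small plain-Python sketch. It is a simplified model for intuition only, not Spark's implementation: the helper names tumbling_window and is_too_late are hypothetical, and the lateness rule (a record is dropped once its window ends at or before the watermark, where the watermark is the latest event time seen minus the delay) is an assumption about how the engine behaves in the common case.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)      # from window("order_timestamp", "5 minutes")
WATERMARK = timedelta(minutes=10)  # from withWatermark("order_timestamp", "10 minutes")
EPOCH = datetime(1970, 1, 1)


def tumbling_window(event_time, size=WINDOW):
    """Return the single [start, end) window containing event_time.

    With no slide duration, each timestamp maps to exactly one window,
    which is why the intervals do not overlap.
    """
    offset = (event_time - EPOCH) % size
    start = event_time - offset
    return start, start + size


def is_too_late(event_time, max_event_time_seen, delay=WATERMARK):
    """Simplified lateness check (assumed rule, see the lead-in).

    The watermark trails the latest event time seen by `delay`; a record
    is treated as too late once its window has ended at or before it.
    """
    watermark = max_event_time_seen - delay
    _, window_end = tumbling_window(event_time)
    return window_end <= watermark


# An order stamped 12:07 belongs to the 12:05-12:10 window.
order_time = datetime(2024, 1, 1, 12, 7)
start, end = tumbling_window(order_time)
print(start.time(), end.time())

# Latest event seen is 12:18, so the watermark is 12:08: still accepted.
print(is_too_late(order_time, datetime(2024, 1, 1, 12, 18)))

# Latest event seen is 12:21, so the watermark is 12:11: window state is gone.
print(is_too_late(order_time, datetime(2024, 1, 1, 12, 21)))
```

The window size decides which bucket a record lands in, while the watermark only decides how long that bucket stays open for late records, which is why the two durations in the answer options must not be swapped.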
Given the following streaming query in Databricks, which statement best describes its functionality?
spark.readStream \
    .table("orders_cleaned") \
    .withWatermark("order_timestamp", "10 minutes") \
    .groupBy(
        window("order_timestamp", "5 minutes").alias("time"),
        "author"
    ) \
    .agg(
        count("order_id").alias("orders_count"),
        avg("quantity").alias("avg_quantity")
    ) \
    .writeStream \
    .option("checkpointLocation", "dbfs:/path/checkpoint") \
    .table("orders_stats")
A. It computes business-level aggregates for each non-overlapping ten-minute interval, maintaining incremental state information for 5 minutes to accommodate late-arriving data.
B. It computes business-level aggregates for each overlapping five-minute interval, maintaining incremental state information for 10 minutes to accommodate late-arriving data.
C. It computes business-level aggregates for each non-overlapping five-minute interval, maintaining incremental state information for 10 minutes to accommodate late-arriving data.
D. It computes business-level aggregates for each overlapping ten-minute interval, maintaining incremental state information for 5 minutes to accommodate late-arriving data.
E. None of the above statements accurately describe this query.