Ultimate access to all questions.
A junior data engineer is developing a streaming pipeline to calculate average humidity and temperature per device in 5-minute non-overlapping windows. Given a streaming DataFrame df
with schema "device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
, which line of code correctly completes this aggregation?
df.withWatermark("event_time", "10 minutes")
.groupBy(
window("event_time", "5 minutes"),
"device_id"
)
.agg(
avg("temp").alias("avg_temp"),
avg("humidity").alias("avg_humidity")
)
.writeStream
.format("delta")
.saveAsTable("sensor_avg")