A junior data engineer is developing a streaming data pipeline to perform grouped aggregations on DataFrame df. The pipeline must compute the average humidity and average temperature for each device in non-overlapping five-minute intervals, with events recorded every minute per device.

Streaming DataFrame df has the schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

The following code block contains syntax errors and typos. Correct them and fill in the missing logic to achieve the desired aggregation:

df.withWatermark("event_time", "10 minutes")  
   .groupBy(  
       "device_id",  
       window("event_time", "5 minutes")  
   )  
   .agg(  
       avg("temp").alias("avg_temp"),  
       avg("humidity").alias("avg_humidity")  
   )  
   .writeStream  
   .format("delta")  
   .saveAsTable("sensor_avg")

Choose the correct option to complete the missing logic in the code block.

to_interval("event_time", "5 minutes").alias("time")

9.2%

window("event_time", "5 minutes").alias("time")

70.4%

"event_time"

6.1%

window("event_time", "10 minutes").alias("time")

10.2%

lag("event_time", "10 minutes").alias("time")

4.1%
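
To see why a five-minute window matches the "non-overlapping five-minute intervals" requirement, here is a minimal plain-Python sketch of tumbling-window bucketing. It is not Spark code and not the exam's reference solution: the helper names `window_start` and `tumbling_averages` and the sample events are made up for illustration, and watermarking and late data are ignored. It assumes, as Spark's `window` does by default, that windows are aligned to the Unix epoch and are half-open, [start, end).

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # tumbling window width from the question
EPOCH = datetime(1970, 1, 1)    # assumed alignment origin for window boundaries


def window_start(event_time, width=WINDOW):
    """Floor a timestamp to the start of the tumbling window that contains it."""
    elapsed = event_time - EPOCH
    return EPOCH + (elapsed // width) * width


def tumbling_averages(events, width=WINDOW):
    """Average temp and humidity per (device_id, window start).

    `events` is an iterable of (device_id, event_time, temp, humidity) tuples,
    mirroring the schema in the question.
    """
    # key -> [temp sum, humidity sum, event count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for device_id, event_time, temp, humidity in events:
        acc = sums[(device_id, window_start(event_time, width))]
        acc[0] += temp
        acc[1] += humidity
        acc[2] += 1
    return {key: (t / n, h / n) for key, (t, h, n) in sums.items()}


# Hypothetical sample: one event per minute for device 1, from 12:00 to 12:09.
base = datetime(2024, 1, 1, 12, 0)
events = [(1, base + timedelta(minutes=m), 20.0 + m, 50.0) for m in range(10)]
result = tumbling_averages(events)
```

With one event per minute, each device contributes five events to every window, so the ten sample events land in exactly two buckets (starting at 12:00 and 12:05). A ten-minute width would merge them into one bucket, which is why the ten-minute option does not meet the stated requirement.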

Databricks Certified Data Engineer - Professional