Databricks Certified Data Engineer - Professional

A junior data engineer is developing a streaming data pipeline to perform grouped aggregations on a DataFrame df. Each device records one event per minute, and the pipeline must compute the average humidity and average temperature for each device over non-overlapping five-minute intervals.

Streaming DataFrame df has the schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
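To make the requirement concrete, here is a minimal plain-Python sketch of what "non-overlapping five-minute intervals" means for rows shaped like the schema above. It is an illustration only, not Spark code and not the answer to the question: the names `window_start` and `tumbling_averages` and the sample rows are hypothetical, and the sketch assumes windows are half-open intervals [start, end) aligned to the Unix epoch, which is how Spark's `window` function behaves by default.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed window alignment


def window_start(event_time, width=WINDOW):
    """Floor event_time to the start of the tumbling window that contains it."""
    offset = (event_time - EPOCH) % width
    return event_time - offset


def tumbling_averages(rows, width=WINDOW):
    """Average temp and humidity per (device_id, window start).

    rows: iterable of (device_id, event_time, temp, humidity) tuples.
    Returns {(device_id, window_start): (avg_temp, avg_humidity)}.
    """
    # Accumulator per key: [temp sum, humidity sum, row count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for device_id, event_time, temp, humidity in rows:
        acc = sums[(device_id, window_start(event_time, width))]
        acc[0] += temp
        acc[1] += humidity
        acc[2] += 1
    return {key: (t / n, h / n) for key, (t, h, n) in sums.items()}


# Hypothetical sample: one device, one event per minute from 12:00 to 12:05.
base = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [(1, base + timedelta(minutes=i), 20.0 + i, 40.0) for i in range(6)]
for key, averages in sorted(tumbling_averages(rows).items()):
    print(key, averages)
```

Because every event falls into exactly one window, the 12:05 event starts a new bucket instead of being counted in the 12:00 window as well. That is the difference between a tumbling window and a sliding one.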

The following code block contains syntax errors and typos. Correct them and fill in the missing logic to achieve the desired aggregation:

df.withWatermark("event_time", "10 minutes")  
   .groupBy(  
       "device_id",  
       window("event_time", "5 minutes")  
   )  
   .agg(  
       avg("temp").alias("avg_temp"),  
       avg("humidity").alias("avg_humidity")  
   )  
   .writeStream  
   .format("delta")  
   .saveAsTable("sensor_avg")  

Choose the correct option to complete the missing logic in the code block.