A data engineer is working on a streaming query that processes orders data. The query is missing the step that handles late-arriving data: it must maintain the streaming state information for 30 minutes. The query snippet, with the missing step shown as a blank, is as follows:

spark.readStream
    .table("orders_cleaned")
    ____________________________
    .groupBy(
        "order_timestamp",
        "author")
    .agg(
        count("order_id").alias("orders_count"),
        avg("quantity").alias("avg_quantity"))
.writeStream
    .option("checkpointLocation", "dbfs:/path/checkpoint")
    .table("orders_stats")

Which option correctly fills in the blank to meet the requirement of handling late-arriving data by maintaining the streaming state information for 30 minutes?

.trigger(processingTime="30 minutes")

6.3%

.awaitTermination("order_timestamp", "30 minutes")

3.2%

.awaitWatermark("order_timestamp", "30 minutes")

5.1%

.withWatermark("order_timestamp", "30 minutes")

79.7%

.window("order_timestamp", "30 minutes")

5.7%
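
The most-voted option, .withWatermark("order_timestamp", "30 minutes"), matches the signature of Spark's DataFrame.withWatermark(eventTime, delayThreshold), which is applied before the groupBy so the engine knows how long to keep aggregation state for late events. To illustrate the idea, here is a minimal pure-Python sketch of event-time watermarking. It is a toy model under simplifying assumptions, not Spark's implementation: the class name WatermarkedCounter, its methods, and the sample events are all hypothetical, and it only counts orders per (order_timestamp, author) key.

```python
from datetime import datetime, timedelta


class WatermarkedCounter:
    """Toy model of a watermarked streaming count (hypothetical, not Spark code)."""

    def __init__(self, delay: timedelta):
        self.delay = delay            # how late an event may arrive, e.g. 30 minutes
        self.max_event_time = None    # latest event time seen so far
        self.state = {}               # (order_timestamp, author) -> running count
        self.finalized = {}           # keys whose state has been evicted

    @property
    def watermark(self):
        """Watermark = latest event time seen minus the allowed delay."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process_batch(self, events):
        """Process one micro-batch of (event_time, author); return dropped events."""
        # The threshold for this batch comes from data seen in earlier batches.
        threshold = self.watermark
        dropped = []
        for event_time, author in events:
            if threshold is not None and event_time < threshold:
                dropped.append((event_time, author))  # too late, ignored
                continue
            key = (event_time, author)
            self.state[key] = self.state.get(key, 0) + 1
            if self.max_event_time is None or event_time > self.max_event_time:
                self.max_event_time = event_time
        # After the batch, evict state that is older than the updated watermark.
        new_threshold = self.watermark
        for key in [k for k in self.state if k[0] < new_threshold]:
            self.finalized[key] = self.state.pop(key)
        return dropped


# Usage with made-up events and a 30-minute delay threshold.
t0 = datetime(2024, 1, 1, 12, 0)
counter = WatermarkedCounter(timedelta(minutes=30))
counter.process_batch([(t0, "alice")])
counter.process_batch([(t0 + timedelta(minutes=45), "bob")])
late = counter.process_batch([(t0, "alice")])  # 45 minutes behind the latest event
print("watermark:", counter.watermark)
print("dropped as too late:", late)
print("state still held:", counter.state)
```

The point of the sketch is that state is bounded: once the latest event time moves more than 30 minutes past a key's timestamp, that key's state can be discarded and later events for it are ignored. Without a watermark, a streaming aggregation has to keep state for every key indefinitely.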

Databricks Certified Data Engineer - Professional
