
A junior data engineer is testing a code block designed to fetch the most recent entry for each item in the 'sales' table since the last update. The code utilizes a window function partitioned by 'item_id' and ordered by 'item_time' in descending order. However, the execution fails. What is the primary reason for this failure?
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy("item_id").orderBy(F.col("item_time").desc())

ranked_df = (spark.readStream.table("sales")
             .withColumn("rank", F.rank().over(window))
             .filter("rank == 1")
             .drop("rank"))

display(ranked_df)
A
The query output cannot be displayed directly. The solution is to use spark.writeStream to save the query result.
B
The absence of watermarking prevents tracking state information for the time window, which is necessary for the operation.
C
Streaming DataFrames do not support non-time-based window operations. These operations must be executed within a foreachBatch logic.
D
The 'item_id' field lacks uniqueness. It's essential to deduplicate records on 'item_id' using the dropDuplicates function.
E
The 'item_id' field's non-uniqueness requires dropping the 'rank' column before applying the rank function to eliminate duplicate records.
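Independent of which option is correct, it can help to see what the query is trying to compute. The sketch below is a plain-Python equivalent of "rank over partition by item_id, ordered by item_time descending, keep rank 1" on a static batch of rows; the function name and sample data are hypothetical and not part of the question. (Note one simplification: rank() would keep all rows tied on the latest item_time, while this dict-based version keeps a single row per key.)

```python
# Plain-Python sketch of the query's intended result: for each item_id,
# keep the row with the greatest item_time. `latest_per_item` and the
# sample `sales` data are illustrative assumptions, not the Spark API.

def latest_per_item(rows):
    """rows: iterable of (item_id, item_time, value) tuples.
    Returns one most-recent row per item_id, mimicking
    rank().over(partitionBy("item_id").orderBy(desc("item_time"))) == 1.
    """
    latest = {}
    for item_id, item_time, value in rows:
        # Replace the stored row only when we see a strictly newer timestamp.
        if item_id not in latest or item_time > latest[item_id][1]:
            latest[item_id] = (item_id, item_time, value)
    return sorted(latest.values())


sales = [
    ("a", 1, 10.0),
    ("a", 3, 12.5),  # latest row for item "a"
    ("b", 2, 7.0),   # only (hence latest) row for item "b"
]
print(latest_per_item(sales))  # → [('a', 3, 12.5), ('b', 2, 7.0)]
```

On a bounded batch this logic is straightforward; the exam question turns on whether the same ranking window can run directly on an unbounded streaming DataFrame.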