
A junior data engineer is testing a code block designed to fetch the most recent entry for each item in the 'sales' table since the last update. The code utilizes a window function partitioned by 'item_id' and ordered by 'item_time' in descending order. However, the execution fails. What is the primary reason for this failure?
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy("item_id").orderBy(F.col("item_time").desc())

ranked_df = (spark.readStream.table("sales")
             .withColumn("rank", F.rank().over(window))
             .filter("rank == 1")
             .drop("rank"))

display(ranked_df)
A
The query output cannot be displayed directly. The solution is to use spark.writeStream to save the query result.
B
The absence of watermarking prevents tracking state information for the time window, which is necessary for the operation.
C
Streaming DataFrames do not support non-time-based window operations. These operations must be executed within a foreachBatch logic.
D
The 'item_id' field lacks uniqueness. It's essential to deduplicate records on 'item_id' using the dropDuplicates function.
E
The 'item_id' field's non-uniqueness requires dropping the 'rank' column before applying the rank function to eliminate duplicate records.
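Independent of which option is correct, it can help to see what the query is trying to compute. The sketch below is a plain-Python equivalent of "rank over partition by item_id, ordered by item_time descending, keep rank 1" on a static batch of rows; the function name and sample data are hypothetical and not part of the question. (Note one simplification: rank() would keep all rows tied on the latest item_time, while this dict-based version keeps a single row per key.)

```python
# Plain-Python sketch of the query's intended result: for each item_id,
# keep the row with the greatest item_time. `latest_per_item` and the
# sample `sales` data are illustrative assumptions, not the Spark API.

def latest_per_item(rows):
    """rows: iterable of (item_id, item_time, value) tuples.
    Returns one most-recent row per item_id, mimicking
    rank().over(partitionBy("item_id").orderBy(desc("item_time"))) == 1.
    """
    latest = {}
    for item_id, item_time, value in rows:
        # Replace the stored row only when we see a strictly newer timestamp.
        if item_id not in latest or item_time > latest[item_id][1]:
            latest[item_id] = (item_id, item_time, value)
    return sorted(latest.values())


sales = [
    ("a", 1, 10.0),
    ("a", 3, 12.5),  # latest row for item "a"
    ("b", 2, 7.0),   # only (hence latest) row for item "b"
]
print(latest_per_item(sales))  # → [('a', 3, 12.5), ('b', 2, 7.0)]
```

On a bounded batch this logic is straightforward; the exam question turns on whether the same ranking window can run directly on an unbounded streaming DataFrame.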