A data engineer needs to correlate advertisement impressions with user clicks by joining two streaming DataFrames. The impressions stream has a 10-minute watermark set on "event_time". The current implementation is:
impressions \
    .groupBy(
        window("event_time", "5 minutes"),
        "id") \
    .count() \
    .withWatermark("event_time", "2 hours") \
    .join(clicks, expr("clickAdId = impressionAdId"), "inner")
The query performance is degrading significantly. What solution would improve its performance?
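
One commonly suggested direction for a stream-stream join like this is to put watermarks on both inputs and add an event-time range to the join condition, so Spark can discard old join state instead of buffering it indefinitely. The sketch below is an illustration under assumptions, not the confirmed answer to this question: the column names `impressionTime` and `clickTime`, the 20-minute watermark on clicks, the 1-hour click window, and both function names are hypothetical. Only the join keys and the 10-minute impressions watermark come from the question.

```python
def time_bounded_join_condition(max_delay: str = "1 hour") -> str:
    """Build a join condition that matches a click to its impression only
    when the click occurs within `max_delay` after the impression.

    `impressionTime` and `clickTime` are assumed column names.
    """
    return (
        "clickAdId = impressionAdId AND "
        "clickTime >= impressionTime AND "
        f"clickTime <= impressionTime + interval {max_delay}"
    )


def join_impressions_with_clicks(impressions, clicks, max_delay="1 hour"):
    """Inner-join two streaming DataFrames with watermarks on both sides."""
    # Imported lazily so the condition builder above works without PySpark.
    from pyspark.sql.functions import expr

    # 10 minutes is taken from the question; 20 minutes is an assumption.
    impressions_wm = impressions.withWatermark("impressionTime", "10 minutes")
    clicks_wm = clicks.withWatermark("clickTime", "20 minutes")

    # The time range in the condition bounds how long rows stay in join state.
    return impressions_wm.join(
        clicks_wm, expr(time_bounded_join_condition(max_delay)), "inner"
    )
```

With both watermarks and the time bound in place, Spark can work out when a buffered impression can no longer match any future click and drop it from state.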