
A data engineer needs to correlate advertisement impressions with user clicks by joining two streaming DataFrames. The impressions stream has a 10-minute watermark set on "event_time". The current implementation is:
impressions \
    .groupBy(
        window("event_time", "5 minutes"),
        "id") \
    .count() \
    .withWatermark("event_time", "2 hours") \
    .join(clicks, expr("clickAdId = impressionAdId"), "inner")
The query performance is degrading significantly. What solution would improve its performance?
A
Joining on event time constraint: clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour
B
Joining on event time constraint: clickTime + 3 hours < impressionTime - 2 hours
C
Joining on event time constraint: clickTime == impressionTime using a leftOuter join
D
Joining on event time constraint: clickTime >= impressionTime - interval 3 hours and removing watermarks
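
As background for weighing the options, the sketch below is a toy pure-Python model of how much join state has to be kept, with and without an upper bound on the event-time difference. It is not Spark's implementation; the function `buffered_impressions`, the parameter `max_lag`, and the sample data are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def buffered_impressions(
    impression_times: List[datetime],
    click_watermark: datetime,
    max_lag: Optional[timedelta],
) -> List[datetime]:
    """Return the impressions a join would still have to keep in state.

    max_lag models the upper bound in a condition such as
    `clickTime <= impressionTime + interval ...`; None means the join
    has no time bound (equality on the ad id only).
    """
    if max_lag is None:
        # No time bound: any future click might match, so nothing can be dropped.
        return list(impression_times)
    # Future clicks have clickTime >= click_watermark, so they can only match
    # impressions with impressionTime + max_lag >= click_watermark.
    # Older impressions are safe to evict.
    return [t for t in impression_times if t + max_lag >= click_watermark]


start = datetime(2024, 1, 1, 0, 0)
# Hypothetical data: one impression every 10 minutes for 6 hours (36 rows).
impressions = [start + timedelta(minutes=10 * i) for i in range(36)]
# Assume the click-side watermark has advanced to 6 hours after the start.
watermark = start + timedelta(hours=6)

unbounded = buffered_impressions(impressions, watermark, None)
bounded = buffered_impressions(impressions, watermark, timedelta(hours=1))
print(len(unbounded), len(bounded))  # prints: 36 6
```

In this toy model the unbounded join keeps every impression it has ever seen, while the join bounded to one hour keeps only the most recent hour of impressions relative to the watermark.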