
Answer-first summary for fast verification
Answer: Use the 'join' function with a watermark to handle late data and specify the threshold for out-of-order events.
Option B is the best choice. Declaring a watermark on the streaming DataFrame's event-time column tells Spark how late a record may arrive and still be processed, so out-of-order events within that threshold are still enriched by the join. Note that the stream-static join itself keeps no state; the watermark bounds how long Spark waits for late records in the stateful steps of the query, such as aggregation or deduplication. Option A makes no provision for late data, option C adds stateful aggregation overhead that enrichment alone does not need, and option D risks incomplete results when state expires before late records arrive. A watermark therefore balances accuracy against scalability and performance in most scenarios.
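The watermark rule described above can be illustrated with a small plain-Python model. This is a conceptual sketch, not Spark's implementation: the function `enrich_batches`, the `STATIC_DEVICES` lookup, and the event fields are hypothetical names chosen for illustration, and the model assumes the watermark is the maximum event time seen in earlier batches minus the allowed delay. In Spark itself the corresponding calls are `withWatermark` on the streaming DataFrame followed by `join` against the static DataFrame.

```python
from datetime import datetime, timedelta

# Hypothetical static reference data: device_id -> location.
STATIC_DEVICES = {"d1": "Berlin", "d2": "Austin"}


def enrich_batches(batches, static_lookup, delay):
    """Enrich micro-batches with static data, dropping events behind the watermark.

    Simplified model (an assumption of this sketch):
    watermark = max event time seen in earlier batches - delay.
    Returns (enriched_events, dropped_events).
    """
    max_event_time = None
    enriched, dropped = [], []
    for batch in batches:
        # The watermark is fixed for the whole batch, based on earlier batches only.
        watermark = None if max_event_time is None else max_event_time - delay
        for event in batch:
            if watermark is not None and event["event_time"] < watermark:
                dropped.append(event)  # too late: older than the watermark
                continue
            # Inner-join semantics: keep only events with a matching static row.
            location = static_lookup.get(event["device_id"])
            if location is not None:
                enriched.append({**event, "location": location})
        # Advance the maximum event time after the batch has been processed.
        for event in batch:
            if max_event_time is None or event["event_time"] > max_event_time:
                max_event_time = event["event_time"]
    return enriched, dropped


t0 = datetime(2024, 1, 1, 12, 0)
batches = [
    [
        {"device_id": "d1", "event_time": t0},
        {"device_id": "d2", "event_time": t0 + timedelta(minutes=20)},
    ],
    [
        # 5 minutes behind the max event time: within the 10-minute threshold.
        {"device_id": "d1", "event_time": t0 + timedelta(minutes=15)},
        # 15 minutes behind the max event time: older than the watermark.
        {"device_id": "d2", "event_time": t0 + timedelta(minutes=5)},
        # On time, but has no match in the static data.
        {"device_id": "d3", "event_time": t0 + timedelta(minutes=25)},
    ],
]
enriched, dropped = enrich_batches(batches, STATIC_DEVICES, timedelta(minutes=10))
```

With a 10-minute threshold, the second batch is evaluated against a watermark of 12:10, so the out-of-order `d1` event at 12:15 is still enriched while the `d2` event at 12:05 is discarded. A larger threshold admits more late data at the cost of holding results open longer.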
Author: LeetQuiz Editorial Team
In the context of Spark Structured Streaming, you are tasked with implementing a stream-static join to enrich streaming data with static reference data. The solution must handle late data efficiently without compromising the accuracy of the join results. Considering the need for scalability and the handling of out-of-order events, which of the following approaches is the BEST to implement a stream-static join? Choose the correct option from the four provided.
A. Use the 'join' function to join the streaming DataFrame with the static DataFrame without any additional configurations.
B. Use the 'join' function with a watermark to handle late data and specify the threshold for out-of-order events.
C. Use the 'join' function with a stateful aggregation to continuously update the join results, which may increase the computational overhead.
D. Use the 'join' function with a state timeout to handle state expiration, which might lead to incomplete join results for late data.