
Answer-first summary for fast verification
Answer: C — Implementing watermarks to manage late data and windowed aggregations, allowing the system to wait a specified time for late data before updating the results.
1. **Watermarks**: In Structured Streaming, a watermark (`withWatermark`) sets a threshold for how late an event may arrive and still be processed; events arriving later than the threshold are dropped. This lets the engine wait a bounded, predictable time for stragglers.
2. **Windowed Aggregations**: Grouping data into event-time windows places each late event in the window it belongs to, so aggregates reflect when events actually occurred rather than when they arrived.
3. **Efficiency**: Because the watermark bounds how long window state must be retained, the engine can evict old state, keeping memory use and query performance stable as the stream runs.
4. **Data Integrity**: Results are updated as late data arrives within the watermark threshold, so analytics remain accurate and reliable without reprocessing or separate late-data pipelines.

Thus, using watermarks with windowed aggregations is the optimal strategy for managing late-arriving data in Spark on Databricks.
Author: LeetQuiz Editorial Team
When dealing with late-arriving data in a streaming data pipeline using Structured Streaming in Spark on Databricks, what strategy ensures accurate real-time analytics without affecting data integrity or query performance?
A
Creating a separate processing stream exclusively for late-arriving data, employing upserts to merge this data into the primary analytics tables.
B
Using a two-tier architecture where late data is first stored in a temporary buffer (Delta table) and periodically merged with the main dataset using batch processing.
C
Implementing watermarks to manage late data and windowed aggregations, allowing the system to wait for a specified time for late data before updating the results.
D
Leveraging Apache Kafka alongside Structured Streaming to reprocess data windows when late data is detected, ensuring completeness of data.