When handling late-arriving data in a Structured Streaming pipeline in Spark on Databricks, which strategy ensures accurate real-time analytics without compromising data integrity or query performance?
A
Creating a separate processing stream exclusively for late-arriving data, employing upserts to merge this data into the primary analytics tables.
B
Using a two-tier architecture where late data is first stored in a temporary buffer (a Delta table) and periodically merged into the main dataset with batch processing (see the MERGE sketch after the options).
C
Implementing watermarks with windowed aggregations to manage late data, allowing the system to wait a specified time for late events before finalizing results (see the watermark sketch after the options).
D
Leveraging Apache Kafka alongside Structured Streaming to reprocess data windows when late data is detected, ensuring data completeness.
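Option C describes Structured Streaming's built-in mechanism for late data. Below is a minimal PySpark sketch of that approach; the built-in rate source is used only so the example is self-contained, and the window sizes, watermark delay, and paths are illustrative assumptions, not values from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, count

spark = SparkSession.builder.appName("late-data-watermark").getOrCreate()

# Built-in rate source keeps the sketch self-contained; in practice this
# would be Kafka, Auto Loader, or another source with an event-time column.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumnRenamed("timestamp", "event_time")
)

# Watermark (option C): wait up to 10 minutes for late events. Anything
# arriving later is dropped, which bounds streaming state and keeps
# query performance predictable.
windowed_counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"))
    .agg(count("*").alias("event_count"))
)

# In append mode each window is emitted once, after the watermark passes
# its end, so the downstream Delta table only sees finalized results.
query = (
    windowed_counts.writeStream
    .outputMode("append")
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/late_data")  # illustrative path
    .start("/tmp/tables/windowed_counts")                        # illustrative path
)
```

Because the watermark bounds how long Spark retains window state, this approach balances completeness (late events within the tolerance still count) against integrity and performance (state cannot grow without limit), which is why it is the canonical answer here.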
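For contrast, the upsert pattern referenced in options A and B typically relies on a periodic batch MERGE from a staging Delta table into the main dataset. A rough sketch follows; the table paths and the `event_id` join key are hypothetical names for illustration.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Late events buffered in a staging Delta table (illustrative paths).
late_events = spark.read.format("delta").load("/tmp/tables/late_buffer")
main = DeltaTable.forPath(spark, "/tmp/tables/main_dataset")

# Periodic batch upsert: update rows that already exist, insert the rest.
(
    main.alias("m")
    .merge(late_events.alias("l"), "m.event_id = l.event_id")  # hypothetical key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

This pattern keeps the main table consistent, but it introduces batch latency and extra pipeline complexity, which is the trade-off the question is probing relative to option C's single-stream watermark approach.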