
Answer-first summary for fast verification
Answer: Use Apache Spark Streaming to create a real-time ETL pipeline, with appropriate data sources, transformations, and sinks to handle the time-series data efficiently, considering time-window operations and data aggregation.
Option B is correct. Apache Spark Streaming can run a real-time ETL pipeline that keeps up with the high velocity and variability of time-series data from IoT devices. The pipeline has three parts: a source such as Kafka or Kinesis to ingest the data, transformations that process it using time-window operations and aggregation, and a sink that stores or visualizes the results. The other options do not meet the real-time requirement stated in the question: batch processing at intervals (A), a traditional database system (C), and processing only a subset of the data (D).
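To make "time-window operations and data aggregation" concrete, here is a minimal plain-Python sketch of a tumbling-window average with a lateness cutoff. It is not Spark code: the `Reading` type, the `aggregate` function, and the 60-second window and 30-second lateness values are all assumptions chosen for illustration. In Spark, the `window` function and `withWatermark` play the corresponding roles, and Spark advances the watermark per micro-batch rather than per record as this sketch does.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical IoT reading; field names are illustrative."""
    device_id: str
    event_time: float  # seconds since epoch, assigned by the device
    value: float


def window_start(event_time: float, window_seconds: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return int(event_time // window_seconds) * window_seconds


def aggregate(readings, window_seconds=60, allowed_lateness=30):
    """Average value per (device, window), dropping events that arrive too late.

    The watermark trails the largest event time seen so far by
    allowed_lateness seconds. An event is discarded when its window
    ended at or before the watermark.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (device, window start) -> [sum, count]
    max_event_time = float("-inf")
    for r in readings:  # iterated in arrival order, not event-time order
        max_event_time = max(max_event_time, r.event_time)
        watermark = max_event_time - allowed_lateness
        start = window_start(r.event_time, window_seconds)
        if start + window_seconds <= watermark:
            continue  # too late: this window is already closed
        bucket = sums[(r.device_id, start)]
        bucket[0] += r.value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Grouping by event time rather than arrival time is the design choice that matters for IoT data: devices can deliver readings out of order, and the lateness cutoff bounds how long each window's state must be kept.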
Author: LeetQuiz Editorial Team
You are working on a data processing project that involves analyzing real-time streaming data from IoT devices. The data includes time-series data with high velocity and variability. Describe how you would use Apache Spark to create an ETL pipeline for this use case, and explain the considerations involved in handling time-series data.
A
Use Apache Spark's batch processing capabilities to process the time-series data at regular intervals, as real-time processing is not required.
B
Use Apache Spark Streaming to create a real-time ETL pipeline, with appropriate data sources, transformations, and sinks to handle the time-series data efficiently, considering time-window operations and data aggregation.
C
Use a traditional database system to store and process the time-series data, as it can handle high velocity and variability more effectively than Apache Spark.
D
Only process a subset of the time-series data to reduce the velocity and variability, as real-time processing of the entire dataset is not feasible.