
Answer-first summary for fast verification
Answer: Use Apache Spark Streaming to create a real-time ETL pipeline, with appropriate data sources, transformations, and sinks to handle the data efficiently.
Option B is the correct answer. Apache Spark Streaming (and its successor, Structured Streaming) can build a real-time ETL pipeline that handles the high velocity and volume of IoT device data. The pipeline should ingest from a streaming source such as Kafka or Kinesis, apply transformations to process and analyze the data, and write to sinks that store or visualize the results. Key design considerations are fault tolerance (e.g., checkpointing), scalability, and low latency. Batch processing, a traditional database, or processing only a subset of the data would not meet the real-time processing requirements.
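The source, transform, and sink stages described above can be sketched with PySpark's Structured Streaming API. This is a minimal illustration, not a production pipeline: the broker address, topic name, JSON schema, alert threshold, and output paths are all assumptions made for the example. The alerting rule is kept as a pure function so it can be tested without a Spark installation.

```python
# Sketch of a streaming ETL job for IoT readings, assuming a Kafka topic
# "iot-readings" carrying JSON payloads like {"device_id": ..., "temp_c": ...}.
# All names and thresholds below are illustrative assumptions.

ALERT_THRESHOLD_C = 85.0  # hypothetical over-temperature threshold


def is_alert(temp_c: float, threshold: float = ALERT_THRESHOLD_C) -> bool:
    """Pure alerting rule, separated out so it is unit-testable."""
    return temp_c >= threshold


def main() -> None:
    # PySpark imports live inside main() so the module can be imported
    # (and is_alert tested) on machines without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("iot-streaming-etl").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("temp_c", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Source: ingest from Kafka (broker/topic are assumed names).
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "iot-readings")
           .load())

    # Transform: parse the JSON payload and flag over-threshold readings.
    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("r"))
              .select("r.*")
              .withColumn("alert", F.col("temp_c") >= F.lit(ALERT_THRESHOLD_C)))

    # Sink: append to Parquet; the checkpoint location is what gives the
    # query fault tolerance (exactly-once sink semantics on restart).
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/data/iot/alerts")
             .option("checkpointLocation", "/data/iot/_checkpoints")
             .outputMode("append")
             .start())
    query.awaitTermination()


if __name__ == "__main__":
    main()
```

Note the `checkpointLocation` option: it addresses the fault-tolerance consideration by letting the query resume from its last committed offsets after a failure, while scaling out the Kafka topic's partitions addresses throughput.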
Author: LeetQuiz Editorial Team
Your company has a large dataset of IoT device data that needs to be processed in real-time for monitoring and alerting purposes. Describe how you would use Apache Spark to create a streaming ETL pipeline that can handle the high velocity and volume of data, and explain the considerations involved in designing such a pipeline.
A
Use Apache Spark's batch processing capabilities to process the data at regular intervals, as real-time processing is not required.
B
Use Apache Spark Streaming to create a real-time ETL pipeline, with appropriate data sources, transformations, and sinks to handle the data efficiently.
C
Use a traditional database system to store and process the data, as it can handle high velocity and volume more effectively than Apache Spark.
D
Only process a subset of the data to reduce the volume and velocity, as real-time processing of the entire dataset is not feasible.