
Answer-first summary for fast verification
Answer: Create a multi-stage ETL pipeline with intermediate data staging locations to handle the different data types and sources, and use Apache Spark to process the data in a distributed manner.
Option B is the correct answer. A multi-stage ETL pipeline with intermediate data staging locations is needed to handle the different data types and sources: raw data from each source is landed unchanged, then conformed to a common schema, then loaded for analysis. Apache Spark processes each stage in a distributed manner, so large volumes of fast-arriving data can be handled efficiently. Ignoring unstructured data (option C) or collapsing everything into a single-stage ETL process (option A) would not meet the stated requirements, and a traditional batch approach (option D) cannot keep up with the velocity and variety of the data.
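The multi-stage pipeline in option B can be sketched as follows. This is a minimal, illustrative example: the stage names, staging layout, and record shapes are assumptions, and the in-memory steps stand in for what would be distributed Spark jobs in production (e.g. `spark.read` for extraction, DataFrame transformations, and `DataFrame.write` for loading).

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Multi-stage ETL with intermediate staging locations (illustrative sketch).
# Each function below would be a distributed Spark job in production.

def extract(raw_sources: dict, staging: Path) -> Path:
    """Stage 1: land raw data from each source unchanged in a staging area."""
    landing = staging / "landing"
    landing.mkdir(parents=True, exist_ok=True)
    for source, records in raw_sources.items():
        (landing / f"{source}.json").write_text(json.dumps(records))
    return landing

def transform(landing: Path, staging: Path) -> Path:
    """Stage 2: conform each data type to a common record schema."""
    conformed = staging / "conformed"
    conformed.mkdir(parents=True, exist_ok=True)
    for path in landing.glob("*.json"):
        records = json.loads(path.read_text())
        normalized = [
            {"source": path.stem, "id": r.get("id"), "text": str(r.get("text", ""))}
            for r in records
        ]
        (conformed / path.name).write_text(json.dumps(normalized))
    return conformed

def load(conformed: Path) -> list:
    """Stage 3: merge conformed data into the warehouse (here, one list)."""
    warehouse = []
    for path in sorted(conformed.glob("*.json")):
        warehouse.extend(json.loads(path.read_text()))
    return warehouse

with TemporaryDirectory() as tmp:
    staging = Path(tmp)
    sources = {
        "transactions": [{"id": 1, "text": "order #1001"}],  # structured
        "reviews": [{"id": 2, "text": "Great product!"}],    # semi-structured
        "social": [{"id": 3, "text": "Loving this store"}],  # unstructured
    }
    landing = extract(sources, staging)
    conformed = transform(landing, staging)
    warehouse = load(conformed)
    print(len(warehouse))  # 3 records, one per source
```

Because each stage writes to its own staging location, a failed stage can be re-run without re-extracting from the sources, and new data types only require a new transform step rather than a redesign of the whole pipeline.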
Author: LeetQuiz Editorial Team
You are tasked with designing an ETL pipeline for a large e-commerce company that needs to process both structured and unstructured data from various sources. The data includes customer transactions, product reviews, and social media posts. Describe the steps you would take to create an ETL pipeline that can handle the volume, velocity, and variety of this data, and explain how you would use Apache Spark to process the data efficiently.
A
Use a single-stage ETL process to load all data into a data warehouse and then perform transformations and analysis.
B
Create a multi-stage ETL pipeline with intermediate data staging locations to handle the different data types and sources, and use Apache Spark to process the data in a distributed manner.
C
Only process structured data and ignore unstructured data due to the complexity of handling different data types.
D
Use a traditional batch processing approach to handle the data, as it is more cost-effective than using a distributed computing framework like Apache Spark.