
Answer-first summary for fast verification
Answer: Use Apache Spark to read the data from the relational database, perform necessary transformations, and store the transformed data in a distributed file system for further analysis.
Option B is the correct answer. Apache Spark can read the data from the relational database (typically via a JDBC connection), apply the necessary transformations such as filtering, aggregation, and joins, and write the transformed data to a distributed file system like HDFS or Amazon S3. This approach yields a resilient, scalable ETL pipeline that can handle large data volumes: Spark distributes the work across a cluster and can recompute lost partitions from lineage if a node fails. Loading the raw data directly into a data warehouse skips the required transformation step, a hand-written MapReduce-style job forgoes Spark's higher-level DataFrame API and in-memory execution, and applying machine learning libraries without first transforming the data does not address the task.
Author: LeetQuiz Editorial Team
Your company has a large dataset of customer transaction records stored in a relational database. You need to transform this data to analyze customer behavior patterns. Describe how you would use Apache Spark to process the data, including the steps involved in creating a resilient and scalable ETL pipeline.
A
Load the data directly into a data warehouse and perform SQL queries to analyze customer behavior patterns.
B
Use Apache Spark to read the data from the relational database, perform necessary transformations, and store the transformed data in a distributed file system for further analysis.
C
Use a MapReduce approach to process the data in Apache Spark, as it is more suitable for batch processing.
D
Use Apache Spark's machine learning libraries to directly predict customer behavior patterns without any data transformation.