
You are working on a data processing project that involves analyzing large volumes of log data from various sources. The data includes both structured and unstructured records with varying formats and schemas. Describe how you would use Apache Spark to create an ETL pipeline that can handle the diverse data types and formats, and explain the steps involved in the process.
A. Use Apache Spark's built-in functions to directly read and process the data from the sources, without any data transformation or schema definition.
B. Define a common schema for all the data sources and use Apache Spark to read, transform, and process the data according to the defined schema.
C. Use a custom data processing library to handle the diverse data types and formats, as Apache Spark is not suitable for this task.
D. Ignore the unstructured data and only process the structured data, as it is easier to handle and analyze.
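
The approach described in option B can be illustrated with a short PySpark sketch. This is a minimal example, not a definitive pipeline: the input paths, field names (`timestamp`, `level`, `message`), and the regex layout of the unstructured logs are all assumptions for illustration. It shows the three ETL steps: extract each source, transform everything onto a common schema, and load the conformed result.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-etl-sketch").getOrCreate()

# Common (target) schema for all sources: event_time, source, level, message.

# Extract: structured JSON logs (path is hypothetical).
json_logs = spark.read.json("s3://logs/json/")

# Extract: unstructured plain-text logs, parsed with a regex into the common
# fields. The "timestamp level message" layout is an assumption.
raw_logs = spark.read.text("s3://logs/raw/")
pattern = r"^(\S+ \S+) (\w+) (.*)$"
parsed = raw_logs.select(
    F.to_timestamp(F.regexp_extract("value", pattern, 1)).alias("event_time"),
    F.lit("raw").alias("source"),
    F.regexp_extract("value", pattern, 2).alias("level"),
    F.regexp_extract("value", pattern, 3).alias("message"),
)

# Transform: project the structured source onto the same schema, then union.
structured = json_logs.select(
    F.to_timestamp("timestamp").alias("event_time"),  # source field name assumed
    F.lit("json").alias("source"),
    F.col("level"),
    F.col("message"),
)
unified = structured.unionByName(parsed)

# Load: write the conformed data out, partitioned for downstream analysis.
unified.write.mode("overwrite").partitionBy("source").parquet("s3://warehouse/logs/")
```

Because every source is mapped onto one explicit schema before the union, downstream queries do not need to know which format a record originally came from, which is the main advantage of option B over the other choices.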