
Consider a scenario where you need to train a linear regression model on a dataset with billions of records using Spark. Which of the following best outlines the steps you would take to ensure that the training process is both efficient and scalable, covering data preprocessing, Spark configuration, and the use of Spark's MLlib functions?
A
Load the entire dataset into a single node and perform all computations sequentially.
B
Partition the dataset appropriately, configure Spark with an optimal number of executors and cores, leverage Spark's MLlib functions for data transformations and aggregations, and use caching where necessary to optimize performance (see the sketch after the options).
C
Use a small subset of the data to train the model to save time and resources.
D
Perform all data preprocessing and model training on a single, high-performance server.
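
Option B is the scalable approach: options A and D funnel billions of records through one machine, and option C trades model quality for speed by discarding data. Below is a minimal PySpark sketch of option B. It assumes a Parquet dataset at a hypothetical path s3://bucket/records.parquet with hypothetical numeric columns feature1, feature2, feature3 and a label column; the executor, core, memory, and partition counts are illustrative placeholders that would be tuned to the actual cluster and data volume.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Configure executors and cores; these values are illustrative,
# not recommendations for any particular cluster.
spark = (
    SparkSession.builder
    .appName("billion-row-linear-regression")
    .config("spark.executor.instances", "50")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "16g")
    .config("spark.sql.shuffle.partitions", "800")
    .getOrCreate()
)

# Hypothetical source path; Parquet keeps reads columnar and splittable.
df = spark.read.parquet("s3://bucket/records.parquet")

# Repartition so each task processes a manageable slice of the data.
df = df.repartition(800)

# MLlib estimators expect features assembled into a single vector column.
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],  # hypothetical columns
    outputCol="features",
)
train = assembler.transform(df).select("features", "label")

# Cache the training DataFrame: linear regression training is iterative
# and re-reads this data on every pass.
train.cache()

lr = LinearRegression(featuresCol="features", labelCol="label", maxIter=20)
model = lr.fit(train)
print(model.coefficients, model.intercept)

train.unpersist()
spark.stop()
```

The cache() call is where option B's "use caching where necessary" pays off: without it, each training iteration would re-scan the source Parquet files rather than reading the assembled features from cluster memory.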