
Consider a scenario where you are tasked with scaling a linear regression model using Apache Spark across a distributed environment. Describe how Spark distributes the computation of linear regression, including how it partitions large datasets, applies parallel processing, and uses optimization techniques to keep the computation efficient.
A
Spark uses a single node to process all data and performs linear regression sequentially.
B
Spark distributes the data across multiple nodes and uses parallel processing to compute the regression coefficients, leveraging techniques like gradient descent optimization and iterative algorithms to handle large datasets efficiently.
C
Spark only supports decision trees and does not scale linear regression.
D
Spark uses a centralized server to collect all data and then processes it using linear regression.
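Option B describes Spark's actual approach: the dataset is partitioned across worker nodes, each worker computes a partial gradient over its partition, and the driver aggregates those partial results to update the model in each iteration. The sketch below illustrates that map/reduce pattern in plain Python (no Spark cluster required); the partition list, learning rate, and toy data generated from y = 2x + 1 are all illustrative assumptions, not part of Spark's API.

```python
# Minimal sketch (plain Python, no Spark required) of the distributed
# gradient-descent pattern described in option B: each partition computes
# a partial gradient in parallel, and the driver aggregates and updates.

def partial_gradient(partition, w, b):
    # Each "worker" computes the gradient contribution of its partition
    # for squared-error loss on y ≈ w*x + b.
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition)

def distributed_gd(partitions, lr=0.1, epochs=2000):
    w = b = 0.0
    for _ in range(epochs):
        # "Map" step: per-partition gradients (run in parallel on a cluster).
        parts = [partial_gradient(p, w, b) for p in partitions]
        # "Reduce" step: the driver aggregates the partial results.
        n = sum(c for _, _, c in parts)
        gw = sum(g for g, _, _ in parts) / n
        gb = sum(g for _, g, _ in parts) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy data generated from y = 2x + 1, split across three "partitions".
data = [(k / 10, 2 * (k / 10) + 1) for k in range(30)]
partitions = [data[0:10], data[10:20], data[20:30]]
w, b = distributed_gd(partitions)
```

In a real Spark job the same pattern appears inside `pyspark.ml.regression.LinearRegression`, which trains over a partitioned DataFrame and aggregates per-partition gradient or statistics contributions on the driver in each iteration.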