
Consider a scenario where you are working with a large dataset and need to build a decision tree model using Spark MLlib. Describe the steps involved in scaling the decision tree algorithm in Spark, and explain how the algorithm leverages Spark's distributed architecture to improve performance.
A
Spark uses a single decision tree model to process the entire dataset on one node, which may lead to memory issues and slow performance.
B
Spark distributes the dataset across multiple nodes and builds a single decision tree model in parallel, but does not support any further optimizations.
C
Spark uses a distributed version of the decision tree algorithm, where each node in the cluster builds a separate decision tree and the final model is a combination of these individual trees.
D
Spark uses a distributed version of the decision tree algorithm, where each node in the cluster builds a separate decision tree, and the final model is an ensemble of these individual trees, such as a random forest or gradient-boosted trees.
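Whichever option the question intends, the core scaling idea behind Spark MLlib's tree training is a map/reduce pattern: each worker computes split statistics (label counts per candidate threshold) over its data partition, and the driver aggregates these statistics to pick the best split. The following is a minimal pure-Python sketch of that pattern, not Spark API code; the data, thresholds, and function names are all illustrative assumptions.

```python
from collections import Counter

# Toy dataset of (feature_value, label) rows, pre-split into "partitions"
# the way Spark would distribute rows across workers.
partitions = [
    [(1.0, 0), (2.0, 0), (3.0, 1)],
    [(1.5, 0), (3.5, 1), (4.0, 1)],
]

candidate_thresholds = [2.5, 3.25]

def partition_stats(rows, thresholds):
    """Map step (per worker): label counts on each side of every
    candidate threshold -- the sufficient statistics for a split."""
    stats = {t: {"left": Counter(), "right": Counter()} for t in thresholds}
    for x, y in rows:
        for t in thresholds:
            side = "left" if x <= t else "right"
            stats[t][side][y] += 1
    return stats

def merge(a, b):
    """Reduce step (driver): merge two partitions' statistics."""
    for t in b:
        a[t]["left"] += b[t]["left"]
        a[t]["right"] += b[t]["right"]
    return a

def gini(counts):
    """Gini impurity of a label-count histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Compute per-partition statistics, then reduce to global statistics.
stats_list = [partition_stats(p, candidate_thresholds) for p in partitions]
merged = stats_list[0]
for s in stats_list[1:]:
    merged = merge(merged, s)

def weighted_gini(t):
    """Impurity of the split at threshold t, weighted by side sizes."""
    left, right = merged[t]["left"], merged[t]["right"]
    n = sum(left.values()) + sum(right.values())
    return (sum(left.values()) * gini(left)
            + sum(right.values()) * gini(right)) / n

# The driver picks the threshold with the lowest weighted impurity.
best = min(candidate_thresholds, key=weighted_gini)
print(best)  # -> 2.5 (this threshold separates the two labels perfectly)
```

Because only these compact histograms travel over the network (never the raw rows), the cost per tree level stays manageable even for very large datasets; this is the key to Spark's scalability for tree training.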