
Discuss the role of data parallelism in distributed machine learning and how Spark ML leverages this concept to improve training efficiency. Provide an example of a machine learning algorithm where data parallelism is particularly effective.
A
Data parallelism involves replicating the entire dataset across nodes; Spark ML uses this for all algorithms.
B
Data parallelism splits the dataset across nodes, each training on a subset; Spark ML uses this in gradient-based algorithms like logistic regression.
C
Data parallelism is not used in Spark ML; model parallelism is more effective.
D
Data parallelism is only effective on small-scale datasets; Spark ML uses other methods for large datasets.
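
To make the idea of data parallelism concrete, here is a minimal sketch in plain Python of one way gradient-based training of logistic regression can be split across data partitions. It is an illustration, not Spark ML's implementation: the helper names (`split`, `partial_gradient`, `combine`, `train`), the toy dataset, the learning rate, and the partition count are all assumptions, and the "workers" are simulated with a sequential loop. The point it shows is that the logistic-loss gradient is a sum over examples, so each partition can compute its own partial sum independently and only those small vectors need to be combined.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(partition, weights):
    """Sum of per-example logistic-loss gradients over one partition."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
    return grad


def split(data, num_partitions):
    """Hypothetical partitioner: deal rows round-robin into partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def combine(partials):
    """Element-wise sum of the per-partition gradient vectors."""
    return [sum(column) for column in zip(*partials)]


def train(data, num_partitions, lr=0.5, iterations=200):
    weights = [0.0] * len(data[0][0])
    partitions = split(data, num_partitions)
    for _ in range(iterations):
        # Each partition's gradient depends only on its own rows and the
        # current weights, so these calls could run on separate workers.
        partials = [partial_gradient(p, weights) for p in partitions]
        total = combine(partials)
        weights = [w - lr * g / len(data) for w, g in zip(weights, total)]
    return weights


# Toy dataset (assumed): each row is ([bias_term, x], label).
data = [
    ([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 1.5], 0),
    ([1.0, 3.0], 1), ([1.0, 3.5], 1), ([1.0, 4.0], 1),
]

probe = [0.1, -0.2]
grad_full = partial_gradient(data, probe)
grad_partitioned = combine([partial_gradient(p, probe) for p in split(data, 3)])

weights_single = train(data, num_partitions=1)
weights_parallel = train(data, num_partitions=3)
```

Up to floating-point rounding, `grad_partitioned` matches `grad_full`, and training with three partitions reaches the same weights as training with one. This is why data parallelism suits gradient-based algorithms such as logistic regression: splitting the rows changes where the work happens, not the result.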