
A data analyst has built an ML pipeline with Spark ML on a fixed input dataset, but the pipeline's processing time is too high. To improve performance, the analyst increased the number of workers in the cluster. After the reconfiguration, however, they noticed that the training set's row count differed from the count before the change. Which strategy guarantees a consistent training and test set for every model iteration?
A. Adjust the cluster configuration manually
B. Specify a fixed split ratio in the data splitting process
C. Implement manual partitioning of the input dataset
D. Persistently store the split datasets
E. No strategy can guarantee consistent training and test sets
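
As background for the scenario above, here is a minimal PySpark sketch (the dataset paths are hypothetical, for illustration only) of the behavior the question describes: DataFrame.randomSplit samples each partition independently, so even with a fixed seed the resulting row counts can shift when the cluster, and therefore the partitioning, changes. Writing the splits out once and reading them back pins the exact rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-reproducibility").getOrCreate()

# Hypothetical input path, stands in for the fixed input dataset.
df = spark.read.parquet("/data/features.parquet")

# randomSplit samples per partition, so a cluster reconfiguration that
# changes the partitioning can change the split, even with the same seed.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Persisting the splits once fixes the exact rows for every subsequent
# model iteration, regardless of how the cluster is laid out later.
train.write.mode("overwrite").parquet("/data/train.parquet")
test.write.mode("overwrite").parquet("/data/test.parquet")

# Later iterations read the stored splits instead of re-splitting.
train = spark.read.parquet("/data/train.parquet")
test = spark.read.parquet("/data/test.parquet")
```

Persisting the splits trades some extra storage for full reproducibility across cluster configurations, which is exactly the guarantee the question asks about.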