
A data analyst has built an ML pipeline with Spark ML on a fixed input dataset, but the pipeline's processing time is too high. To improve performance, the analyst increased the number of workers in the cluster. After the reconfiguration, however, they noticed that the training set's row count differed from the count before the change. Which strategy guarantees a consistent training and test set for every model iteration?
A. Adjust the cluster configuration manually
B. Specify a fixed split ratio in the data splitting process
C. Implement manual partitioning of the input dataset
D. Persistently store the split datasets
E. No strategy can guarantee consistent training and test sets
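
As background for the scenario above, here is a minimal PySpark sketch (the dataset paths are hypothetical, for illustration only) of the behavior the question describes: DataFrame.randomSplit samples each partition independently, so even with a fixed seed the resulting row counts can shift when the cluster, and therefore the partitioning, changes. Writing the splits out once and reading them back pins the exact rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-reproducibility").getOrCreate()

# Hypothetical input path, stands in for the fixed input dataset.
df = spark.read.parquet("/data/features.parquet")

# randomSplit samples per partition, so a cluster reconfiguration that
# changes the partitioning can change the split, even with the same seed.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Persisting the splits once fixes the exact rows for every subsequent
# model iteration, regardless of how the cluster is laid out later.
train.write.mode("overwrite").parquet("/data/train.parquet")
test.write.mode("overwrite").parquet("/data/test.parquet")

# Later iterations read the stored splits instead of re-splitting.
train = spark.read.parquet("/data/train.parquet")
test = spark.read.parquet("/data/test.parquet")
```

Persisting the splits trades some extra storage for full reproducibility across cluster configurations, which is exactly the guarantee the question asks about.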