
Answer-first summary for fast verification
Answer: Persistently store the split datasets
**Correct Answer: D. Persistently store the split datasets**

**Explanation:** To keep the training and test datasets consistent across model iterations or cluster setups, persistently store the datasets once they have been split. Saving the splits to stable storage guarantees that the exact same training and test rows are used for every model iteration, regardless of cluster configuration changes.

**Other Options:**

- **A:** Manually adjusting the cluster configuration may optimize resource use, but it does not address dataset consistency.
- **B:** Fixing the split ratio is common, but without persistent storage the rows in each split can still vary between runs, especially in distributed environments like Spark, where the split is computed per partition and changes when partitioning changes.
- **C:** Manual partitioning offers control over data distribution, but it does not guarantee consistency across runs unless the partitions themselves are persisted.
- **E:** Incorrect; persistently storing the split datasets does guarantee consistent training and test sets.

Persistently storing the training and test datasets after splitting is a reliable way to ensure consistency across runs and cluster configurations, which is essential for reproducible machine learning models.
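A minimal sketch of the idea, using pure Python in place of Spark (the file names and the 80/20 ratio are illustrative assumptions; in a real pipeline the writes would be e.g. `DataFrame.write.parquet` to stable storage and the reads `spark.read.parquet`):

```python
import json
import random

# A stand-in for the fixed input dataset.
rows = [{"id": i, "value": i * 10} for i in range(100)]

# Split once with a fixed seed. Note that in a distributed engine like
# Spark, even a seeded randomSplit can change when the partitioning
# changes (e.g., after adding workers), so a seed alone is not enough.
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
train, test = shuffled[:cut], shuffled[cut:]

# Persistently store the split datasets (here as JSON files).
with open("train.json", "w") as f:
    json.dump(train, f)
with open("test.json", "w") as f:
    json.dump(test, f)

# Every later iteration reads the stored splits back, so the rows are
# identical no matter how the compute cluster is configured.
with open("train.json") as f:
    train_again = json.load(f)
with open("test.json") as f:
    test_again = json.load(f)

assert train_again == train and test_again == test
```

The key property is that the split is computed exactly once; all subsequent runs consume the stored artifacts rather than re-splitting.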
Author: LeetQuiz Editorial Team
A data analyst has built an ML pipeline using a fixed input dataset with Spark ML, but the pipeline's processing time is too high. To improve efficiency, the analyst increased the number of workers in the cluster. However, after the cluster reconfiguration the training set's row count differed from before. Which strategy guarantees a consistent training and test set for each model iteration?
A
Adjust the cluster configuration manually
B
Specify a fixed ratio in the data splitting process
C
Implement manual partitioning of the input dataset
D
Persistently store the split datasets
E
No strategy can ensure a consistent training and test set