Ultimate access to all questions.
When a data scientist integrates a random forest regressor pipeline as the final stage in a Spark ML Pipeline and initiates cross-validation, what is a potential downside of constructing the pipeline within the cross-validation process?
Explanation:
Incorporating the entire pipeline as an estimator in cross-validation means that each stage, including data preprocessing, is executed for every fold. This leads to a longer runtime because each stage of the pipeline must be refitted or retransformed for every model iteration, ensuring that the process is thorough but time-consuming.