
Answer-first summary for fast verification
Answer: Use the `Pipeline` class from the `pyspark.ml` module to define the stages of the pipeline, but be aware of the potential for data leakage between stages.
The correct approach (option C) is to use the `Pipeline` class from the `pyspark.ml` module to define the pipeline stages, ensuring that each stage is a Spark ML estimator or transformer, while guarding against data leakage between stages. Leakage occurs when information from the test set influences training, for example when a feature transformer is fitted on the full dataset before the train/test split. Option A is incorrect because it does not address this leakage risk. Options B and D are incorrect because `PipelineModel` represents a fitted pipeline, the object returned by `Pipeline.fit()`; it is not used to define the stages.
Author: LeetQuiz Editorial Team
In the context of developing a Spark ML Pipeline, explain the process of identifying key gotchas and how to handle them. Provide a code snippet demonstrating the development of a Spark ML Pipeline and explain how to address the potential issues that may arise.
A
Use the Pipeline class from the pyspark.ml module to define the stages of the pipeline, and ensure that each stage is a Spark ML estimator or transformer.
B
Use the PipelineModel class from the pyspark.ml module to define the stages of the pipeline, and ensure that each stage is a Spark ML estimator or transformer.
C
Use the Pipeline class from the pyspark.ml module to define the stages of the pipeline, but be aware of the potential for data leakage between stages.
D
Use the PipelineModel class from the pyspark.ml module to define the stages of the pipeline, but be aware of the potential for data leakage between stages.