
Answer-first summary for fast verification
Answer: Use the `ChiSqSelector` class from the `pyspark.ml.feature` module to select the top k features with the highest chi-squared statistics for categorical features.
`ChiSqSelector` from `pyspark.ml.feature` runs a chi-squared test of independence between each categorical feature and a categorical label, then keeps the k features (the `numTopFeatures` parameter) most strongly associated with the label. This identifies the features most relevant to the target variable. Option B is incorrect because `pyspark.ml.feature` does not provide an `RFE` class; recursive feature elimination is offered by libraries such as scikit-learn, and in Spark it would have to be implemented by hand around a model that exposes feature importances. Option C is incorrect because `VectorSlicer` only extracts the features at indices or names you specify in advance; it does not rank features with a statistical test. Option D is incorrect because `MinMaxScaler` rescales features to a range and performs no feature selection.
Author: LeetQuiz Editorial Team
In the context of feature selection using Spark ML, explain the process of selecting the most relevant features for a machine learning model. Provide a code snippet demonstrating the use of Spark ML's feature selection techniques, such as ChiSqSelector or RFE (Recursive Feature Elimination), and explain the key considerations to keep in mind during this process.
A
Use the ChiSqSelector class from the pyspark.ml.feature module to select the top k features with the highest chi-squared statistics for categorical features.
B
Use the RFE class from the pyspark.ml.feature module to perform recursive feature elimination based on the importance of features learned by a machine learning model.
C
Use the VectorSlicer class from the pyspark.ml.feature module to select a subset of features from a vector column.
D
Use the MinMaxScaler class from the pyspark.ml.feature module to scale the features to a specific range, without performing feature selection.
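To make the ranking criterion in option A concrete, the following plain-Python sketch computes the Pearson chi-squared statistic for each categorical feature against the label and keeps the k highest-scoring features. The function names `chi_squared` and `top_k_features` and the toy data are hypothetical, and this is not Spark's implementation: as far as I know, Spark orders features by the p-value of the test, which can differ from ordering by the raw statistic when features have different numbers of categories.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic for one categorical feature against the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)

    stat = 0.0
    for f, f_count in feature_totals.items():
        for y, y_count in label_totals.items():
            # Expected count for this cell if feature and label were independent.
            expected = f_count * y_count / n
            stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat


def top_k_features(rows, labels, k):
    """Return (sorted indices of the k highest-scoring features, all scores)."""
    num_features = len(rows[0])
    scores = [
        chi_squared([row[j] for row in rows], labels) for j in range(num_features)
    ]
    ranked = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


if __name__ == "__main__":
    # Made-up data: feature 0 mirrors the label, feature 1 is unrelated to it.
    rows = [(1, 0), (1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 1)]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    selected, scores = top_k_features(rows, labels, k=1)
    print("scores:", scores)
    print("selected:", selected)
```

A feature that is independent of the label has observed counts equal to the expected counts in every cell, so its statistic is zero and it is the first to be dropped.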