
Answer-first summary for fast verification
Answer: Use the `OneHotEncoder` transformer from the `pyspark.ml.feature` module to convert categorical features to a binary vector representation.
The correct choice is B: `OneHotEncoder` from the `pyspark.ml.feature` module converts categorical features into binary (one-hot) vectors, so each category gets its own vector position and no artificial ordering is implied. `OneHotEncoder` operates on numeric category indices, so string columns are typically passed through `StringIndexer` first. Option A is incorrect on its own because `StringIndexer` only maps categories to numerical indices, and models such as linear or logistic regression may treat those indices as ordered magnitudes. Option C is incorrect because `VectorAssembler` combines multiple columns into a single vector column; it does not encode categories. Option D is incorrect because `Bucketizer` maps continuous features to bucket indices based on specified split points; it does not encode categorical features.
Author: LeetQuiz Editorial Team
In the context of feature engineering using Spark ML, explain the process of handling categorical features and encoding them for use in machine learning models. Provide a code snippet demonstrating the encoding of categorical features using Spark ML transformers.
A. Use the StringIndexer transformer from the pyspark.ml.feature module to convert categorical features to numerical indices.
B. Use the OneHotEncoder transformer from the pyspark.ml.feature module to convert categorical features to a binary vector representation.
C. Use the VectorAssembler transformer from the pyspark.ml.feature module to combine categorical features into a single vector column.
D. Use the Bucketizer transformer from the pyspark.ml.feature module to map categorical features to a fixed-size vector of values.