A data scientist is working on one-hot encoding categorical attributes in a PySpark DataFrame named 'features_df' using Spark ML. The string column names are stored in the variable 'input_columns'. The provided code snippet is causing an error. What change is necessary to correctly perform one-hot encoding?
Explanation:
In Spark ML, OneHotEncoder operates on numeric category indices, not on strings, so categorical string columns must first be converted to indices with StringIndexer. The correct approach has two steps: 1) use StringIndexer to map each string column to a column of indices, and 2) apply OneHotEncoder to those index columns to produce the one-hot encoded vectors. Adding the StringIndexer step is the change that resolves the error. Option A is incorrect because OneHotEncoder does not take a 'method' parameter. Option B is incorrect because OneHotEncoder is an estimator: the 'fit' operation is needed so that it can learn how many categories each column has. Option D is incorrect because OneHotEncoder needs distinct output column names to store the encoded vectors.