
Answer-first summary for fast verification
Answer: The imputer must first be fitted to the data to create an 'ImputerModel' before transforming.
In Spark ML, the correct process is to first fit the Imputer to the data, which computes the median of each input column and returns an 'ImputerModel'. Only that fitted model exposes a 'transform' method that applies the learned medians to fill the missing values in the dataset. The other options either misinterpret Spark ML's capabilities or misidentify the issue with the code. Specifically, option A is incorrect because median imputation is supported via strategy='median'; option B is irrelevant because the code's flaw has nothing to do with dataset splitting; option C is wrong because 'inputCols' and 'outputCols' may differ (distinct output names preserve the original columns); and option E is incorrect because both 'fit' and 'transform' are necessary, in that order, not one in place of the other.
Author: LeetQuiz Editorial Team
A data scientist is attempting to use Spark ML to impute missing values in their PySpark DataFrame 'features_df'. The goal is to replace missing values in all numeric columns with the median of each column. However, the provided code snippet does not achieve this. What is the primary reason the code fails to perform the intended imputation?
my_imputer = Imputer(strategy='median', inputCols=input_columns, outputCols=output_columns)
imputed_df = my_imputer.transform(features_df)
A
Imputing using a median value is not supported in Spark ML.
B
The code does not handle imputation for both training and test datasets at the same time.
C
The 'inputCols' and 'outputCols' parameters must have identical column names.
D
The imputer must first be fitted to the data to create an 'ImputerModel' before transforming.
E
The 'transform' method should be replaced with the 'fit' method.