
Databricks Certified Machine Learning - Associate
A data scientist is attempting to use Spark ML to impute missing values in their PySpark DataFrame 'features_df'. The goal is to replace missing values in all numeric columns with the median of each column. However, the provided code snippet does not achieve this. What is the primary reason the code fails to perform the intended imputation?
my_imputer = Imputer(strategy = 'median', inputCols = input_columns, outputCols = output_columns)
imputed_df = my_imputer.transform(features_df)
Explanation:
In Spark ML, an Imputer is an estimator, not a transformer: it must first be fitted to the data with 'fit', which computes the median of each input column and returns an 'ImputerModel'. Only that fitted model exposes a 'transform' method that replaces the missing values with the learned medians. The code fails because it calls 'transform' directly on the unfitted Imputer, skipping the 'fit' step. The other options either misinterpret Spark ML's capabilities or misidentify the issue with the code. Specifically, option A is incorrect because median imputation is supported; option B is irrelevant because the code's issue is not about dataset splitting; option C misrepresents the requirement for 'inputCols' and 'outputCols'; and option E is incorrect because both 'fit' and 'transform' are necessary, just in the correct sequence.