
You are working on a machine learning project where you have identified a categorical feature, Feature A, during exploratory data analysis. Feature A shows significant predictive power for your target variable, but its value is missing in approximately 10% of the records. The dataset is large, and the missingness in Feature A is believed to be random. Given the importance of Feature A and the need to maintain the integrity of your model's predictions, which of the following approaches is the BEST course of action? Choose one correct option.
A
Impute the missing values in Feature A with the mode of the feature, assuming that the most common category is the best replacement for missing data.
B
Replace the missing values in Feature A with values from the feature that has the highest Pearson correlation with Feature A, under the assumption that correlated features can provide reasonable substitutes.
C
Remove Feature A from the dataset if more than 15% of its values are missing, to avoid the complexity of handling missing data, despite its predictive power.
D
Introduce a new category within Feature A to represent missing values and create an additional binary feature indicating the presence or absence of a value in Feature A, to explicitly model the missingness.
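For illustration only, the mechanics described in options A and D can be sketched in pandas as follows. This is a minimal sketch under stated assumptions: the column name `feature_a`, the toy values, the `"Missing"` label, and the derived column names are all hypothetical and do not come from the question.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value in Feature A.
df = pd.DataFrame({"feature_a": ["red", "blue", None, "red", None, "green"]})

# Option A: replace missing entries with the most frequent category (the mode).
mode_value = df["feature_a"].mode(dropna=True).iloc[0]
df["feature_a_mode"] = df["feature_a"].fillna(mode_value)

# Option D, step 1: binary indicator, 1 where a value was present, 0 where it was missing.
# Computed from the original column, before any filling.
df["feature_a_present"] = df["feature_a"].notna().astype(int)

# Option D, step 2: keep the rows and give missing entries their own explicit category.
df["feature_a_filled"] = df["feature_a"].fillna("Missing")

print(df)
```

In this sketch, mode imputation makes the filled rows indistinguishable from rows that genuinely held the most common category, whereas the indicator column and the separate category preserve the information that a value was absent.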