
Google Professional Machine Learning Engineer
You are working on a machine learning project where you have identified a categorical feature, Feature A, during exploratory data analysis. Feature A shows significant predictive power for your target variable but is found to have missing values in approximately 10% of the dataset. The dataset is large, and the missingness in Feature A is believed to be random. Given the importance of Feature A and the need to maintain the integrity of your model's predictions, which of the following approaches is the BEST course of action? Choose one correct option.
Explanation:
The best approach is to introduce a new category for missing values within Feature A and add a binary indicator feature that flags the rows where the value was missing. This preserves the predictive power of Feature A by keeping it categorical and makes the missingness explicit, so the model can learn from any pattern in it. Imputing with the mode, or inferring the value from a correlated feature, treats missing values as if they resembled observed ones, which can introduce bias; with about 10% of rows affected, mode imputation would also inflate the most frequent category. Dropping Feature A would discard significant information, given its predictive importance. The proposed approach handles the missing values without discarding valuable information and without assuming what the missing values would have been.
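The approach described above can be sketched in a few lines of pandas. This is a minimal illustration rather than a prescribed implementation: the toy data, the column names `feature_a` and `feature_a_missing`, and the label `"Missing"` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "feature_a" stands in for Feature A.
df = pd.DataFrame({
    "feature_a": ["red", "blue", np.nan, "red", np.nan, "green"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Binary indicator: 1 where Feature A was missing, 0 otherwise.
# Compute it before filling, while the NaNs are still present.
df["feature_a_missing"] = df["feature_a"].isna().astype(int)

# Explicit category for missing values (the label "Missing" is an arbitrary choice).
df["feature_a"] = df["feature_a"].fillna("Missing")

# One-hot encode so the new category is treated like any other level.
encoded = pd.get_dummies(df, columns=["feature_a"], dtype=int)
```

Note that once Feature A is one-hot encoded, the `feature_a_Missing` column and the `feature_a_missing` indicator carry the same information, so one of them could be dropped for models that are sensitive to perfectly collinear inputs. The separate indicator adds more when the categorical column is encoded some other way, for example with target or ordinal encoding.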