
Answer-first summary for fast verification
Answer: Introduce a special category (e.g., 'Missing' or 'Unknown') to denote missing values, preserving the categorical nature of the feature.
Introducing a special category for missing values is the best of the four options because it keeps the feature categorical and does not inject meaningless numeric values. It also makes missingness explicit, so the model can learn from the fact that a value is absent, which matters most when values are not missing at random. Option A fails because the mean of numerically encoded categories is not meaningful: the codes are arbitrary labels, so their average can fall between categories or land on an unrelated one. Option C discards potentially valuable rows and can bias the data when missingness is not random, and artificially enlarging the dataset does not restore the lost information. Option D makes the validation set unrepresentative of the training data, which distorts the model's evaluation. Adding a placeholder category is therefore the recommended way to handle missing categorical data here.
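As a rough illustration of the placeholder-category approach, the sketch below fills missing entries with an explicit label and then one-hot encodes the result using only the Python standard library. The `colors` data, the `"Missing"` label, and the helper names `fill_missing` and `one_hot` are all hypothetical, chosen for this example; in practice a data-frame library would typically do the same job.

```python
MISSING = "Missing"  # hypothetical placeholder label for absent values


def fill_missing(values, placeholder=MISSING):
    """Replace None entries with an explicit placeholder category."""
    return [placeholder if v is None else v for v in values]


def one_hot(values):
    """One-hot encode category labels; columns follow sorted label order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Made-up sample feature with two missing entries.
colors = ["red", None, "blue", "red", None]

filled = fill_missing(colors)
categories, encoded = one_hot(filled)

print(filled)      # the None entries now carry the "Missing" label
print(categories)  # "Missing" appears as its own category column
for row in encoded:
    print(row)
```

Because "Missing" becomes an ordinary category with its own column, the model can learn whether absence itself is predictive, and no rows have to be dropped.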
Author: LeetQuiz Editorial Team
While preparing a dataset for a machine learning model, you find null values in a crucial categorical feature during exploratory data analysis. Left unhandled, these could introduce bias into your model. Given the need to maintain data integrity, minimize bias, and avoid harming the model's performance, what is the best strategy for handling these missing values? Choose the best option.
A. Replace the missing values with the mean of the feature, assuming the categorical data can be numerically encoded.
B. Introduce a special category (e.g., 'Missing' or 'Unknown') to denote missing values, preserving the categorical nature of the feature.
C. Remove the rows containing missing values entirely, and then artificially increase your dataset size by 5% to compensate for the loss.
D. Transfer the rows with missing values to your validation dataset, ensuring they do not influence the training phase.