
Answer-first summary for fast verification
Answer: B, D, and A. For numerical fields with missing values, replace the NaN entries with the mean or median of the available data in those fields (B). For categorical fields, replace missing values with the most frequently occurring category in those fields (D). Implement a secondary machine learning model specifically designed to predict and fill in missing values based on the available data (A).
Handling missing data well is crucial for building reliable machine learning models. Replacing missing numerical values with the mean or median (B) keeps every record and preserves the field's central tendency, although it shrinks the variance somewhat; the median is the safer choice when the field is skewed or contains outliers. For categorical data, filling with the most frequent category (D) is a common approach that leverages the existing data distribution. Training a model to predict missing values (A) is more sophisticated and can be more accurate, especially when the missingness depends on other observed fields. Deleting every record that has any missing value (C) can cause significant data loss and bias, while filling with random values (E) ignores the relationships between fields and introduces noise. For more detailed strategies on handling missing data, refer to: [Towards Data Science](https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e).
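To make options B and D concrete, here is a minimal sketch using only the Python standard library. The helper names (`is_missing`, `impute_numeric`, `impute_categorical`) and the sample `ages` and `cities` data are made up for illustration; they are not part of the original question. In practice the same idea is usually expressed with pandas `fillna` or scikit-learn's `SimpleImputer`.

```python
import math
from collections import Counter
from statistics import median


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_numeric(values):
    """Option B: fill missing numeric entries with the median of the observed ones."""
    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]


def impute_categorical(values):
    """Option D: fill missing entries with the most frequent observed category."""
    observed = [v for v in values if not is_missing(v)]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if is_missing(v) else v for v in values]


# Hypothetical sample columns with gaps.
ages = [25.0, float("nan"), 31.0, 40.0, None]
cities = ["Paris", None, "Paris", "Lyon", "Paris"]

filled_ages = impute_numeric(ages)
filled_cities = impute_categorical(cities)
print(filled_ages)
print(filled_cities)
```

The fill value is computed from the observed entries only. In a real pipeline it would typically be computed on the training split and reused on validation and test data, so that no information leaks across splits.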
Author: LeetQuiz Editorial Team
As a junior Data Scientist at a consulting firm, you're tasked with improving a machine learning model's performance. The initial analysis reveals that the dataset contains numerous missing values (NaN) across various fields, significantly impacting the model's accuracy. Your team lead emphasizes the importance of handling these missing values effectively during the data acquisition phase to ensure the model's reliability and performance. Considering the constraints of maintaining data integrity, minimizing bias, and ensuring scalability, which three strategies should you implement to address the missing values? (Choose three)
A. Implement a secondary machine learning model specifically designed to predict and fill in missing values based on the available data.
B. For numerical fields with missing values, replace the NaN entries with the mean or median of the available data in those fields.
C. Remove all records from the dataset that contain any missing values, regardless of the field or the amount of missing data.
D. For categorical fields, replace missing values with the most frequently occurring category in those fields.
E. Use a random value from the existing data to fill in missing entries, ensuring a uniform distribution of replaced values.
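Option A, a secondary model that predicts the missing values, can be sketched roughly as follows. This is an illustrative example only: it assumes a single fully observed numeric feature, fits an ordinary least squares line on the complete rows, and uses it to fill the gaps. The function names and the `years_experience` / `salary` fields are hypothetical. Real projects would more likely use a library imputer such as scikit-learn's `KNNImputer` or `IterativeImputer`.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with_model(rows, feature, target):
    """Fit on rows where `target` is present, then predict it where it is missing.

    Assumes `feature` itself has no missing values.
    """
    complete = [r for r in rows if not is_missing(r[target])]
    slope, intercept = fit_line(
        [r[feature] for r in complete],
        [r[target] for r in complete],
    )
    return [
        {**r, target: slope * r[feature] + intercept} if is_missing(r[target]) else dict(r)
        for r in rows
    ]


# Hypothetical data: salary (in thousands) is missing for one record.
rows = [
    {"years_experience": 1.0, "salary": 40.0},
    {"years_experience": 2.0, "salary": 50.0},
    {"years_experience": 3.0, "salary": None},
    {"years_experience": 4.0, "salary": 70.0},
]

filled = impute_with_model(rows, "years_experience", "salary")
print(filled[2])
```

Unlike a single mean or median, the predicted fill value changes with the other fields of each record, which is why this approach can do better when missingness is related to observed data.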