Ultimate access to all questions.
You are tasked with developing a machine learning model to predict house prices for a real estate company. During the data preprocessing phase, you discover that the 'distance from the nearest school' feature, which is considered a crucial predictor, has a significant number of missing values and exhibits low variance. The company emphasizes the importance of utilizing every data row to maximize the model's predictive accuracy. Additionally, the solution must be cost-effective and scalable to accommodate future data. Given these constraints, what is the optimal strategy to handle the missing data in this scenario? Choose the best option.
Explanation:
Option D is correct because employing linear regression to estimate missing values ensures the data's accuracy without introducing bias, aligning with the need for a cost-effective and scalable solution. Linear regression, a supervised learning method, fits a linear equation to the data, enabling the prediction of unknown values. This technique is particularly useful for maintaining the model's integrity when dealing with low-variance features that are missing data. Option E is also correct as it suggests a comprehensive approach that not only addresses the missing data issue but also enhances the feature's variance, potentially improving the model's predictive performance. However, the primary focus should be on accurately handling the missing data, making D the best single option.