
You are tasked with developing a machine learning model to predict house prices for a real estate company. During the data preprocessing phase, you discover that the 'distance from the nearest school' feature, which is considered a crucial predictor, has a significant number of missing values and exhibits low variance. The company emphasizes the importance of utilizing every data row to maximize the model's predictive accuracy. Additionally, the solution must be cost-effective and scalable to accommodate future data. Given these constraints, what is the optimal strategy to handle the missing data in this scenario? Choose the best option.
A
Replace the missing values with zeros to maintain dataset completeness, despite the potential introduction of bias.
B
Eliminate any rows that contain missing entries to ensure data integrity, at the risk of reducing the dataset size and losing valuable information.
C
Merge this feature with another column that is fully populated to avoid data loss, though this may dilute the predictive power of the original feature.
D
Predict the missing values by applying linear regression for an accurate data representation, leveraging the existing data to estimate missing values without introducing significant bias.
E
Use a combination of imputation for missing values and feature engineering to enhance the 'distance from the nearest school' feature's variance, ensuring both data completeness and improved model performance.
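
For reference, the mechanics behind options A, B, and D can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended pipeline: the column names `school_distance_km` and `city_center_distance_km` are hypothetical, the second column is assumed to be fully populated, and a single-predictor least-squares fit stands in for whatever regression model would be used in practice.

```python
from typing import Optional

Row = dict[str, Optional[float]]

# Hypothetical column names, chosen only for this sketch.
TARGET = "school_distance_km"            # feature with missing values
PREDICTOR = "city_center_distance_km"    # assumed to have no missing values


def fill_with_zero(rows: list[Row]) -> list[Row]:
    """Option A: keep every row, but treat a missing distance as 0 km."""
    return [{**r, TARGET: 0.0 if r[TARGET] is None else r[TARGET]} for r in rows]


def drop_incomplete(rows: list[Row]) -> list[Row]:
    """Option B: keep only rows where the distance is present."""
    return [dict(r) for r in rows if r[TARGET] is not None]


def impute_with_regression(rows: list[Row]) -> list[Row]:
    """Option D: fit TARGET ~ PREDICTOR on complete rows, then predict the gaps."""
    known = [r for r in rows if r[TARGET] is not None]
    xs = [r[PREDICTOR] for r in known]
    ys = [r[TARGET] for r in known]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor is constant; cannot fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Ordinary least squares with one predictor.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    return [
        {**r, TARGET: (intercept + slope * r[PREDICTOR]) if r[TARGET] is None else r[TARGET]}
        for r in rows
    ]
```

Of the three, only `drop_incomplete` reduces the row count; the other two keep every row and differ only in the value written into the gaps.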