
In the context of preparing a dataset for machine learning, you encounter a significant number of missing values across several features. The dataset is large, and the missing values are not randomly distributed. Given the need to preserve as much data as possible for accurate model training, which of the following methods would be the MOST appropriate way to handle the missing values? Choose the best option.
A. Normalizing the dataset to adjust the scale of numerical features, assuming this will indirectly address the missing values.
B. Implementing ensemble learning techniques, expecting that the combination of multiple models will compensate for the missing data.
C. Dropping all rows with missing values to ensure the dataset is clean and ready for model training.
D. Using imputation methods to fill in missing values based on the distribution or other statistical measures of the available data.
E. Applying data sharding to distribute the dataset across multiple nodes, thinking this will mitigate the impact of missing values.
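
For context, option D describes imputation in practice. Below is a minimal sketch of what it could look like with scikit-learn; the DataFrame, column names, and missingness pattern are hypothetical, invented only to illustrate the technique:

```python
# Minimal imputation sketch (hypothetical data and column names).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Build a toy dataset with non-random missingness: "income" is more
# often missing when "age" is low, mimicking the scenario in the stem.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=200).astype(float),
    "income": rng.normal(50_000, 15_000, size=200),
})
young = df["age"] < 30
df.loc[young, "income"] = np.where(
    rng.random(young.sum()) < 0.5, np.nan, df.loc[young, "income"]
)

# Simple statistical imputation: replace each missing value with the
# column median computed from the available data.
median_imputed = SimpleImputer(strategy="median").fit_transform(df)

# Distribution-aware alternative: KNN imputation estimates each missing
# value from the k most similar rows, which can help when missingness
# correlates with other features.
knn_imputed = KNNImputer(n_neighbors=5).fit_transform(df)

print(pd.DataFrame(knn_imputed, columns=df.columns).describe())
```

Because the missingness here is not random, a plain mean or median fill can bias the feature's distribution; methods that condition on other features, such as KNN or iterative imputation, are often preferred in that setting.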