In your role as a Machine Learning Engineer, you are working on a classification problem involving time series data. After minimal experimentation using random cross-validation, you have achieved an unusually high ROC AUC of 99% on the training data, without employing sophisticated algorithms or extensive hyperparameter tuning. Given the issues this could indicate, such as data leakage or overfitting, and considering the constraints of ensuring model generalizability and compliance with data privacy standards, what should your next step be to accurately identify and resolve the underlying issue? Choose the best two options.
A
Implement a simpler algorithm and use k-fold cross-validation to address potential overfitting, ensuring the model's performance is not just a result of memorizing the training data.
B
Investigate and remove features that are highly correlated with the target variable to reduce the risk of data leakage, while also ensuring that the model's input features comply with data privacy standards.
C
Adjust the model's hyperparameters to intentionally lower the ROC AUC on the training data, aiming for a more realistic performance metric that might better generalize to unseen data.
D
Employ nested cross-validation during the model training phase to rigorously check for and mitigate data leakage, ensuring that the model evaluation is based on genuinely unseen data.
E
Both A and D are necessary steps to comprehensively address the issue, combining the prevention of overfitting with the mitigation of data leakage.
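As background for the scenario above: with time series data, random (shuffled) cross-validation can leak future observations into the training folds, which is one common cause of implausibly high scores like the 99% described in the stem. A minimal sketch of time-aware splitting, using scikit-learn's `TimeSeriesSplit` on a small synthetic dataset (the dataset and split count are illustrative assumptions, not part of the question):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered dataset: 10 samples, 2 features, in temporal order.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# TimeSeriesSplit keeps every training fold strictly earlier in time than
# its test fold, so evaluation is on genuinely later, unseen data --
# unlike a shuffled KFold, which mixes future samples into training.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index.
    assert train_idx.max() < test_idx.min()
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

In practice, this time-aware splitter can also serve as the inner and outer loop of the nested cross-validation mentioned in option D, so that hyperparameter selection never sees data from the future either.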