
You are working on a classification problem involving time series data. Early experiments have yielded an AUC ROC value of 99% on the training set, even though you have not used any advanced algorithms or hyperparameter tuning. Given these results, what should be your next step to diagnose and address potential issues with your model?
A
Address the model overfitting by using a less complex algorithm.
B
Address data leakage by applying nested cross-validation during model training.
C
Address data leakage by removing features highly correlated with the target value.
D
Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.
Explanation:
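The correct answer is B: Address data leakage by applying nested cross-validation during model training. A near-perfect AUC ROC reached with minimal effort, without advanced algorithms or hyperparameter tuning, often points to data leakage rather than genuine predictive power. Data leakage occurs when information that would not be available at prediction time is used to build the model, so it performs exceptionally well on training data but poorly on new, unseen data. In time series data, a common cause is shuffling or preprocessing that lets later observations inform predictions about earlier ones. Nested cross-validation separates hyperparameter selection (inner loop) from performance estimation (outer loop), which gives a more reliable estimate of model performance and, when the splits respect time order, helps ensure the model has no access to future information during training. Options A and D treat the result as overfitting, which is unlikely for a simple, untuned model, and deliberately lowering AUC ROC is not a goal in itself. Option C risks discarding legitimately predictive features without fixing how the model is validated.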
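As an illustration of the idea in the explanation, here is a minimal sketch of nested cross-validation with time-ordered splits. It assumes scikit-learn and NumPy, which the question does not name. The synthetic data, the logistic regression pipeline, the `C` grid, and the function name `nested_time_series_auc` are all hypothetical choices made for the example, not part of the original question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_time_series_auc(X, y, n_outer=4, n_inner=3):
    """Return one out-of-sample AUC ROC score per outer fold.

    Rows of X are assumed to be sorted in chronological order.
    """
    # Scaling lives inside the pipeline so it is fitted on training folds only,
    # which avoids leaking statistics from the held-out data.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # TimeSeriesSplit always trains on earlier rows and tests on later rows.
    inner_cv = TimeSeriesSplit(n_splits=n_inner)
    outer_cv = TimeSeriesSplit(n_splits=n_outer)

    # Inner loop: hyperparameter selection (hypothetical grid over C).
    search = GridSearchCV(
        pipeline,
        param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=inner_cv,
    )

    # Outer loop: performance estimation on data the inner loop never saw.
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)


# Synthetic, chronologically ordered data used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

scores = nested_time_series_auc(X, y)
print("Outer-fold AUC ROC:", np.round(scores, 3))
```

If the outer-fold scores come out far below the 99% seen on the training set, that gap is a strong hint that the original figure reflected leakage or an overly optimistic evaluation rather than real predictive skill.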