
Answer-first summary for fast verification
Answer: Limit feature engineering to the training set only, excluding validation and test sets.
To prevent data leakage, fit every feature-engineering step (such as scaling, imputation, encoding, or feature selection) on the training set only, then apply the fitted transformations unchanged to the validation and test sets. This ensures that performance on the validation and test sets reflects the model's ability to generalize to unseen data. Data leakage occurs when information from outside the training set influences training, which leads to overly optimistic performance estimates. Option B causes exactly this by engineering features on the full dataset before the split. Options A, D, and E do not address the root cause of leakage and tend to encourage overfitting or add irrelevant information.
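As a rough illustration of the idea, the sketch below uses standardization as the feature-engineering step. The data is synthetic, and the helper names `fit_standardizer` and `transform` are hypothetical, chosen only for this example. It splits first, learns the mean and standard deviation from the training portion alone, and reuses those statistics for the held-out portion.

```python
import random
import statistics


def fit_standardizer(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against a constant feature to avoid division by zero.
    return mean, (std if std > 0 else 1.0)


def transform(values, mean, std):
    """Apply previously learned statistics without refitting."""
    return [(v - mean) / std for v in values]


# Synthetic, already-shuffled feature values (an assumption for this sketch).
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(100)]

# 1. Split first.
train, test = data[:80], data[80:]

# 2. Fit the transformation on the training set only.
mean, std = fit_standardizer(train)

# 3. Apply the same fitted statistics to both sets.
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# For contrast: fitting on the full dataset (option B) yields statistics
# that already contain information from the test rows.
leaky_mean, leaky_std = fit_standardizer(data)
```

The same pattern applies to any learned preprocessing step: the held-out rows are only ever transformed, never used to estimate parameters.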
Author: LeetQuiz Editorial Team
What is the best strategy to avoid data leakage in machine learning?
A. Include as many features as possible in the model to ensure no information is left out.
B. Perform feature engineering on the entire dataset before splitting into training, validation, and test sets.
C. Limit feature engineering to the training set only, excluding validation and test sets.
D. Use the most complex models available to capture every possible pattern in the data.
E. Train the model for an excessive number of iterations to ensure all patterns are learned.