
Answer-first summary for fast verification
Answer: Split the training and test data based on time rather than a random split to avoid leakage.
The correct answer is B. With time series data, a random split into training and test sets causes data leakage: neighboring hourly readings are highly correlated, so the training set ends up containing points that sit right next to the test points in time, and test accuracy is artificially inflated. Instead, split the data by time, training on past data and testing on later data. Test performance then more closely reflects how the model will behave in a real-world production environment.
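A time-based split can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the question: the function name `time_based_split`, the `test_fraction` parameter, and the synthetic hourly readings are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def time_based_split(rows, test_fraction=0.2):
    """Split (timestamp, value) rows so every test row is later than every training row."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Order chronologically so the split point is a single moment in time.
    ordered = sorted(rows, key=lambda row: row[0])
    # Hold out the most recent rows for testing (at least one row).
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical data: 10 days of synthetic hourly temperature readings.
start = datetime(2024, 1, 1)
readings = [(start + timedelta(hours=h), 10.0 + (h % 24) * 0.5) for h in range(240)]

train, test = time_based_split(readings, test_fraction=0.2)

# The newest training timestamp precedes the oldest test timestamp,
# so no future information reaches the model during training.
latest_train = max(ts for ts, _ in train)
earliest_test = min(ts for ts, _ in test)
```

With this layout, the first eight days go to training and the last two days to testing, which mirrors how the deployed model only ever sees data newer than what it was trained on.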
Author: LeetQuiz Editorial Team
You are building a machine learning model to predict daily temperatures based on hourly temperature data that is continuously uploaded. Initially, you randomly split the dataset into training and test sets and applied transformations to these datasets separately. During testing, your model achieved an accuracy of 97%. However, after deploying the model to a production environment, its accuracy dropped to 66%. What steps can you take to improve your model's accuracy in production?
A
Normalize the data for the training and test datasets as two separate steps.
B
Split the training and test data based on time rather than a random split to avoid leakage.
C
Add more data to your test set to ensure that you have a fair distribution and sample for testing.
D
Apply data transformations before splitting, and cross-validate to make sure that the transformations are applied to both the training and test sets.