
Answer-first summary for fast verification
Answer: Implement a time-based split for the training and test data instead of a random split to prevent data leakage and ensure the model is evaluated on future, unseen data, mimicking real-world deployment scenarios., Apply data transformations before splitting the dataset and employ cross-validation to ensure that transformations are consistently applied across training and test sets, thereby minimizing discrepancies.
**Correct Answers: D and A** **Why D?** - **Data Leakage Prevention:** A time-based split ensures that the model is tested on data that comes after the training data, preventing the model from having access to future information during training, which is crucial for time-series data like temperature forecasts. - **Real-world Simulation:** This approach mirrors the actual deployment scenario where the model predicts future temperatures based on past data, providing a more accurate assessment of the model's performance. **Why A?** - **Uniform Transformation Application:** Applying transformations before splitting and using cross-validation ensures that the same preprocessing steps are applied to both training and test sets, reducing the risk of introducing biases or discrepancies that could affect the model's performance in production. While option B might seem beneficial by maintaining dataset independence, it does not address the core issue of data leakage in time-series data. Option C, increasing the test set size, may improve evaluation reliability but does not tackle the fundamental problem of data leakage. Option E combines the strengths of A and D, offering a comprehensive solution by addressing both data leakage and uniform transformation application, making it a strong alternative. However, the question specifies to choose the BEST single strategy, hence D is prioritized for its direct impact on preventing data leakage, a critical issue in time-series forecasting.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
You are tasked with developing a machine learning model to forecast daily temperatures for a weather forecasting application. The dataset includes hourly temperature readings over several years. Initially, you randomly split the data into training and test sets, applied necessary transformations, and achieved a testing accuracy of 97%. However, upon deployment, the model's accuracy significantly dropped to 66%. Considering the importance of accurate forecasts for planning and safety, and the need to comply with data privacy regulations, which of the following strategies would BEST improve the production model's accuracy? (Choose one correct option)
A
Apply data transformations before splitting the dataset and employ cross-validation to ensure that transformations are consistently applied across training and test sets, thereby minimizing discrepancies.
B
Normalize the training and test datasets independently to maintain the integrity of each dataset's distribution, potentially improving model generalization.
C
Increase the size of the test set to ensure it is more representative of the overall data distribution, aiming for a more reliable evaluation of the model's performance.
D
Implement a time-based split for the training and test data instead of a random split to prevent data leakage and ensure the model is evaluated on future, unseen data, mimicking real-world deployment scenarios.
E
Combine options A and D, applying data transformations before splitting and using a time-based split to both prevent data leakage and ensure uniform transformation application.