
Answer-first summary for fast verification
Answer: The presence of categorical variables that require encoding before being used in the model., Duplicate examples in the dataset, which could skew the model's understanding of feature importance.
The correct answer is E because categorical variables are not an issue of data validation but rather a preprocessing step in machine learning. Data validation focuses on ensuring the accuracy and consistency of data, not on the transformation of data types. Option A, while related to data quality, is more about data cleaning than validation. The explanation aligns with the original question's intent, emphasizing that categories (or categorical variables) are not a concern of data validation but are part of the data preparation phase. For more detailed understanding, refer to Google's Machine Learning Crash Course on data preprocessing and validation techniques.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
In the process of developing a machine learning model using regression techniques for a project aimed at predicting housing prices, you've encountered a scenario where the model performs exceptionally well during training but fails to generalize during testing, leading to unsatisfactory results. A preliminary investigation reveals issues related to data quality, including wrong and missing data. Your team decides to implement a data validation tool to address these issues. Considering the constraints of budget, time, and the need for scalability, which of the following issues is NOT directly related to Data Validation? Choose the best option.
A
Duplicate examples in the dataset, which could skew the model's understanding of feature importance.
B
Feature values that fall outside expected ranges, potentially leading to inaccurate model predictions.
C
Instances where values are missing from the dataset, affecting the model's ability to learn from all available information.
D
Incorrect labels in the training data, which directly mislead the model during the learning process.
E
The presence of categorical variables that require encoding before being used in the model.