
Answer-first summary for fast verification
Answer: Apply one-hot encoding to the categorical variables in the test data using the encoder fitted on the training set, ensuring the feature space is consistent.
The correct approach is to apply one-hot encoding to the test data using the same encoder fitted on the training set. This ensures that the feature space is consistent between the training and test sets, which is crucial for the model's performance and scalability in a production environment. Ignoring the missing category or redistributing the data does not guarantee the resolution of the missing category issue and can lead to inconsistent predictions. Collecting additional data, while theoretically ideal, may not be practical or timely for deployment.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
You are developing a machine learning model to predict customer churn for a telecommunications company. The dataset includes categorical variables such as 'service type' and 'customer region'. After splitting your dataset equally into training and test sets, you apply one-hot encoding to the categorical variables in the training set. During the preprocessing of the test set, you notice that the 'customer region' variable in the test set is missing a category that was present in the training set. The model is expected to be deployed in a production environment where scalability and consistency in predictions are critical. Given these constraints, what is the BEST course of action to ensure the model's performance is not compromised? Choose one correct option.
A
Ignore the missing category in the test set and proceed with the model training and evaluation, assuming the impact will be minimal.
B
Redistribute the data randomly, allocating 70% to training and 30% to testing, to increase the chances of all categories being represented.
C
Collect additional data specifically for the missing category to ensure all categories are represented in the test set.
D
Apply one-hot encoding to the categorical variables in the test data using the encoder fitted on the training set, ensuring the feature space is consistent.
No comments yet.