
Ultimate access to all questions.
You are developing a machine learning model to predict customer churn for a telecommunications company. The dataset includes categorical variables such as 'service type' and 'customer region'. After splitting your dataset equally into training and test sets, you apply one-hot encoding to the categorical variables in the training set. During the preprocessing of the test set, you notice that the 'customer region' variable in the test set is missing a category that was present in the training set. The model is expected to be deployed in a production environment where scalability and consistency in predictions are critical. Given these constraints, what is the BEST course of action to ensure the model's performance is not compromised? Choose one correct option.
A
Ignore the missing category in the test set and proceed with the model training and evaluation, assuming the impact will be minimal.
B
Redistribute the data randomly, allocating 70% to training and 30% to testing, to increase the chances of all categories being represented.
C
Collect additional data specifically for the missing category to ensure all categories are represented in the test set.
D
Apply one-hot encoding to the categorical variables in the test data using the encoder fitted on the training set, ensuring the feature space is consistent.