
In a machine learning project at a telecommunications company aiming to predict customer churn, the team has a large dataset. Despite its size, the team is concerned that the model may underperform on new, unseen data because of pitfalls in the dataset's composition or the model's design. Given the importance of dataset representativeness and model generalizability, which of the following scenarios most accurately illustrates a common pitfall that could cause the model to underperform on new, unseen data? (Choose one correct option)
A. Utilizing a highly complex model with an excessive number of features without considering the risk of overfitting, despite having a large dataset.
B. Following industry best practices meticulously to ensure the model's robustness and generalizability, including cross-validation and feature selection.
C. Collecting a dataset that, while large, is not representative of the problem space, leading to poor generalization because it lacks diversity in customer behaviors and demographics.
D. Applying advanced regularization techniques to prevent the model from learning noise in the training data, without first ensuring the dataset's representativeness.
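
The distinction between the pitfalls in options A and C can be made concrete in code. The sketch below is a minimal illustration, assuming scikit-learn and a synthetic stand-in for a churn dataset (the data and model settings are hypothetical, not from the question): a large gap between training and cross-validated accuracy signals overfitting (option A), while cross-validation alone cannot reveal that the sample itself is unrepresentative of real customers (option C).

```python
# Minimal sketch: how cross-validation exposes overfitting but not
# a non-representative sample. Assumes scikit-learn is installed;
# the synthetic "churn" data below is a hypothetical stand-in.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary-classification data standing in for churn records:
# many features, only a few of them informative.
X, y = make_classification(n_samples=5000, n_features=50,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A deliberately flexible model (unpruned trees) prone to memorizing noise.
model = RandomForestClassifier(max_depth=None, random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)          # accuracy on seen data
cv_acc = cross_val_score(model, X_train, y_train,  # held-out folds of the
                         cv=5).mean()              # same sample
test_acc = model.score(X_test, y_test)             # separate test split

# A large train-vs-CV gap is the classic signature of overfitting.
# If CV accuracy were high but accuracy on genuinely new customers were
# low, the sample itself would be the suspect (option C): both CV folds
# and the test split are drawn from the same unrepresentative pool.
print(f"train={train_acc:.3f}  cv={cv_acc:.3f}  test={test_acc:.3f}")
```

This is why option C is the pitfall cross-validation cannot catch: every fold inherits the same sampling bias, so the estimated performance looks healthy right up until the model meets customers the dataset never covered.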