
Answer-first summary for fast verification
Answer: F1 score
## Analysis of the Question

The question asks which metric should be used to evaluate the accuracy of a **binary classification model** (customer churn prediction) after one week in production, specifically comparing predictions to actual outcomes.

## Evaluation of Each Option

- **A: Root mean squared error (RMSE)** - This is primarily used for **regression problems** (predicting continuous values), not for classification problems like churn prediction. It measures the average magnitude of errors between predicted and actual numerical values, making it unsuitable for this binary classification scenario.
- **B: Return on investment (ROI)** - This is a **business/financial metric**, not a machine learning evaluation metric. While important for assessing the business impact of the model, it does not directly measure prediction accuracy against actual outcomes.
- **C: F1 score** - This is the **correct choice** for binary classification problems like churn prediction. The F1 score is the harmonic mean of precision and recall, providing a balanced measure that accounts for both false positives and false negatives. This is particularly important in churn prediction because:
  - **Class imbalance** is common (typically far more non-churners than churners)
  - Both types of errors have significant business implications
  - It directly addresses the requirement to compare predictions to actual outcomes
- **D: Bilingual Evaluation Understudy (BLEU) score** - This is used for **natural language processing tasks**, specifically for evaluating machine translation quality. It has no relevance to binary classification problems like churn prediction.

## Why F1 Score Is the Best Choice

1. **Binary classification context**: Customer churn prediction is fundamentally a binary classification problem (churn vs. no churn), and the F1 score is specifically designed for such scenarios.
2. **Handles class imbalance**: Financial datasets often have imbalanced classes in which churners are a minority; in such cases the F1 score provides a more meaningful evaluation than simple accuracy.
3. **Balances precision and recall**: In churn prediction, **precision** ensures that when the model predicts churn it is likely correct (avoiding false alarms), while **recall** ensures that actual churners are identified (avoiding missed retention opportunities). The F1 score balances these competing concerns.
4. **Direct comparison requirement**: The F1 score is computed by comparing model predictions against actual outcomes, which is exactly what the question specifies.
5. **Production evaluation**: After one week in production, the F1 score gives a realistic assessment of how well the model performs on real-world data.

## Conclusion

For evaluating a binary classification model such as customer churn prediction, the F1 score is the most appropriate metric: it provides a balanced assessment of prediction accuracy against actual outcomes, handles class imbalance effectively, and is specifically designed for classification problems.
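The points above about class imbalance can be made concrete with a small sketch. The data below is invented for illustration (90 customers who stayed, 10 who churned; it is not from the question): a model that predicts "no churn" for everyone reaches 90% accuracy yet has an F1 of zero, while F1 = 2PR/(P + R) rewards a model only when it actually identifies churners.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 = churned and 0 = stayed."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall (0 if both are 0).
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative week of outcomes: 90 stayers followed by 10 churners.
y_true = [0] * 90 + [1] * 10

# A useless model that predicts "no churn" for everyone still gets 90% accuracy.
y_pred_naive = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred_naive)) / len(y_true)
print(accuracy)                        # 0.9
print(f1_score(y_true, y_pred_naive))  # 0.0 - F1 exposes the useless model

# A model that catches 6 of 10 churners with 2 false alarms: precision 0.75, recall 0.6.
y_pred_model = [0] * 88 + [1] * 2 + [1] * 6 + [0] * 4
print(f"{f1_score(y_true, y_pred_model):.3f}")
```

Accuracy would rank the naive model above random guessing; F1 correctly scores it as worthless, which is why it is preferred for imbalanced problems like churn.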
Author: LeetQuiz Editorial Team
Which metric should a financial company use to assess the accuracy of its customer churn prediction model after one week in production, comparing predictions to actual outcomes?
A. Root mean squared error (RMSE)
B. Return on investment (ROI)
C. F1 score
D. Bilingual Evaluation Understudy (BLEU) score