Microsoft Certified Azure AI Engineer Associate (AI-102) Quiz, LeetQuiz

You have developed a language understanding model for a virtual assistant that can handle various intents such as 'play_music', 'set_alarm', and 'get_weather'. You want to test the model's performance using a set of test data. Which of the following evaluation strategies would be most appropriate for this scenario?
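
For context, here is a minimal sketch of one common way to score an intent classifier on labeled test data: per-intent precision, recall, and F1. It is an illustration only and is not taken from the quiz's answer key. The function name `per_intent_metrics` and the sample utterance labels are hypothetical, and the predictions are made up for the example rather than produced by any real model.

```python
from collections import Counter


def per_intent_metrics(y_true, y_pred):
    """Return {intent: (precision, recall, f1)} from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this intent
        else:
            fp[pred] += 1    # predicted intent was wrongly chosen
            fn[truth] += 1   # true intent was missed

    metrics = {}
    for intent in sorted(set(y_true) | set(y_pred)):
        p_den = tp[intent] + fp[intent]
        r_den = tp[intent] + fn[intent]
        precision = tp[intent] / p_den if p_den else 0.0
        recall = tp[intent] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[intent] = (precision, recall, f1)
    return metrics


# Hypothetical test set: true intents and made-up model predictions.
y_true = ["play_music", "play_music", "set_alarm",
          "set_alarm", "get_weather", "get_weather"]
y_pred = ["play_music", "set_alarm", "set_alarm",
          "set_alarm", "get_weather", "play_music"]

results = per_intent_metrics(y_true, y_pred)
for intent, (precision, recall, f1) in results.items():
    print(f"{intent}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the scores per intent, rather than as a single accuracy figure, makes it visible when one intent (here, 'get_weather') is frequently missed even though the others are handled well.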