You have developed a language understanding model for a virtual assistant that can handle various intents such as 'play_music', 'set_alarm', and 'get_weather'. You want to test the model's performance using a set of test data. Which of the following evaluation strategies would be most appropriate for this scenario?
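For context, one common way to score an intent classifier on held-out test data is to compute precision, recall, and F1 for each intent, alongside overall accuracy. The sketch below is only an illustration under stated assumptions: the helper `per_intent_metrics` and the six test labels are hypothetical and are not part of the question or of any particular platform's evaluation tooling.

```python
from collections import Counter


def per_intent_metrics(y_true, y_pred):
    """Return {intent: (precision, recall, f1)} from gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted intent gains a false positive
            fn[gold] += 1  # gold intent gains a false negative

    metrics = {}
    for intent in sorted(set(y_true) | set(y_pred)):
        p_den = tp[intent] + fp[intent]
        r_den = tp[intent] + fn[intent]
        precision = tp[intent] / p_den if p_den else 0.0
        recall = tp[intent] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[intent] = (precision, recall, f1)
    return metrics


# Hypothetical held-out test set: gold intents vs. model predictions.
y_true = ["play_music", "play_music", "set_alarm",
          "set_alarm", "get_weather", "get_weather"]
y_pred = ["play_music", "set_alarm", "set_alarm",
          "set_alarm", "get_weather", "play_music"]

metrics = per_intent_metrics(y_true, y_pred)
accuracy = sum(g == p for g, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

for intent, (precision, recall, f1) in metrics.items():
    print(f"{intent}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

Per-intent scores show which intents the model confuses with one another, which a single accuracy figure can hide when the test set is imbalanced across intents.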