
Answer-first summary for fast verification
Answer: Large multi-modal language model
The question describes an application that must accept both text and image inputs (users can enter text or upload a picture of a question) and generate a written answer with an explanation. This requires a model capable of multimodal understanding and text generation.

**Analysis of options:**

- **A: Computer vision model** - Primarily processes and analyzes visual data (images/videos) but lacks the natural language generation capabilities needed to produce written answers and explanations. While it could interpret images, it cannot generate the required text output.
- **B: Large multi-modal language model** - This is the optimal choice. Large multi-modal language models (MLLMs) are specifically designed to handle multiple input modalities (such as text and images) and generate coherent text responses. They combine computer vision capabilities to interpret visual content with natural language processing to understand and generate text, making them ideal for tasks requiring both image understanding and textual explanation.
- **C: Diffusion model** - Primarily used for generating images from text prompts or other inputs through a denoising process. While some diffusion models can incorporate text conditioning, they are not designed for the multimodal input understanding and text generation required here.
- **D: Text-to-speech model** - Converts text input into spoken audio output, which is the opposite of what's needed. The application requires generating written text, not converting text to speech.

**Conclusion:** Only a large multi-modal language model can simultaneously process both text and image inputs while generating the required written answer with explanation. This aligns with AWS AI services like Amazon Bedrock's multimodal models (e.g., Claude 3 models) that support vision and language tasks.
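
The elimination reasoning above can be sketched as a simple capability check. This is only an illustration: the `ModelType` class, the `OPTIONS` table, and the modality labels are hypothetical names introduced here, and the capabilities listed are a deliberate simplification of each model type, not a description of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelType:
    """Hypothetical, simplified description of a model type."""
    name: str
    inputs: frozenset   # modalities the model type accepts
    outputs: frozenset  # modalities the model type produces


# Illustrative capability table (an assumption made for this sketch).
OPTIONS = {
    "A": ModelType("Computer vision model",
                   frozenset({"image"}), frozenset({"labels"})),
    "B": ModelType("Large multi-modal language model",
                   frozenset({"text", "image"}), frozenset({"text"})),
    "C": ModelType("Diffusion model",
                   frozenset({"text"}), frozenset({"image"})),
    "D": ModelType("Text-to-speech model",
                   frozenset({"text"}), frozenset({"audio"})),
}

# Requirements taken from the question: text or image in, written text out.
REQUIRED_INPUTS = frozenset({"text", "image"})
REQUIRED_OUTPUTS = frozenset({"text"})


def matching_options(options, required_inputs, required_outputs):
    """Return the option keys whose model type covers every required modality."""
    return [
        key
        for key, model in options.items()
        if required_inputs <= model.inputs and required_outputs <= model.outputs
    ]


print(matching_options(OPTIONS, REQUIRED_INPUTS, REQUIRED_OUTPUTS))  # ['B']
```

Under these simplified assumptions, only option B accepts both required input modalities and produces text, which mirrors the conclusion of the analysis.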
Author: LeetQuiz Editorial Team
An education company is developing an application that allows users to input text or upload an image of a question. The application must output a written answer along with an explanation for that answer.
Which model type meets these requirements?
A: Computer vision model
B: Large multi-modal language model
C: Diffusion model
D: Text-to-speech model