AWS Certified AI Practitioner

Get started today

Ultimate access to all questions.

Explanation:

The question describes an application that must accept both text and image inputs (users can enter text or upload a picture of a question) and generate a written answer with an explanation. This requires a model capable of multimodal understanding and text generation.

Analysis of options:

A: Computer vision model - Primarily processes and analyzes visual data (images/videos) but lacks the natural language generation capabilities needed to produce written answers and explanations. While it could interpret images, it cannot generate the required text output.
B: Large multi-modal language model - This is the optimal choice. Large multi-modal language models (MLLMs) are specifically designed to handle multiple input modalities (such as text and images) and generate coherent text responses. They combine computer vision capabilities to interpret visual content with natural language processing to understand and generate text, making them ideal for tasks requiring both image understanding and textual explanation.
C: Diffusion model - Primarily used for generating images from text prompts or other inputs through a denoising process. While some diffusion models can incorporate text conditioning, they are not designed for the multimodal input understanding and text generation required here.
D: Text-to-speech model - Converts text input into spoken audio output, which is the opposite of what's needed. The application requires generating written text, not converting text to speech.

Conclusion: Only a large multi-modal language model can simultaneously process both text and image inputs while generating the required written answer with explanation. This aligns with AWS AI services like Amazon Bedrock's multimodal models (e.g., Claude 3 models) that support vision and language tasks.

Explanation:

Analysis of options:

A: Computer vision model - Primarily processes and analyzes visual data (images/videos) but lacks the natural language generation capabilities needed to produce written answers and explanations. While it could interpret images, it cannot generate the required text output.
B: Large multi-modal language model - This is the optimal choice. Large multi-modal language models (MLLMs) are specifically designed to handle multiple input modalities (such as text and images) and generate coherent text responses. They combine computer vision capabilities to interpret visual content with natural language processing to understand and generate text, making them ideal for tasks requiring both image understanding and textual explanation.
C: Diffusion model - Primarily used for generating images from text prompts or other inputs through a denoising process. While some diffusion models can incorporate text conditioning, they are not designed for the multimodal input understanding and text generation required here.
D: Text-to-speech model - Converts text input into spoken audio output, which is the opposite of what's needed. The application requires generating written text, not converting text to speech.

Comments (0)

No comments yet.

An education company is developing an application that allows users to input text or upload an image of a question. The application must output a written answer along with an explanation for that answer.

Which model type meets these requirements?

Exam-Like

Last updated: June 16, 2026 at 14:02

Computer vision model

5.3%

Large multi-modal language model

84.2%

Diffusion model

5.3%

Text-to-speech model

5.3%

AWS Certified AI Practitioner

Get started today

Comments (0)

Get started today

An education company is developing an application that allows users to input text or upload an image of a question. The application must output a written answer along with an explanation for that answer. Which model type meets these requirements?

Comments (0)

An education company is developing an application that allows users to input text or upload an image of a question. The application must output a written answer along with an explanation for that answer.

Which model type meets these requirements?