
**Answer (for fast verification):** C — transformer-based language models use self-attention mechanisms to capture contextual relationships.
## Explanation of the Correct Answer

**C** is the correct answer because transformer-based language models fundamentally rely on **self-attention mechanisms** to capture contextual relationships within input sequences. This architectural innovation allows transformers to:

1. **Process entire sequences in parallel** rather than sequentially, enabling more efficient training and inference compared to recurrent neural networks (RNNs).
2. **Capture long-range dependencies** by allowing each token in the sequence to attend to all other tokens, regardless of their positional distance. This is crucial for understanding complex linguistic structures where meaning depends on relationships between distant words.
3. **Compute attention weights dynamically** based on the relevance between tokens, enabling the model to focus on the most important parts of the input when generating outputs.

## Analysis of Incorrect Options

**A** is incorrect because transformer-based models do **not** use convolutional layers as their primary mechanism. While some hybrid architectures exist, the core innovation of transformers is self-attention, not convolutional operations. Convolutional neural networks (CNNs) are better suited to capturing local patterns in grid-like data (e.g., images), not the global contextual relationships essential for language understanding.

**B** is incorrect because transformer-based models are **not** limited to text data. While originally designed for natural language processing tasks, transformer architectures have been successfully adapted to various modalities, including:

- **Vision transformers (ViTs)** for image classification
- **Multimodal transformers** that process both text and images (e.g., CLIP, DALL-E)
- **Audio transformers** for speech recognition and generation
- **Time-series transformers** for sequential data analysis

**D** is incorrect because transformers do **not** process data sequences one element at a time in cyclic iterations.
This description characterizes **recurrent neural networks (RNNs)** and their variants (LSTMs, GRUs), which process sequences sequentially with hidden states that carry information forward. In contrast, transformers process all tokens in parallel through self-attention, making them more computationally efficient for long sequences and better at capturing long-range dependencies without the vanishing-gradient problems that plague RNNs.

## Key Distinguishing Features of Transformers

1. **Self-attention**: The core mechanism that computes relationships between all pairs of tokens in a sequence.
2. **Positional encoding**: Since transformers process tokens in parallel (losing inherent sequential information), they use positional encodings to inject information about token order.
3. **Multi-head attention**: Multiple attention heads allow the model to focus on different types of relationships simultaneously (e.g., syntactic, semantic).
4. **Feed-forward networks**: Applied independently to each position after the attention layers.
5. **Layer normalization and residual connections**: Help with training stability and gradient flow.

These characteristics make transformers particularly effective for language-modeling tasks where understanding context and relationships between words is paramount.
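To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is illustrative only (no masking, no multi-head splitting, and random projection matrices in place of learned weights); the function and variable names are our own, not from any particular library. Note how every token's output is a weighted mix of *all* tokens' value vectors, computed in one parallel matrix operation:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers to be matched against
    V = X @ W_v  # values: the content each token contributes
    d_k = K.shape[-1]
    # Pairwise relevance between all tokens: shape (seq_len, seq_len).
    # Scaling by sqrt(d_k) keeps the softmax from saturating.
    scores = (Q @ K.T) / np.sqrt(d_k)
    # Softmax per row turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted combination of all value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # (4, 8): one contextualized vector per token
```

Because `scores` covers every token pair at once, the whole sequence is processed in a single pass, which is exactly the parallelism that distinguishes transformers from the one-step-at-a-time recurrence of RNNs.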
Author: LeetQuiz Editorial Team
**Question:** What is a characteristic of transformer-based language models?

A. Transformer-based language models use convolutional layers to apply filters across an input to capture local patterns through filtered views.
B. Transformer-based language models can process only text data.
C. Transformer-based language models use self-attention mechanisms to capture contextual relationships.
D. Transformer-based language models process data sequences one element at a time in cyclic iterations.