
**Answer (for fast verification):** C — transformer-based language models use self-attention mechanisms to capture contextual relationships.
## Explanation of the Correct Answer

**C** is the correct answer because transformer-based language models fundamentally rely on **self-attention mechanisms** to capture contextual relationships within input sequences. This architectural innovation allows transformers to:

1. **Process entire sequences in parallel** rather than sequentially, enabling more efficient training and inference compared to recurrent neural networks (RNNs).
2. **Capture long-range dependencies** by allowing each token in the sequence to attend to all other tokens, regardless of their positional distance. This is crucial for understanding complex linguistic structures where meaning depends on relationships between distant words.
3. **Compute attention weights dynamically** based on the relevance between tokens, enabling the model to focus on the most important parts of the input when generating outputs.

## Analysis of Incorrect Options

**A** is incorrect because transformer-based models do **not** use convolutional layers as their primary mechanism. While some hybrid architectures exist, the core innovation of transformers is self-attention, not convolutional operations. Convolutional neural networks (CNNs) are better suited to capturing local patterns in grid-like data (e.g., images), not the global contextual relationships essential for language understanding.

**B** is incorrect because transformer-based models are **not** limited to text data. While originally designed for natural language processing tasks, transformer architectures have been successfully adapted to various modalities, including:

- **Vision transformers (ViTs)** for image classification
- **Multimodal transformers** that process both text and images (e.g., CLIP, DALL-E)
- **Audio transformers** for speech recognition and generation
- **Time-series transformers** for sequential data analysis

**D** is incorrect because transformers do **not** process data sequences one element at a time in cyclic iterations.
This description characterizes **recurrent neural networks (RNNs)** and their variants (LSTMs, GRUs), which process sequences sequentially with hidden states that carry information forward. In contrast, transformers process all tokens in parallel through self-attention, making them more computationally efficient for long sequences and better at capturing long-range dependencies without the vanishing-gradient problems that plague RNNs.

## Key Distinguishing Features of Transformers

1. **Self-attention**: The core mechanism that computes relationships between all pairs of tokens in a sequence.
2. **Positional encoding**: Since transformers process tokens in parallel (losing inherent sequential information), they use positional encodings to inject information about token order.
3. **Multi-head attention**: Multiple attention heads allow the model to focus on different types of relationships simultaneously (e.g., syntactic, semantic).
4. **Feed-forward networks**: Applied independently to each position after the attention layers.
5. **Layer normalization and residual connections**: Help with training stability and gradient flow.

These characteristics make transformers particularly effective for language-modeling tasks where understanding context and relationships between words is paramount.
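To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is illustrative only (no masking, no multi-head splitting, and random projection matrices in place of learned weights); the function and variable names are our own, not from any particular library. Note how every token's output is a weighted mix of *all* tokens' value vectors, computed in one parallel matrix operation:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers to be matched against
    V = X @ W_v  # values: the content each token contributes
    d_k = K.shape[-1]
    # Pairwise relevance between all tokens: shape (seq_len, seq_len).
    # Scaling by sqrt(d_k) keeps the softmax from saturating.
    scores = (Q @ K.T) / np.sqrt(d_k)
    # Softmax per row turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted combination of all value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # (4, 8): one contextualized vector per token
```

Because `scores` covers every token pair at once, the whole sequence is processed in a single pass, which is exactly the parallelism that distinguishes transformers from the one-step-at-a-time recurrence of RNNs.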
Author: LeetQuiz Editorial Team
**Question:** What is a characteristic of transformer-based language models?

A. Transformer-based language models use convolutional layers to apply filters across an input to capture local patterns through filtered views.
B. Transformer-based language models can process only text data.
C. Transformer-based language models use self-attention mechanisms to capture contextual relationships.
D. Transformer-based language models process data sequences one element at a time in cyclic iterations.