
**Answer (for fast verification):** They allow parallel computation and handle long-term dependencies better (Option C).
## Explanation

Transformers have several key advantages over RNN-based models:

### 1. **Parallel Computation**

- RNNs process sequences sequentially (one token at a time), which makes them inherently slow to train.
- Transformers process all tokens in a sequence simultaneously through self-attention mechanisms, enabling parallel computation and significantly faster training.

### 2. **Better Handling of Long-Term Dependencies**

- RNNs suffer from vanishing/exploding gradient problems on long sequences, making it difficult to capture long-range dependencies.
- Transformers use self-attention mechanisms that can directly connect any two positions in the sequence, regardless of distance, allowing them to capture long-term dependencies more effectively.

### 3. **Why the Other Options Are Incorrect**

- **Option A is incorrect**: Transformers do NOT process input sequentially; sequential processing is a characteristic of RNNs.
- **Option B is incorrect**: Transformers do not rely on convolution filters; they use attention mechanisms.
- **Option D is incorrect**: Transformers typically have MORE parameters than RNNs due to their attention mechanisms and feed-forward networks.

### Key Transformer Features

- **Self-Attention**: Allows the model to weigh the importance of different words in a sequence relative to each other
- **Positional Encoding**: Injects information about word order, since Transformers don't process tokens sequentially
- **Multi-Head Attention**: Enables the model to focus on different parts of the sequence simultaneously

This parallel processing capability and superior handling of long-range dependencies make Transformers particularly well-suited for large-scale language modeling tasks.
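To make the first two points concrete, here is a minimal NumPy sketch (not a full Transformer block) of scaled dot-product self-attention combined with a sinusoidal positional encoding. Every position attends to every other position through a few matrix multiplications, with no token-by-token recurrence; the sequence length, model width, and random weight matrices below are illustrative assumptions, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positional_encoding(seq_len, d_model):
    # Injects word-order information, since self-attention alone is order-agnostic.
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

def self_attention(X, Wq, Wk, Wv):
    # All tokens are projected and compared in a handful of matrix multiplications,
    # so every position attends to every other position in parallel --
    # there is no step-by-step recurrence as in an RNN.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len): direct links between any two positions
    weights = softmax(scores, axis=-1) # attention weights for each position
    return weights @ V                 # weighted mix of all value vectors

# Toy usage: 6 tokens, model width 8 (illustrative sizes only).
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8): every position updated in one parallel pass
```

Running the sketch prints `(6, 8)`: all six positions are updated in a single parallel pass, and the attention score matrix links any two positions directly regardless of how far apart they are, which is the property behind both the faster training and the better handling of long-range dependencies discussed above.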
Author: Ritesh Yadav
**Question:** What is a key advantage of Transformers over RNN-based models?

- A. They process input sequentially, maintaining word order
- B. They rely on convolution filters for speed
- C. They allow parallel computation and handle long-term dependencies better
- D. They require fewer parameters