
Answer-first summary for fast verification
Answer: They allow parallel computation and handle long-term dependencies better
## Explanation

Transformers have several key advantages over RNN-based models:

### 1. **Parallel Computation**

- RNNs process sequences sequentially (one token at a time): each hidden state depends on the previous one, so the computation cannot be parallelized across time steps.
- Transformers use self-attention to process all tokens in a sequence at once during training, which maps efficiently onto parallel hardware (GPUs/TPUs).

### 2. **Better Handling of Long-Term Dependencies**

- RNNs suffer from vanishing/exploding gradients on long sequences, which makes long-range dependencies difficult to capture.
- Self-attention connects any two positions in the sequence directly, regardless of the distance between them, so Transformers capture long-term dependencies more effectively.

### 3. **Why Other Options Are Incorrect**

- **Option A**: This describes RNNs, not Transformers. RNNs process input sequentially, while Transformers process all tokens in parallel.
- **Option B**: Transformers don't use convolution filters; they use attention mechanisms. Convolution filters are used in CNNs.
- **Option D**: Transformers do not require fewer parameters. In practice they typically have more parameters than RNNs, because of their attention projections and many stacked layers.

### 4. **Additional Advantages**

- **Scalability**: Transformers tend to scale better with larger datasets and model sizes.
- **Global Context**: Self-attention gives each token access to the whole sequence, whereas an RNN must compress all earlier context into a fixed-size hidden state.
- **Training Efficiency**: Parallel processing makes training faster and more efficient.

These architectural advantages are why Transformers have become the foundation for most state-of-the-art NLP models, such as BERT, GPT, and T5.
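The difference between sequential recurrence and order-independent attention can be illustrated with a minimal sketch. It is not a real Transformer or RNN: it assumes queries, keys, and values all equal the raw embeddings (no learned weights), a toy elementwise recurrence with hypothetical weights `w_in` and `w_rec`, and a made-up four-token input. The names `attend` and `rnn_states` are introduced here for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(i, xs):
    """Scaled dot-product self-attention output for position i.

    Assumes Q = K = V = xs (no learned projections). The result depends
    only on the inputs, never on the output at another position.
    """
    d = len(xs[i])
    scores = [dot(xs[i], xs[j]) / math.sqrt(d) for j in range(len(xs))]
    weights = softmax(scores)
    out = [sum(w * xs[j][k] for j, w in enumerate(weights)) for k in range(d)]
    return out, weights

def rnn_states(xs, w_in=0.5, w_rec=0.9):
    """Toy elementwise RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:  # step t cannot start until step t-1 has finished
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states

# Hypothetical 4-token sequence of 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

# Attention: positions can be computed in any order (here, reversed)
# with identical results, which is what makes parallel execution possible.
forward = [attend(i, tokens)[0] for i in range(len(tokens))]
backward = [attend(i, tokens)[0] for i in reversed(range(len(tokens)))][::-1]
assert forward == backward

# The last token attends to the first token directly, in a single step.
_, last_weights = attend(len(tokens) - 1, tokens)
print("weight from last token to first token:", round(last_weights[0], 3))

# The RNN reaches the last token only after stepping through every earlier one.
states = rnn_states(tokens)
print("final RNN state:", [round(v, 3) for v in states[-1]])
```

In the attention half, every call to `attend` is independent of the others, so a framework is free to batch them into one matrix operation. In the recurrent half, the loop carries `h` from one iteration to the next, so information from the first token reaches the last only through every intermediate step.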
Author: Jin H
What is a key advantage of Transformers over RNN-based models?
A. They process input sequentially, maintaining word order
B. They rely on convolution filters for speed
C. They allow parallel computation and handle long-term dependencies better
D. They require fewer parameters