Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Abstract
Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
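For readers new to the two ingredients named in the abstract, here is a minimal sketch of each in plain Python. It is a generic illustration, not the paper's exact recurrent layer or attention implementation. The function names, the scalar state, and the sigmoid gate parameterization are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function, used here to squash a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, gate_logits, h0=0.0):
    """Scan h_t = a_t * h_{t-1} + (1 - a_t) * x_t over a scalar sequence.

    The gate a_t = sigmoid(gate_logit_t) depends only on the input at step t,
    not on h_{t-1}, so the update stays linear in the hidden state.
    (Assumed generic form; the paper's layer may differ in its details.)
    """
    if len(xs) != len(gate_logits):
        raise ValueError("xs and gate_logits must have the same length")
    h = h0
    outputs = []
    for x, g in zip(xs, gate_logits):
        a = sigmoid(g)
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs


def local_attention_mask(seq_len, window):
    """Return mask[i][j] == True when query i may attend to key j.

    Attention is causal (j <= i) and local (only the last `window` positions,
    including position i itself).
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    print(gated_linear_recurrence([1.0, 1.0], [0.0, 0.0]))
    for row in local_attention_mask(4, 2):
        print(row)
```

In this sketch the recurrence carries a fixed-size state regardless of sequence length, and the mask limits each position to a fixed window of keys. That is the general reason such layers are associated with cheap inference on long sequences.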
Community
This is awesome. New architecture - new possibilities!
And are these architectures more optimised for TPUs than GPUs?
And are you gonna release a comparison of Griffin 14B with Mixtral, which is almost a 13B model (2×7B MoE), though trained on far more tokens than 300B?
And why did you select Llama but not Mistral 7B for comparison? Maybe because there is no information about how many tokens it was trained on?
Not an author or anything, but yeah, they use Llama because they do a closer comparison with it (most papers do this, especially when they don't train a model fully).
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BlackMamba: Mixture of Experts for State-Space Models (2024)
- Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks (2024)
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts (2024)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (2024)
- Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models (2024)
Will these models and code be open-sourced?
Hawk & Griffin: Revolutionizing Language Models with Efficient Architecture
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/