Extending Llama-3's Context Ten-Fold Overnight
Abstract
We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on a single machine with 8 A800 (80G) GPUs. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as needle-in-a-haystack (NIHS), topic retrieval, and long-context language understanding; meanwhile, it also preserves the original capability over short contexts well. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computation resources. Therefore, the team will publicly release all resources (including data, model, data generation pipeline, and training code) to facilitate future research from the community: https://github.com/FlagOpen/FlagEmbedding.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Make Your LLM Fully Utilize the Context (2024)
- LLoCO: Learning Long Contexts Offline (2024)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks (2024)
- Long-context LLMs Struggle with Long In-context Learning (2024)
- LongEmbed: Extending Embedding Models for Long Context Retrieval (2024)
Well, but Llama-3 PoSE can be scaled up to 96K without training; it only requires modifying max_position_embeddings and rope_theta. Please correct me if I'm wrong.
Hi! Increasing rope_theta alone improves only the model's long-context retrieval performance (i.e., finding needles); it hardly improves its long-context utilization capability (e.g., QA and summarization). Evidence here:
Besides, increasing rope_theta alone significantly degrades the model's instruction-following capability. Discussion here: https://www.reddit.com/r/LocalLLaMA/comments/1chd8px/comment/l24gxp0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
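For readers unfamiliar with the two config fields discussed in this thread, here is a minimal sketch of what raising rope_theta does to the rotary position embedding. It is an illustration and not code from the paper or either commenter. The Llama-3-8B defaults (head dimension 128, rope_theta 500000, max_position_embeddings 8192) are the published config values; `SCALED_THETA`, the `overrides` dict, and both helper function names are hypothetical and chosen only for this example.

```python
import math


def rope_wavelengths(rope_theta: float, head_dim: int) -> list[float]:
    """Wavelength, in tokens, of each rotary frequency pair.

    Pair i rotates by pos * rope_theta ** (-2i / head_dim) radians,
    so it completes one full turn every
    2 * pi * rope_theta ** (2i / head_dim) tokens.
    """
    return [
        2 * math.pi * rope_theta ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def pairs_completing_a_turn(rope_theta: float, head_dim: int, context_len: int) -> int:
    """Count frequency pairs whose wavelength fits inside the context window."""
    return sum(1 for w in rope_wavelengths(rope_theta, head_dim) if w <= context_len)


# Llama-3-8B defaults: hidden size 4096 / 32 heads = 128 per head.
HEAD_DIM = 128
BASE_THETA = 500_000.0
BASE_CONTEXT = 8192

# Hypothetical larger base, for illustration only (not a value from this thread).
SCALED_THETA = 8_000_000.0

# Hypothetical config overrides of the kind the comment above describes.
overrides = {
    "max_position_embeddings": 96 * 1024,
    "rope_theta": SCALED_THETA,
}

if __name__ == "__main__":
    for theta in (BASE_THETA, SCALED_THETA):
        longest = rope_wavelengths(theta, HEAD_DIM)[-1]
        n = pairs_completing_a_turn(theta, HEAD_DIM, BASE_CONTEXT)
        print(f"rope_theta={theta:.0f}: longest wavelength ~{longest:.0f} tokens, "
              f"{n} of {HEAD_DIM // 2} pairs complete a turn within {BASE_CONTEXT} tokens")
```

A larger rope_theta stretches every wavelength except the first, so positions far beyond 8K map to rotation angles closer to those seen in pre-training. This changes only the position encoding; it says nothing about whether the model can make use of the longer context, which is the distinction drawn in the reply above.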