singhsidhukuldeep
posted an update May 12, 2024
You are all happy 😊 that @meta-llama released Llama 3.

Then you are sad 😔 that it only has a context length of 8k.

Then you are happy 😄 that you can just scale Llama-3 PoSE to 96k without training, needing only to modify max_position_embeddings and rope_theta.
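For anyone wondering what touching those two fields actually does, here is a minimal sketch in plain Python. The base values (8192 and 500000.0) are the published Llama-3-8B settings; the extended rope_theta of 8,000,000 is only a placeholder chosen for illustration, not the value any released model uses, and reading "96k" as 96 * 1024 is an assumption too. The helper names are made up for this example.

```python
import math
from copy import deepcopy

# Published Llama-3-8B settings; everything "extended" below is illustrative.
BASE_CONFIG = {"max_position_embeddings": 8192, "rope_theta": 500000.0}


def extend_context(config, new_max_positions, new_rope_theta):
    """Return a patched copy of a config dict (hypothetical helper)."""
    patched = deepcopy(config)
    patched["max_position_embeddings"] = new_max_positions
    patched["rope_theta"] = new_rope_theta
    return patched


def longest_rope_wavelength(rope_theta, head_dim=128):
    """Wavelength, in tokens, of the slowest-rotating RoPE dimension pair.

    RoPE rotates pair i at frequency rope_theta ** (-2 * i / head_dim),
    so the slowest pair is i = head_dim / 2 - 1.
    """
    slowest_freq = rope_theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_freq


# Assumption: "96k" means 96 * 1024 positions; 8_000_000.0 is a placeholder.
extended = extend_context(BASE_CONFIG, 96 * 1024, 8_000_000.0)

base_wavelength = longest_rope_wavelength(BASE_CONFIG["rope_theta"])
extended_wavelength = longest_rope_wavelength(extended["rope_theta"])
print(f"longest wavelength: {base_wavelength:,.0f} -> {extended_wavelength:,.0f} tokens")
```

A larger rope_theta slows down every rotation, so positions far beyond 8k still land on angles the model has roughly seen before. That is the intuition, at least; it says nothing about how well the model uses that extra context.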

But then you are sad 😒 that it only improves the model's long-context retrieval performance (i.e., finding needles) while hardly improving its long-context utilization capability (QA and summarization).

But then you are happy 😁 that the @GradientsTechnologies community has released Llama-3-8B-Instruct-262K with long context (262k-1M+).

Now we have another paper, "Extending Llama-3's Context Ten-Fold Overnight" 📜.

The context length of Llama-3-8B-Instruct is extended from 8K to 80K using QLoRA fine-tuning ⚙️.
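A rough back-of-the-envelope sketch of why QLoRA keeps this cheap: the base weights sit frozen in 4-bit, and only small low-rank adapters are trained. The hidden size of 4096 is Llama-3-8B's; the rank of 32 and the round 8e9 parameter count are assumptions for illustration, not the paper's settings, and the memory estimate ignores quantization constants and any layers left unquantized.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def four_bit_weight_bytes(n_params):
    """Approximate storage for weights held at 4 bits each."""
    return n_params * 4 / 8


HIDDEN = 4096  # Llama-3-8B hidden size
RANK = 32      # hypothetical LoRA rank, chosen only for this example

full_params = HIDDEN * HIDDEN                              # one full projection matrix
lora_params = lora_trainable_params(HIDDEN, HIDDEN, RANK)  # its LoRA adapter
fraction = lora_params / full_params

frozen_bytes = four_bit_weight_bytes(8e9)  # assumption: roughly 8e9 base parameters

print(f"trainable per projection: {lora_params:,} of {full_params:,} ({fraction:.2%})")
print(f"frozen 4-bit weights: about {frozen_bytes / 1e9:.1f} GB")
```

The trainable share scales linearly with the rank, so the numbers move if the actual rank differs from the one assumed here.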

The training cycle is highly efficient, taking "only" 😂 8 hours on a single 8xA800 (80G) GPU machine.

The model also preserves its original capability over short contexts. ✅

The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4. 📊

The paper suggests that the context length could be extended far beyond 80K with more computation resources (😅 GPU-poor).

The team plans to publicly release all resources, including data, model, data generation pipeline, and training code, to facilitate future research from the ❤️ community.

Paper: https://arxiv.org/abs/2404.19553

This is where we are... until next time... 🌟
