leo-pekelis-gradient
committed on
Update README.md
README.md CHANGED
@@ -11,17 +11,17 @@ tags:
 
 This model extends LLama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
 
-Approach
+**Approach:**
 
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
 - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
 - progressive training on increasing context lengths similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2]
 
-Infra
+**Infra:**
 
 We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high performance L40S cluster.
 
-Data
+**Data:**
 
 For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 
@@ -55,6 +55,7 @@ Gradient is accelerating AI transformation across industries. https://gradient.a
 
 Drop an email to [[email protected]](mailto:[email protected])
 
+----
 
 # Base Model
 
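The Approach described in the diff has two levers: an NTK-aware initialization of RoPE theta [1] and progressive training on increasingly long contexts [2]. As a minimal sketch of what that initialization computes, the snippet below applies the usual NTK-aware rule theta' = theta * s^(d/(d-2)) and walks a few stages of growing context length. The stage lengths and the trainer hook are illustrative assumptions, and the card's data-driven theta optimization step is not reproduced here.

```python
# Illustrative sketch only: NTK-aware RoPE theta scaling plus a progressive
# length schedule. Stage lengths and the trainer call are hypothetical;
# the released checkpoints further optimize theta on data.

def ntk_scaled_rope_theta(base_theta: float, orig_ctx: int, target_ctx: int, head_dim: int) -> float:
    """NTK-aware interpolation [1]: theta' = theta * s**(d / (d - 2)), with s = target/orig."""
    s = target_ctx / orig_ctx
    return base_theta * s ** (head_dim / (head_dim - 2))

# Llama-3-8B ships with rope_theta = 500000, head_dim = 128, and an 8192-token context.
for ctx_len in (65_536, 131_072, 262_144):          # progressive stages (illustrative values)
    theta = ntk_scaled_rope_theta(500_000.0, 8_192, ctx_len, 128)
    print(f"train stage at ctx={ctx_len:>7}: init rope_theta ~ {theta:,.0f}")
    # train_one_stage(model, ctx_len, rope_theta=theta)   # hypothetical trainer hook
```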
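The Infra section credits EasyContext's Blockwise RingAttention [3] for making 256k-token training tractable. The sketch below is not the EasyContext API; it is a single-device simulation of the blockwise online-softmax accumulation that ring attention distributes across devices (each device keeps one key/value block and the blocks rotate around a ring), with the causal mask omitted for brevity.

```python
# Minimal sketch of blockwise attention with an online softmax, the primitive
# behind Blockwise RingAttention [3]. Single device, no causal mask, for clarity.
import torch

def blockwise_attention(q, k, v, block_size=1024):
    """Compute softmax(q k^T / sqrt(d)) v one key/value block at a time."""
    T, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((T, 1), float("-inf"))
    row_sum = torch.zeros(T, 1)
    for start in range(0, T, block_size):           # in ring attention, each step is a block from the next device
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ kb.T) * scale                 # (T, block)
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        corr = torch.exp(row_max - new_max)         # rescale previously accumulated partial results
        p = torch.exp(scores - new_max)
        row_sum = row_sum * corr + p.sum(dim=-1, keepdim=True)
        out = out * corr + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(4096, 64) for _ in range(3))
assert torch.allclose(blockwise_attention(q, k, v),
                      torch.softmax(q @ k.T * 64 ** -0.5, dim=-1) @ v, atol=1e-4)
```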
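For the Data step, the card says long contexts are generated by augmenting SlimPajama. One plausible form of such augmentation, assumed here for illustration rather than taken from the card, is packing tokenized documents into fixed-length long sequences. The dataset and tokenizer identifiers are the public Hub names; the packing helper is hypothetical.

```python
# Illustrative sketch: pack short SlimPajama documents into long-context training
# examples. The packing scheme is an assumption, not the documented pipeline.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # gated repo: needs Hub access
docs = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def pack_to_length(documents, target_len=262_144):
    """Concatenate tokenized documents until each yielded example holds target_len tokens."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer(doc["text"])["input_ids"])
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]

long_example = next(pack_to_length(docs))
print(len(long_example))  # 262144 tokens: one 256k-context training sequence
```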