leo-pekelis-gradient
committed on
Update README.md
README.md CHANGED
@@ -11,17 +11,17 @@ tags:
 
 This model extends LLama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
 
-Approach
+**Approach:**
 
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
 - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
 - progressive training on increasing context lengths similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2]
 
-Infra
+**Infra:**
 
 We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high performance L40S cluster.
 
-Data
+**Data:**
 
 For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 
@@ -55,6 +55,7 @@ Gradient is accelerating AI transformation across industries. https://gradient.a
 
 Drop an email to [[email protected]](mailto:[email protected])
 
+----
 
 # Base Model
 
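The Approach described in the diff has two levers: an NTK-aware initialization of RoPE theta [1] and progressive training on increasingly long contexts [2]. As a minimal sketch of what that initialization computes, the snippet below applies the usual NTK-aware rule theta' = theta * s^(d/(d-2)) and walks a few stages of growing context length. The stage lengths and the trainer hook are illustrative assumptions, and the card's data-driven theta optimization step is not reproduced here.

```python
# Illustrative sketch only: NTK-aware RoPE theta scaling plus a progressive
# length schedule. Stage lengths and the trainer call are hypothetical;
# the released checkpoints further optimize theta on data.

def ntk_scaled_rope_theta(base_theta: float, orig_ctx: int, target_ctx: int, head_dim: int) -> float:
    """NTK-aware interpolation [1]: theta' = theta * s**(d / (d - 2)), with s = target/orig."""
    s = target_ctx / orig_ctx
    return base_theta * s ** (head_dim / (head_dim - 2))

# Llama-3-8B ships with rope_theta = 500000, head_dim = 128, and an 8192-token context.
for ctx_len in (65_536, 131_072, 262_144):          # progressive stages (illustrative values)
    theta = ntk_scaled_rope_theta(500_000.0, 8_192, ctx_len, 128)
    print(f"train stage at ctx={ctx_len:>7}: init rope_theta ~ {theta:,.0f}")
    # train_one_stage(model, ctx_len, rope_theta=theta)   # hypothetical trainer hook
```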
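The Infra section credits EasyContext's Blockwise RingAttention [3] for making 256k-token training tractable. The sketch below is not the EasyContext API; it is a single-device simulation of the blockwise online-softmax accumulation that ring attention distributes across devices (each device keeps one key/value block and the blocks rotate around a ring), with the causal mask omitted for brevity.

```python
# Minimal sketch of blockwise attention with an online softmax, the primitive
# behind Blockwise RingAttention [3]. Single device, no causal mask, for clarity.
import torch

def blockwise_attention(q, k, v, block_size=1024):
    """Compute softmax(q k^T / sqrt(d)) v one key/value block at a time."""
    T, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((T, 1), float("-inf"))
    row_sum = torch.zeros(T, 1)
    for start in range(0, T, block_size):           # in ring attention, each step is a block from the next device
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ kb.T) * scale                 # (T, block)
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        corr = torch.exp(row_max - new_max)         # rescale previously accumulated partial results
        p = torch.exp(scores - new_max)
        row_sum = row_sum * corr + p.sum(dim=-1, keepdim=True)
        out = out * corr + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(4096, 64) for _ in range(3))
assert torch.allclose(blockwise_attention(q, k, v),
                      torch.softmax(q @ k.T * 64 ** -0.5, dim=-1) @ v, atol=1e-4)
```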
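For the Data step, the card says long contexts are generated by augmenting SlimPajama. One plausible form of such augmentation, assumed here for illustration rather than taken from the card, is packing tokenized documents into fixed-length long sequences. The dataset and tokenizer identifiers are the public Hub names; the packing helper is hypothetical.

```python
# Illustrative sketch: pack short SlimPajama documents into long-context training
# examples. The packing scheme is an assumption, not the documented pipeline.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # gated repo: needs Hub access
docs = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def pack_to_length(documents, target_len=262_144):
    """Concatenate tokenized documents until each yielded example holds target_len tokens."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer(doc["text"])["input_ids"])
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]

long_example = next(pack_to_length(docs))
print(len(long_example))  # 262144 tokens: one 256k-context training sequence
```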