Commit ab6d0a2 (verified), committed by leo-pekelis-gradient
1 Parent(s): 7976e88

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -11,17 +11,17 @@ tags:

  This model extends LLama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.

- Approach:
+ **Approach:**

  - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
  - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
  - progressive training on increasing context lengths similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2]

- Infra:
+ **Infra:**

  We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high performance L40S cluster.

- Data:
+ **Data:**

  For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

@@ -55,6 +55,7 @@ Gradient is accelerating AI transformation across industries. https://gradient.a

  Drop an email to [[email protected]](mailto:[email protected])

+ ----

  # Base Model

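
Note on the Approach bullets in this README: the NTK-aware initialization of RoPE theta can be illustrated with a short sketch. This is not Gradient's code, and it does not cover the data-driven theta optimization the README mentions as the second step. It only shows the commonly cited NTK-aware rule, `theta' = theta * scale ** (d / (d - 2))`. The function names, the base theta of 500000, the head dimension of 128, and the 16x scale factor (8k to roughly 131k) are all assumptions chosen for illustration, not values taken from this commit.

```python
def ntk_aware_rope_theta(base_theta: float, head_dim: int, scale: float) -> float:
    """Return a RoPE base rescaled with the NTK-aware rule.

    theta' = theta * scale ** (d / (d - 2)), where d is the per-head dimension.
    Hypothetical helper for illustration; not part of any library.
    """
    if head_dim <= 2 or head_dim % 2 != 0:
        raise ValueError("head_dim must be an even integer greater than 2")
    if scale < 1.0:
        raise ValueError("scale must be >= 1 (context extension factor)")
    return base_theta * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(theta: float, head_dim: int) -> list[float]:
    """Inverse frequencies theta ** (-2i / d) for i in 0 .. d/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


# Assumed values (not from the commit): base theta, head dim, extension factor.
BASE_THETA = 500_000.0
HEAD_DIM = 128
SCALE = 16.0  # e.g. 8k -> ~131k tokens

new_theta = ntk_aware_rope_theta(BASE_THETA, HEAD_DIM, SCALE)
old_freqs = rope_inv_freq(BASE_THETA, HEAD_DIM)
new_freqs = rope_inv_freq(new_theta, HEAD_DIM)

# The fastest-rotating dimension is untouched, while the slowest one is
# stretched by the full extension factor.
print("highest frequency unchanged:", old_freqs[0] == new_freqs[0])
print("lowest frequency ratio:", old_freqs[-1] / new_freqs[-1])
```

The design point of this rule is that high-frequency dimensions, which encode local token order, keep their resolution, while low-frequency dimensions are interpolated to cover the longer context. In a setup like the one the README describes, a value computed this way would only be a starting point before further tuning.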