Text Generation
Transformers
Safetensors
English
falcon_mamba
Eval Results
Inference Endpoints
JingweiZuo commited on
Commit
89fee4a
·
verified ·
1 Parent(s): 1afab62

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +4 -7
README.md CHANGED
@@ -30,9 +30,6 @@ license: apache-2.0
30
  - **Language(s) (NLP):** Mainly English
31
  - **License:** TII Falcon-Mamba License 2.0
32
 
33
- ### Model Source
34
-
35
- - **Paper:** *coming soon*.
36
 
37
  # Usage
38
 
@@ -150,13 +147,13 @@ print(tokenizer.decode(outputs[0]))
150
 
151
  </details>
152
 
153
-
154
 
155
  # Training Details
156
 
157
  ## Training Data
158
 
159
- Falcon-Mamba has been trained with ~ 6,000 GT mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large volume web-only dataset filtered and deduplicated.
160
  Similar to the others [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the context-length training from 2,048 up to 8,192.
161
  Note that at inference the context-length is not relevant as the Mamba architecture has no limit on long range dependency.
162
  At the last training stage, small portion of high-quality curated data was used to further enhance performance.
@@ -169,7 +166,7 @@ The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon
169
  ## Training Procedure
170
  Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
171
 
172
- #### Training Hyperparameters
173
 
174
  | **Hyperparameter** | **Value** | **Comment** |
175
  |--------------------|------------|-------------------------------------------|
@@ -184,7 +181,7 @@ The model was trained AdamW optimizer, WSD (warmup-stable-decay) learning rate s
184
  In the stable phase we used maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with exponential schedule over 500 GT.
185
  Also, we applied *BatchScaling* during the rampup — rescaling learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
186
 
187
- #### Speeds, Sizes, Times
188
 
189
  The model training took roughly two months.
190
 
 
30
  - **Language(s) (NLP):** Mainly English
31
  - **License:** TII Falcon-Mamba License 2.0
32
 
 
 
 
33
 
34
  # Usage
35
 
 
147
 
148
  </details>
149
 
150
+ <br>
151
 
152
  # Training Details
153
 
154
  ## Training Data
155
 
156
+ Falcon-Mamba has been trained with ~ 5,500 GT mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large volume web-only dataset filtered and deduplicated.
157
  Similar to the others [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the context-length training from 2,048 up to 8,192.
158
  Note that at inference the context-length is not relevant as the Mamba architecture has no limit on long range dependency.
159
  At the last training stage, small portion of high-quality curated data was used to further enhance performance.
 
166
  ## Training Procedure
167
  Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
168
 
169
+ ### Training Hyperparameters
170
 
171
  | **Hyperparameter** | **Value** | **Comment** |
172
  |--------------------|------------|-------------------------------------------|
 
181
  In the stable phase we used maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with exponential schedule over 500 GT.
182
  Also, we applied *BatchScaling* during the rampup — rescaling learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
183
 
184
+ ### Speeds, Sizes, Times
185
 
186
  The model training took roughly two months.
187