JingweiZuo committed
Update README.md

README.md CHANGED
@@ -30,9 +30,6 @@ license: apache-2.0
 - **Language(s) (NLP):** Mainly English
 - **License:** TII Falcon-Mamba License 2.0
 
-### Model Source
-
-- **Paper:** *coming soon*.
 
 # Usage
 
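The Usage section itself is left unchanged by this commit; only its heading appears above as context, and the next hunk header quotes its final line `print(tokenizer.decode(outputs[0]))`. For readers landing on this diff, a minimal generation sketch in that spirit could look like the following. The repo id `tiiuae/falcon-mamba-7b`, the prompt, and the generation settings are assumptions for illustration, not values taken from this diff.

```python
# Minimal, illustrative generation sketch; repo id and settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"  # assumed Hugging Face repo id for Falcon-Mamba-7B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Question: What is a state-space model?\nAnswer:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```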
@@ -150,13 +147,13 @@ print(tokenizer.decode(outputs[0]))
 
 </details>
 
-
+<br>
 
 # Training Details
 
 ## Training Data
 
-Falcon-Mamba has been trained with ~
+Falcon-Mamba has been trained on ~5,500 GT (gigatokens), mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-volume, filtered and deduplicated web-only dataset.
 Similar to the other models in the [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite, Falcon-Mamba was trained with a multi-stage strategy that increases the training context length from 2,048 up to 8,192 tokens.
 Note that at inference time the context length is not relevant, as the Mamba architecture has no limit on long-range dependencies.
 At the last training stage, a small portion of high-quality curated data was used to further enhance performance.
@@ -169,7 +166,7 @@ The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon
 ## Training Procedure
 Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
 
-
+### Training Hyperparameters
 
 | **Hyperparameter** | **Value** | **Comment** |
 |--------------------|------------|-------------------------------------------|
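For readers unfamiliar with the notation in the hunk above, TP=1 and PP=1 mean the layout is pure data parallelism across 256 ranks, with ZeRO sharding optimizer state over them. The snippet below is only an illustrative sketch of such a layout in DeepSpeed-style terms; the ZeRO stage, batch sizes, and precision are assumptions, since the card (and this diff) only says "ZeRO".

```python
# Illustrative sketch only; not the authors' actual training configuration.
TP, PP, DP = 1, 1, 256        # tensor-, pipeline-, data-parallel degrees from the card
world_size = TP * PP * DP     # = 256, matching the 256 H100 80GB GPUs

# A DeepSpeed-style config expressing "plain data parallelism + ZeRO"; every value is assumed.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # hypothetical
    "gradient_accumulation_steps": 8,      # hypothetical
    "zero_optimization": {"stage": 1},     # stage not specified in the card; assumed
    "bf16": {"enabled": True},             # assumed precision
}
```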
@@ -184,7 +181,7 @@ The model was trained AdamW optimizer, WSD (warmup-stable-decay) learning rate s
 In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\) and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT.
 Also, we applied *BatchScaling* during the ramp-up, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
 
-
+### Speeds, Sizes, Times
 
 The model training took roughly two months.
 
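To make the schedule in the last hunk concrete, here is a small sketch of one way the decay and *BatchScaling* rules could be written down. The exponential form below is one plausible parameterization consistent with the stated endpoints \\(\eta_{\mathrm{max}}\\) and \\(\eta_{\mathrm{max}}/256\\) and the 500 GT decay span; the exact schedule used in training and any reference batch size are not given here, so treat this as an assumption-laden illustration.

```python
import math

ETA_MAX = 6.4e-4           # maximal learning rate from the card
ETA_MIN = ETA_MAX / 256    # minimal learning rate from the card
DECAY_SPAN_GT = 500.0      # length of the decay phase, in gigatokens (GT)

def decay_lr(gt_into_decay: float) -> float:
    """Exponential interpolation from ETA_MAX down to ETA_MIN over DECAY_SPAN_GT.

    One plausible reading of "exponential schedule over 500 GT"; the exact form
    is not specified in the card beyond its endpoints and span.
    """
    frac = min(max(gt_into_decay / DECAY_SPAN_GT, 0.0), 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** frac

def batch_scaled_lr(eta_ref: float, batch_ref: int, batch_new: int) -> float:
    """BatchScaling rule: keep the Adam noise temperature eta / sqrt(b) constant.

    Requiring eta_ref / sqrt(batch_ref) == eta_new / sqrt(batch_new) gives the rescaling below.
    """
    return eta_ref * math.sqrt(batch_new / batch_ref)

# Example with hypothetical batch sizes: quadrupling b from 128 to 512 sequences
# doubles the learning rate, keeping eta / sqrt(b) fixed.
assert abs(batch_scaled_lr(3.2e-4, 128, 512) - 6.4e-4) < 1e-12
```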