Update Training Data readme
README.md
CHANGED
@@ -2,6 +2,8 @@
 language:
 - multilingual
 license: apache-2.0
+datasets:
+- tiiuae/falcon-refinedweb
 ---
 
 # Model Card for Sindibad-7B
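The new `datasets` entry in the front matter points at the public RefinedWeb corpus referenced below. As a minimal sketch, assuming the Hugging Face `datasets` library and that the text column is named `content` (as listed on the dataset card), the corpus can be streamed without downloading it in full:

```python
# Sketch: stream a few RefinedWeb records rather than downloading the full corpus.
# Assumes the `datasets` library is installed and the text column is named "content".
from datasets import load_dataset

refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, record in enumerate(refinedweb):
    print(record["content"][:200].replace("\n", " "))  # preview the first 200 characters
    if i == 2:  # stop after three documents
        break
```

Streaming keeps the example lightweight; the full dataset is far too large to materialize casually.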
@@ -131,7 +133,14 @@ print(tokenizer.decode(outputs[0]))
 
 ## Training Data
 
-
+Falcon-Mamba was trained on ~6,000 GT (gigatokens), mainly drawn from [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-volume web-only dataset that was filtered and deduplicated.
+Like the other models in the [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite, Falcon-Mamba was trained with a multi-stage strategy that increased the training context length from 2,048 up to 8,192 tokens.
+Note that the context length is not a hard limit at inference time, since the Mamba architecture places no constraint on long-range dependencies.
+In the last training stage, a small portion of high-quality curated data was used to further enhance performance.
+
+Overall, the data sources included RefinedWeb-English, Refined-Multilingual (Latin languages), high-quality technical data, code data, and conversational data extracted from public sources.
+
+The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
 
 ## Training Procedure
 Sindibad-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
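The Training Data section above states that the data was tokenized with the Falcon-7B/11B tokenizer. A minimal sketch of loading and inspecting that tokenizer with `transformers` (the repo id `tiiuae/falcon-7b` is an assumption based on the links in the card):

```python
# Sketch: load the Falcon tokenizer referenced in the Training Data section
# and count the tokens of a sample string. The repo id is assumed, not confirmed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

sample = "Falcon-Mamba was trained on refined web data."
ids = tokenizer(sample)["input_ids"]
print(len(ids), "tokens:", tokenizer.convert_ids_to_tokens(ids))
```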
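The Training Procedure line describes the layout as TP=1, PP=1, DP=256 combined with ZeRO, i.e. pure data parallelism across all 256 GPUs with ZeRO sharding the optimizer state. The card names neither the training framework nor the ZeRO stage, so the following is only a hypothetical DeepSpeed-style configuration illustrating that layout, not the authors' actual setup:

```python
# Hypothetical sketch: a DeepSpeed-style config for pure data parallelism with ZeRO.
# The ZeRO stage, batch size, and precision are illustrative guesses, not from the card.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # per-GPU micro-batch size (not stated in the card)
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},             # H100s are commonly run in bfloat16
    "zero_optimization": {"stage": 1},     # shard optimizer state across the data-parallel ranks
}

# With TP = PP = 1, the data-parallel degree equals the world size of 256 GPUs.
tensor_parallel, pipeline_parallel, world_size = 1, 1, 256
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
assert data_parallel == 256
```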