Gkunsch committed on
Commit 39afd30 · verified · 1 Parent(s): 9373216

Update Training Data readme

  1. README.md +10 -1
README.md CHANGED
@@ -2,6 +2,8 @@
2
  language:
3
  - multilingual
4
  license: apache-2.0
 
 
5
  ---
6
 
7
  # Model Card for Sindibad-7B
@@ -131,7 +133,14 @@ print(tokenizer.decode(outputs[0]))
131
 
132
  ## Training Data
133
 
134
- Guillaume
 
 
 
 
 
 
 
135
 
136
  ## Training Procedure
137
  Sindibad-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
 
2
  language:
3
  - multilingual
4
  license: apache-2.0
5
+ datasets:
6
+ - tiiuae/falcon-refinedweb
7
  ---
8
 
9
  # Model Card for Sindibad-7B
 
133
 
134
  ## Training Data
135
 
136
+ Falcon-Mamba was trained on ~6,000 GT (gigatokens), mainly from [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-volume, web-only dataset that has been filtered and deduplicated.
137
+ Similar to the other [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba was trained with a multi-stage strategy that increased the training context length from 2,048 to 8,192 tokens.
138
+ Note that at inference the context length is not relevant, as the Mamba architecture has no limit on long-range dependencies.
139
+ In the last training stage, a small portion of high-quality curated data was used to further enhance performance.
140
+
141
+ Overall, the data sources included RefinedWeb-English, Refined-Multilingual (Latin languages), high-quality technical data, code data, and conversational data extracted from public sources.
142
+
143
+ The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
144
 
145
  ## Training Procedure
146
  Sindibad-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
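
As a quick sanity check on the layout quoted above: in a 3D-parallel setup the device count is normally the product of the tensor-, pipeline-, and data-parallel degrees. The sketch below only reproduces that arithmetic with the numbers from the model card; the helper name `world_size` is illustrative and is not taken from the training code.

```python
# Parallelism degrees quoted in the model card (TP=1, PP=1, DP=256).
TP, PP, DP = 1, 1, 256


def world_size(tp: int, pp: int, dp: int) -> int:
    """Number of devices implied by a TP x PP x DP layout (hypothetical helper)."""
    return tp * pp * dp


gpus = world_size(TP, PP, DP)
print(gpus)  # device count implied by the quoted degrees
```

With TP=1 and PP=1, every GPU holds a full model replica, so the layout is effectively pure data parallelism, with ZeRO applied on top as the card states.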