---
license: apache-2.0
library_name: transformers
pipeline_tag: translation
language:
- it
- pt
- de
- en
- es
- eu
- gl
- fr
- bg
- cs
- lt
- hr
- ca
- nl
- ro
- da
- el
- fi
- hu
- sk
- sl
- et
- pl
- lv
- mt
- ga
- sv
- an
- ast
- arn
base_model:
- BSC-LT/salamandra-2b
---

![](./images/salamandra_header.png)

# Salamandra Model Card

SalamandraTA-2B is a machine translation model obtained by continually pre-training Salamandra-2B on 70 billion tokens of parallel data in 30 languages: Catalan, Italian, Portuguese, German, English, Spanish, Basque, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, and Asturian. SalamandraTA-2B is the first model in the **SalamandraTA** series and is trained to handle sentence- and paragraph-level machine translation.

- **Developed by:** The Language Technologies Unit of the Barcelona Supercomputing Center (BSC).
- **Model type:** A 2B-parameter model continually pre-trained on 70 billion tokens.
- **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Basque, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
- **License:** Apache License, Version 2.0

## Model Details

### Description

Model continually pre-trained from Salamandra-2B on 70 billion tokens of highly curated parallel data.

### Hyperparameters

The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).

### Architecture

|                         |               |
|-------------------------|:--------------|
| Total Parameters        | 2,253,490,176 |
| Embedding Parameters    | 524,288,000   |
| Layers                  | 24            |
| Hidden size             | 2,048         |
| Attention heads         | 16            |
| Context length          | 8,192         |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Embedding type          | RoPE          |
| Activation Function     | SwiGLU        |
| Layer normalization     | RMS Norm      |
| Flash attention         | ✅            |
| Grouped Query Attention | ❌            |
| Num. query groups       | N/A           |
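
If you want to double-check these figures against the released checkpoint, the standard `transformers` config can be inspected directly. This is a minimal sketch added for illustration (it is not part of the original card) and assumes the checkpoint exposes the usual causal-LM config attributes.

```python
from transformers import AutoConfig

# Sanity check (assumption): read architecture details from the model config.
config = AutoConfig.from_pretrained('BSC-LT/salamandraTA-2b')

print(config.num_hidden_layers)        # expected: 24
print(config.hidden_size)              # expected: 2048
print(config.num_attention_heads)      # expected: 16
print(config.vocab_size)               # expected: 256000
print(config.max_position_embeddings)  # expected: 8192
```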

---

## Intended Use

### Direct Use

The models are intended for both research and commercial use in any of the languages included in the training data.
The base models are intended for general machine translation tasks.

### Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights.
Any downstream application must comply with current laws and regulations.
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

---

## Hardware and Software

### Training Framework

Continual pre-training was conducted using the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.

### Compute Infrastructure

All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and operated by the Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x Nvidia Hopper GPUs with 64 GB of HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores total)
- 4x NDR200 interconnects (800 Gb/s bandwidth per node)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage

---

## How to use

This section offers examples of how to perform inference using various methods.

### Inference

To run inference on a single sentence using Hugging Face's `AutoModel` classes, you can use the following code.

<details>
<summary>Show code</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'BSC-LT/salamandraTA-2b'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

src_lang_code = 'Spanish'
tgt_lang_code = 'Catalan'
sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'

# The model expects prompts of the form '[Source language] sentence \n[Target language]'.
prompt = f'[{src_lang_code}] {sentence} \n[{tgt_lang_code}]'

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_ids = model.generate(input_ids, max_length=500, num_beams=5)
input_length = input_ids.shape[1]

# Decode only the tokens generated after the prompt.
generated_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True).strip()
print(generated_text)
# Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.
```
</details>

<br>

To run batch inference using Hugging Face's `AutoModel` classes, you can use the following code.

<details>
<summary>Show code</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'BSC-LT/salamandraTA-2b'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation='eager')

# List of sentences to translate
sentences = [
    'Ayer se fue, tomó sus cosas y se puso a navegar.',
    'Se despidió y decidió batirse en duelo con el mar, y recorrer el mundo en su velero',
    'Su corazón buscó una forma diferente de vivir, pero las olas le gritaron: Vete con los demás',
    'Y se durmió y la noche le gritó: Dónde vas, y en sus sueños dibujó gaviotas, y pensó: Hoy debo regresar.'
]

src_lang_code = 'Spanish'
tgt_lang_code = 'Catalan'

# Build one prompt per sentence using the same '[Source] text \n[Target]' format.
prompt = lambda x: f'[{src_lang_code}] {x} \n[{tgt_lang_code}]'
prompts = [prompt(x) for x in sentences]

# Note: if the tokenizer does not define a padding token, set one before batching,
# e.g. tokenizer.pad_token = tokenizer.eos_token
encodings = tokenizer(prompts, return_tensors='pt', padding=True, add_special_tokens=True)

input_ids = encodings['input_ids'].to(model.device)
attention_mask = encodings['attention_mask'].to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, num_beams=5, max_length=256, early_stopping=True)

# Strip the (padded) prompt from each output and decode only the generated continuation.
results_detokenized = []
for i, output in enumerate(outputs):
    input_length = input_ids[i].shape[0]
    generated_text = tokenizer.decode(output[input_length:], skip_special_tokens=True).strip()
    results_detokenized.append(generated_text)

print("Generated Translations:", results_detokenized)

# Generated Translations: ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.", "Es va acomiadar i va decidir batre's en duel amb el mar, i recórrer el món en el seu veler", 'El seu cor va buscar una forma diferent de viure, però les onades li van cridar: Vés amb els altres', 'I es va adormir i la nit li va cridar: On vas, i en els seus somnis va dibuixar gavines, i va pensar: Avui he de tornar.']
```
</details>
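
If you prefer a higher-level interface, the same prompt format should also work with the `transformers` text-generation pipeline. The snippet below is a minimal sketch rather than part of the official examples; the decoding settings simply mirror the ones used above.

```python
from transformers import pipeline

# Minimal sketch (assumption): wrap the model in a text-generation pipeline
# and reuse the '[Source] sentence \n[Target]' prompt format shown above.
translator = pipeline('text-generation', model='BSC-LT/salamandraTA-2b')

prompt = '[Spanish] Ayer se fue, tomó sus cosas y se puso a navegar. \n[Catalan]'
out = translator(prompt, max_new_tokens=128, num_beams=5, return_full_text=False)
print(out[0]['generated_text'].strip())
```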

## Data


## Evaluation

Below are the evaluation results on Flores-200 dev and devtest compared to NLLB-3.3B ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) for CA-XX and XX-CA directions.

#### Flores200-dev

|              | Bleu ↑ | Ter ↓ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ | Bleurt ↑ |
|:-------------|-------:|------:|-------:|--------:|-------------:|---------:|
| **CA-XX**    |        |       |        |         |              |          |
| SalamandraTA | **27.41** | **60.88** | **56.27** | 0.86 | 0.82 | 0.76 |
| nllb 3.3B    | 26.84 | 61.75 | 55.7 | 0.86 | 0.82 | 0.76 |
| **XX-CA**    |        |       |        |         |              |          |
| SalamandraTA | **30.75** | **57.66** | **57.6** | 0.85 | 0.81 | 0.73 |
| nllb 3.3B    | 29.76 | 58.25 | 56.75 | 0.85 | **0.82** | 0.73 |

<details>
<summary>Click to show full table</summary>
</details>

#### Flores200-devtest

|              | Bleu ↑ | Ter ↓ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ | Bleurt ↑ |
|:-------------|-------:|------:|-------:|--------:|-------------:|---------:|
| **CA-XX**    |        |       |        |         |              |          |
| SalamandraTA | **27.09** | **61.06** | **56.41** | 0.86 | 0.81 | 0.75 |
| nllb 3.3B    | 26.7 | 61.74 | 55.85 | 0.86 | **0.82** | **0.76** |
| **XX-CA**    |        |       |        |         |              |          |
| SalamandraTA | **31** | **57.46** | **57.96** | 0.85 | 0.81 | 0.73 |
| nllb 3.3B    | 30.31 | 58.26 | 57.12 | 0.85 | **0.82** | 0.73 |

<details>
<summary>Click to show full table</summary>

</details>
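
For reference, the lexical metrics in these tables (BLEU, ChrF, TER) can be reproduced from the model's translations and the Flores-200 references with a standard scorer. The snippet below is a minimal, hypothetical sketch assuming the `sacrebleu` library (not part of this card's tooling); the neural metrics (Comet, Comet-kiwi, Bleurt) require their own scoring models.

```python
import sacrebleu  # assumption: pip install sacrebleu

# Hypothetical example: system outputs and one reference set (same order).
hypotheses = ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar."]
references = [["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU={bleu.score:.2f}  ChrF={chrf.score:.2f}  TER={ter.score:.2f}")
```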

## Ethical Considerations and Limitations

---

## Additional information

### Author
The Language Technologies Unit of the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.

### Funding
This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

This work has also been funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

### Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners, CENID, HiTZ and CiTIUS, for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

### Disclaimer
Be aware that the model may contain biases or other unintended distortions.
When third parties deploy systems or provide services based on this model, or use the model themselves,
they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations,
including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)