javi8979 committed on
Commit
a3cab29
·
verified ·
1 Parent(s): d990815

Update README.md

  1. README.md +290 -3
README.md CHANGED
@@ -1,3 +1,290 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: translation
5
+ language:
6
+ - it
7
+ - pt
8
+ - de
9
+ - en
10
+ - es
11
+ - eu
12
+ - gl
13
+ - fr
14
+ - bg
15
+ - cs
16
+ - lt
17
+ - hr
18
+ - ca
19
+ - nl
20
+ - ro
21
+ - da
22
+ - el
23
+ - fi
24
+ - hu
25
+ - sk
26
+ - sl
27
+ - et
28
+ - pl
29
+ - lv
30
+ - mt
31
+ - ga
32
+ - sv
33
+ - an
34
+ - ast
35
+ - arn
36
+ base_model:
37
+ - BSC-LT/salamandra-2b
38
+ ---
39
+
40
+ ![](./images/salamandra_header.png)
41
+
42
+ # Salamandra Model Card
43
+
44
+
45
+ SalamandraTA-2B is a machine translation model that has been continually pre-trained from Salamandra-2B on 70 billion tokens of parallel data in 30 different languages: Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian. SalamandraTA-2B is the first model in the **SalamandraTA** series and is trained to handle sentence- and paragraph-level machine translation.
46
+
47
+ - **Developed by:** The Language Technologies Unit from Barcelona Supercomputing Center (BSC).
48
+ - **Model type:** A 2B parameter model continually pre-trained on 70 billion tokens.
49
+ - **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
50
+ - **License:** Apache License, Version 2.0
51
+
52
+
53
+ ## Model Details
54
+
55
+ ### Description
56
+
57
+ SalamandraTA-2B was continually pre-trained from Salamandra-2B on 70 billion tokens of highly curated parallel data.
58
+
59
+ ### Hyperparameters
60
+
61
+ The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).
62
+
63
+ ### Architecture
64
+
65
+ | | |
66
+ |-------------------------|:--------------|
67
+ | Total Parameters | 2,253,490,176 |
68
+ | Embedding Parameters | 524,288,000 |
69
+ | Layers | 24 |
70
+ | Hidden size | 2,048 |
71
+ | Attention heads | 16 |
72
+ | Context length | 8,192 |
73
+ | Vocabulary size | 256,000 |
74
+ | Precision | bfloat16 |
75
+ | Embedding type | RoPE |
76
+ | Activation Function | SwiGLU |
77
+ | Layer normalization | RMS Norm |
78
+ | Flash attention | ✅ |
79
+ | Grouped Query Attention | ❌ |
80
+ | Num. query groups | N/A |
81
+
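The figures in the table can be cross-checked with a little arithmetic. The sketch below only uses numbers from the table; the variable names are illustrative, and the reading that "Embedding Parameters" counts a single vocabulary-by-hidden-size matrix is an assumption that happens to match the reported value.

```python
# Values copied from the architecture table above.
total_params = 2_253_490_176
embedding_params = 524_288_000
hidden_size = 2_048
attention_heads = 16
vocab_size = 256_000

# Assumption: the embedding count is one vocab_size x hidden_size matrix.
computed_embedding = vocab_size * hidden_size

# Parameters left for the transformer layers (and any other weights).
non_embedding_params = total_params - embedding_params

# Per-head dimension implied by the hidden size and the number of heads.
head_dim = hidden_size // attention_heads

print(computed_embedding == embedding_params)
print(non_embedding_params)
print(head_dim)
```

Roughly a quarter of the parameters sit in the embedding table, which is a direct consequence of the large 256,000-token multilingual vocabulary.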
82
+ ---
83
+
84
+ ## Intended Use
85
+
86
+ ### Direct Use
87
+
88
+ The models are intended for both research and commercial use in any of the languages included in the training data.
89
+ The model is intended for general machine translation tasks.
90
+
91
+ ### Out-of-scope Use
92
+
93
+ The model is not intended for malicious activities, such as harming others or violating human rights.
94
+ Any downstream application must comply with current laws and regulations.
95
+ Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
96
+
97
+ ---
98
+
99
+ ## Hardware and Software
100
+
101
+ ### Training Framework
102
+
103
+ Pre-training was conducted using the [LLaMA-Factory framework](https://github.com/hiyouga/LLaMA-Factory).
104
+
105
+ ### Compute Infrastructure
106
+
107
+ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.
108
+
109
+ The accelerated partition is composed of 1,120 nodes with the following specifications:
110
+ - 4x Nvidia Hopper GPUs with 64 GB of HBM2 memory
111
+ - 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz and 32c each (64 cores)
112
+ - 4x NDR200 (BW per node 800 Gb/s)
113
+ - 512 GB of Main memory (DDR5)
114
+ - 460 GB of NVMe storage
115
+
116
+ ---
117
+
118
+ ## How to use
119
+ This section offers examples of how to perform inference using various methods.
120
+
121
+ ### Inference
122
+
123
+ To run inference on a single sentence with Hugging Face's `AutoModelForCausalLM` class, you can use the following code.
124
+
125
+ <details>
126
+ <summary>Show code</summary>
127
+
128
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = 'BSC-LT/salamandraTA-2b'
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ # Languages are given by their English names
+ src_lang_code = 'Spanish'
+ tgt_lang_code = 'Catalan'
+ sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'
+
+ # The source text follows the source tag; the model continues after the target tag
+ prompt = f'[{src_lang_code}] {sentence} \n[{tgt_lang_code}]'
+
+ input_ids = tokenizer(prompt, return_tensors='pt').input_ids
+ output_ids = model.generate(input_ids, max_length=500, num_beams=5)
+ input_length = input_ids.shape[1]
+
+ # Decode only the newly generated tokens, skipping the prompt
+ generated_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True).strip()
+ print(generated_text)
+ # Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.
+ ```
149
+ </details>
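The prompt format used in the example above can be wrapped in a small helper so that the same template is reused across calls. This is a minimal sketch: `build_prompt` and `build_prompts` are hypothetical names, not part of the model's API, and the only thing taken from the model card is the `[Source] text \n[Target]` template with English language names.

```python
def build_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    # Same template as the model card example: source tag, text,
    # a space and newline, then the target tag for the model to continue.
    return f'[{src_lang}] {text} \n[{tgt_lang}]'


def build_prompts(src_lang: str, tgt_lang: str, texts: list[str]) -> list[str]:
    # Apply the template to every sentence in a batch.
    return [build_prompt(src_lang, tgt_lang, t) for t in texts]


single = build_prompt('Spanish', 'Catalan', 'Ayer se fue, tomó sus cosas y se puso a navegar.')
batch = build_prompts('Spanish', 'Catalan', ['Hola', 'Adiós'])
print(repr(single))
print(batch)
```

The resulting strings can be passed to the tokenizer exactly as `prompt` is in the example above.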
150
+
151
+ <br>
152
+
153
+
154
+ To run batch inference with Hugging Face's `AutoModelForCausalLM` class, you can use the following code.
155
+
156
+ <details>
157
+ <summary>Show code</summary>
158
+
159
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = 'BSC-LT/salamandraTA-2b'
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation='eager')
+
+ # List of sentences to translate
+ sentences = [
+     'Ayer se fue, tomó sus cosas y se puso a navegar.',
+     'Se despidió y decidió batirse en duelo con el mar, y recorrer el mundo en su velero',
+     'Su corazón buscó una forma diferente de vivir, pero las olas le gritaron: Vete con los demás',
+     'Y se durmió y la noche le gritó: Dónde vas, y en sus sueños dibujó gaviotas, y pensó: Hoy debo regresar.'
+ ]
+
+ src_lang_code = 'Spanish'
+ tgt_lang_code = 'Catalan'
+
+ # Build one prompt per sentence using the same template as the single-sentence example
+ prompts = [f'[{src_lang_code}] {x} \n[{tgt_lang_code}]' for x in sentences]
+
+ # Pad the batch so that all prompts share the same length
+ encodings = tokenizer(prompts, return_tensors='pt', padding=True, add_special_tokens=True)
+
+ input_ids = encodings['input_ids'].to(model.device)
+ attention_mask = encodings['attention_mask'].to(model.device)
+
+ with torch.no_grad():
+     outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, num_beams=5, max_length=256, early_stopping=True)
+
+ # Decode only the tokens generated after each (padded) prompt
+ results_detokenized = []
+ for i, output in enumerate(outputs):
+     input_length = input_ids[i].shape[0]
+     generated_text = tokenizer.decode(output[input_length:], skip_special_tokens=True).strip()
+     results_detokenized.append(generated_text)
+
+ print("Generated Translations:", results_detokenized)
+
+ # Generated Translations: ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.", "Es va acomiadar i va decidir batre's en duel amb el mar, i recórrer el món en el seu veler", 'El seu cor va buscar una forma diferent de viure, però les onades li van cridar: Vés amb els altres', 'I es va adormir i la nit li va cridar: On vas, i en els seus somnis va dibuixar gavines, i va pensar: Avui he de tornar.']
+ ```
202
+ </details>
203
+
204
+ ## Data
205
+
206
+
207
+
208
+
209
+ ## Evaluation
210
+
211
+ Below are the evaluation results on Flores-200 dev and devtest compared to NLLB-3.3B ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) for the CA-XX and XX-CA directions.
212
+
213
+
214
+ #### Flores200-dev
215
+
216
+ | | Bleu ↑ | Ter ↓ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ | Bleurt ↑ |
217
+ |:-----------------------|-------:|------:|-------:|--------:|-------------:|---------:|
218
+ | **CA-XX** | | | | | | |
219
+ | SalamandraTA | **27.41** | **60.88** | **56.27** | 0.86 | 0.82 | 0.76 |
220
+ | nllb 3.3B | 26.84 | 61.75 | 55.7 | 0.86 | 0.82 | 0.76 |
221
+ | **XX-CA** | | | | | | |
222
+ | SalamandraTA | **30.75** | **57.66** | **57.6** | 0.85 | 0.81 | 0.73 |
223
+ | nllb 3.3B | 29.76 | 58.25 | 56.75 | 0.85 | **0.82** | 0.73 |
224
+
225
+
226
+ <details>
227
+ <summary>Click to show full table</summary>
228
+ </details>
229
+
230
+ #### Flores200-devtest
231
+
232
+ | | Bleu ↑ | Ter ↓ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ | Bleurt ↑ |
233
+ |:-----------------------|-------:|------:|-------:|--------:|-------------:|---------:|
234
+ | **CA-XX** | | | | | | |
235
+ | SalamandraTA | **27.09** | **61.06** | **56.41** | 0.86 | 0.81 | 0.75 |
236
+ | nllb 3.3B | 26.7 | 61.74 | 55.85 | 0.86 | **0.82** | **0.76** |
237
+ | **XX-CA** | | | | | | |
238
+ | SalamandraTA | **31** | **57.46** | **57.96** | 0.85 | 0.81 | 0.73 |
239
+ | nllb 3.3B | 30.31 | 58.26 | 57.12 | 0.85 | **0.82** | 0.73 |
240
+ <details>
241
+ <summary>Click to show full table</summary>
242
+
243
+ </details>
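To make the gaps between the two systems easier to read, the differences can be computed directly from the summary tables above. The sketch below copies the BLEU, TER and ChrF values from those tables; the `scores` layout and the `deltas` helper are illustrative only. Positive BLEU and ChrF differences and negative TER differences favour SalamandraTA.

```python
# (BLEU, TER, ChrF) copied from the Flores-200 summary tables above.
scores = {
    ("dev", "CA-XX"): {"SalamandraTA": (27.41, 60.88, 56.27), "NLLB 3.3B": (26.84, 61.75, 55.7)},
    ("dev", "XX-CA"): {"SalamandraTA": (30.75, 57.66, 57.6), "NLLB 3.3B": (29.76, 58.25, 56.75)},
    ("devtest", "CA-XX"): {"SalamandraTA": (27.09, 61.06, 56.41), "NLLB 3.3B": (26.7, 61.74, 55.85)},
    ("devtest", "XX-CA"): {"SalamandraTA": (31.0, 57.46, 57.96), "NLLB 3.3B": (30.31, 58.26, 57.12)},
}


def deltas(split: str, direction: str) -> tuple:
    # SalamandraTA minus NLLB 3.3B for each metric, rounded to two decimals.
    ours = scores[(split, direction)]["SalamandraTA"]
    base = scores[(split, direction)]["NLLB 3.3B"]
    return tuple(round(o - b, 2) for o, b in zip(ours, base))


for key in scores:
    print(key, deltas(*key))
```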
244
+
245
+
246
+
247
+ ## Ethical Considerations and Limitations
248
+
249
+ ---
250
+
251
+ ## Additional information
252
+
253
+ ### Author
254
+ The Language Technologies Unit from Barcelona Supercomputing Center.
255
+
256
+ ### Contact
257
+ For further information, please send an email to <[email protected]>.
258
+
259
+ ### Copyright
260
+ Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.
261
+
262
+ ### Funding
263
+ This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).
264
+
265
+ This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
266
+ within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
267
+
268
+ ### Acknowledgements
269
+
270
+
271
+ This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.
272
+
273
+ In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
274
+
275
+ At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
276
+
277
+ At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
278
+
279
+ Their valuable efforts have been instrumental in the development of this work.
280
+
281
+ ### Disclaimer
282
+ Be aware that the model may contain biases or other unintended distortions.
283
+ When third parties deploy systems or provide services based on this model, or use the model themselves,
284
+ they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations,
285
+ including those governing the use of Artificial Intelligence.
286
+
287
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
288
+
289
+ ### License
290
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)