abhik1505040 committed on
Commit 7b056bf · 1 Parent(s): ec878e3

Added initial files

README.md ADDED
@@ -0,0 +1,173 @@
+ ---
+ tags:
+ - summarization
+ datasets:
+ - csebuetnlp/xlsum
+ languages:
+ - am
+ - ar
+ - az
+ - bn
+ - my
+ - zh
+ - en
+ - fr
+ - gu
+ - ha
+ - hi
+ - ig
+ - id
+ - ja
+ - rn
+ - ko
+ - ky
+ - mr
+ - ne
+ - om
+ - ps
+ - fa
+ - pcm
+ - pt
+ - pa
+ - ru
+ - gd
+ - sr
+ - si
+ - so
+ - es
+ - sw
+ - ta
+ - te
+ - th
+ - ti
+ - tr
+ - uk
+ - ur
+ - uz
+ - vi
+ - cy
+ - yo
+ licenses:
+ - cc-by-nc-sa-4.0
+ multilinguality:
+ - multilingual
+ paperswithcode_id: xl-sum
+ ---
+
+ # mT5-multilingual-XLSum
+
+ This repository contains the mT5 checkpoint fine-tuned on the 45 languages of the [XL-Sum](https://huggingface.co/datasets/csebuetnlp/xlsum) dataset. For fine-tuning details and scripts,
+ see the [paper](https://aclanthology.org/2021.findings-acl.413/) and the [official repository](https://github.com/csebuetnlp/xl-sum).
+
+
+ ## Using this model in `transformers` (tested on 4.11.0.dev0)
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ article_text = """Input article text"""
+
+ model_name = "csebuetnlp/mT5_multilingual_XLSum"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+
+ # Tokenize the article, padding or truncating it to 512 tokens.
+ input_ids = tokenizer(
+     [article_text.strip()],
+     return_tensors="pt",
+     padding="max_length",
+     truncation=True,
+     max_length=512
+ )["input_ids"]
+
+ # Generate a summary of at most 84 tokens with beam search.
+ output_ids = model.generate(
+     input_ids=input_ids,
+     max_length=84,
+     no_repeat_ngram_size=2,
+     num_beams=4
+ )[0]
+
+ # Convert the generated token ids back to text.
+ summary = tokenizer.decode(
+     output_ids,
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False
+ )
+ print(summary)
+ ```
+
+ ## Benchmarks
+
+ Scores on test sets are given below.
+
+ Language | ROUGE-1 / ROUGE-2 / ROUGE-L
+ ---------|----------------------------
+ Amharic | 20.0485 / 7.4111 / 18.0753
+ Arabic | 34.9107 / 14.7937 / 29.1623
+ Azerbaijani | 21.4227 / 9.5214 / 19.3331
+ Bengali | 29.5653 / 12.1095 / 25.1315
+ Burmese | 15.9626 / 5.1477 / 14.1819
+ Chinese (Simplified) | 39.4071 / 17.7913 / 33.406
+ Chinese (Traditional) | 37.1866 / 17.1432 / 31.6184
+ English | 37.601 / 15.1536 / 29.8817
+ French | 35.3398 / 16.1739 / 28.2041
+ Gujarati | 21.9619 / 7.7417 / 19.86
+ Hausa | 39.4375 / 17.6786 / 31.6667
+ Hindi | 38.5882 / 16.8802 / 32.0132
+ Igbo | 31.6148 / 10.1605 / 24.5309
+ Indonesian | 37.0049 / 17.0181 / 30.7561
+ Japanese | 48.1544 / 23.8482 / 37.3636
+ Kirundi | 31.9907 / 14.3685 / 25.8305
+ Korean | 23.6745 / 11.4478 / 22.3619
+ Kyrgyz | 18.3751 / 7.9608 / 16.5033
+ Marathi | 22.0141 / 9.5439 / 19.9208
+ Nepali | 26.6547 / 10.2479 / 24.2847
+ Oromo | 18.7025 / 6.1694 / 16.1862
+ Pashto | 38.4743 / 15.5475 / 31.9065
+ Persian | 36.9425 / 16.1934 / 30.0701
+ Pidgin | 37.9574 / 15.1234 / 29.872
+ Portuguese | 37.1676 / 15.9022 / 28.5586
+ Punjabi | 30.6973 / 12.2058 / 25.515
+ Russian | 32.2164 / 13.6386 / 26.1689
+ Scottish Gaelic | 29.0231 / 10.9893 / 22.8814
+ Serbian (Cyrillic) | 23.7841 / 7.9816 / 20.1379
+ Serbian (Latin) | 21.6443 / 6.6573 / 18.2336
+ Sinhala | 27.2901 / 13.3815 / 23.4699
+ Somali | 31.5563 / 11.5818 / 24.2232
+ Spanish | 31.5071 / 11.8767 / 24.0746
+ Swahili | 37.6673 / 17.8534 / 30.9146
+ Tamil | 24.3326 / 11.0553 / 22.0741
+ Telugu | 19.8571 / 7.0337 / 17.6101
+ Thai | 37.3951 / 17.275 / 28.8796
+ Tigrinya | 25.321 / 8.0157 / 21.1729
+ Turkish | 32.9304 / 15.5709 / 29.2622
+ Ukrainian | 23.9908 / 10.1431 / 20.9199
+ Urdu | 39.5579 / 18.3733 / 32.8442
+ Uzbek | 16.8281 / 6.3406 / 15.4055
+ Vietnamese | 32.8826 / 16.2247 / 26.0844
+ Welsh | 32.6599 / 11.596 / 26.1164
+ Yoruba | 31.6595 / 11.6599 / 25.0898
+
+
+
+ ## Citation
+
+ If you use this model, please cite the following paper:
+ ```
+ @inproceedings{hasan-etal-2021-xl,
+     title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
+     author = "Hasan, Tahmid and
+       Bhattacharjee, Abhik and
+       Islam, Md. Saiful and
+       Mubasshir, Kazi and
+       Li, Yuan-Fang and
+       Kang, Yong-Bin and
+       Rahman, M. Sohel and
+       Shahriyar, Rifat",
+     booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
+     month = aug,
+     year = "2021",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2021.findings-acl.413",
+     pages = "4693--4703",
+ }
+ ```
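Each row of the Benchmarks table in the README above packs three ROUGE scores into one cell, separated by ` / `. A minimal sketch of reading such rows into numbers follows; it is not part of the commit, and the helper name `parse_rouge_row` and the three sample rows chosen are illustrative assumptions.

```python
# Hypothetical helper, not part of the repository: parse rows of the
# README's Benchmarks table into numeric ROUGE scores.

# Three rows copied verbatim from the table above.
TABLE_ROWS = """\
Amharic | 20.0485 / 7.4111 / 18.0753
English | 37.601 / 15.1536 / 29.8817
Japanese | 48.1544 / 23.8482 / 37.3636
"""


def parse_rouge_row(row):
    """Split 'Language | R1 / R2 / RL' into (language, (r1, r2, rl))."""
    language, scores = row.split("|")
    r1, r2, rl = (float(s) for s in scores.split("/"))
    return language.strip(), (r1, r2, rl)


scores = dict(
    parse_rouge_row(line) for line in TABLE_ROWS.splitlines() if line.strip()
)

# Language with the highest ROUGE-1 among the sampled rows.
best = max(scores, key=lambda lang: scores[lang][0])
print(best, scores[best])
```

The same function should work on the full table, since language names such as `Chinese (Simplified)` contain neither `|` nor `/`.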
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "_name_or_path": "google/mt5-base",
+   "architectures": [
+     "MT5ForConditionalGeneration"
+   ],
+   "d_ff": 2048,
+   "d_kv": 64,
+   "d_model": 768,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "length_penalty": 0.6,
+   "max_length": 84,
+   "model_type": "mt5",
+   "no_repeat_ngram_size": 2,
+   "num_beams": 4,
+   "num_decoder_layers": 12,
+   "num_heads": 12,
+   "num_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "tokenizer_class": "T5Tokenizer",
+   "use_cache": true,
+   "vocab_size": 250112
+ }
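The config above stores generation settings (`max_length`, `num_beams`, `no_repeat_ngram_size`, `length_penalty`) next to the architecture fields. Three of them match the values passed explicitly to `generate` in the README, while `length_penalty` appears only here. The sketch below, which is not part of the commit, separates those settings from the rest using only the standard library. The names `GENERATION_KEYS` and `generation_defaults` are illustrative assumptions. So is the expectation that `transformers` releases of that period fall back to these config values when the arguments are omitted; verify this against the installed version.

```python
import json

# Subset of the config.json shown above (values copied verbatim).
CONFIG_JSON = """
{
  "d_kv": 64,
  "d_model": 768,
  "length_penalty": 0.6,
  "max_length": 84,
  "no_repeat_ngram_size": 2,
  "num_beams": 4,
  "num_heads": 12
}
"""

# Hypothetical grouping: the keys that control decoding rather than architecture.
GENERATION_KEYS = ("max_length", "num_beams", "no_repeat_ngram_size", "length_penalty")


def generation_defaults(config):
    """Return only the decoding-related entries of a config dict."""
    return {key: config[key] for key in GENERATION_KEYS if key in config}


config = json.loads(CONFIG_JSON)
gen_kwargs = generation_defaults(config)

# Total attention width is num_heads * d_kv, which equals d_model here.
inner_dim = config["num_heads"] * config["d_kv"]

print(gen_kwargs)
print(inner_dim == config["d_model"])
```

Passing the settings explicitly, for example `model.generate(input_ids=input_ids, **gen_kwargs)`, avoids depending on how a given library version handles config defaults.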
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1899a041aceedfd0c9c67e87f2597bc597ce6f4c1f21b5d35a6325322608a898
+ size 2329707353
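The three lines above are a Git LFS pointer: the commit stores the spec version, a SHA-256 digest, and the byte size in place of the weights themselves. The sketch below, which is not part of the commit, parses such a pointer and checks a downloaded file against it. The function names `parse_lfs_pointer` and `matches_pointer` are illustrative assumptions.

```python
import hashlib

# Pointer text copied verbatim from the pytorch_model.bin diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1899a041aceedfd0c9c67e87f2597bc597ce6f4c1f21b5d35a6325322608a898
size 2329707353
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte files are never loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a local copy would then confirm that the download is complete and uncorrupted.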
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+ size 4309802
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 0, "additional_special_tokens": null, "special_tokens_map_file": "/home/patrick/.cache/torch/transformers/685ac0ca8568ec593a48b61b0a3c272beee9bc194a3c7241d15dcadb5f875e53.f76030f3ec1b96a8199b2593390c610e76ca8028ef3d24680000619ffb646276", "tokenizer_file": null, "name_or_path": "google/mt5-base"}