Davlan committed on
Commit
3dacb9a
·
1 Parent(s): 1f13bbe

adding distil-bert-multilingual-masakhaner

README.md ADDED
@@ -0,0 +1,80 @@
+ Hugging Face's logo
+ ---
+ language:
+ - ha
+ - ig
+ - rw
+ - lg
+ - luo
+ - pcm
+ - sw
+ - wo
+ - yo
+ - multilingual
+
+
+ datasets:
+ - masakhaner
+ ---
+ # distilbert-base-multilingual-cased-masakhaner
+ ## Model description
+ **distilbert-base-multilingual-cased-masakhaner** is the first **Named Entity Recognition** model for 9 African languages (Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) based on a fine-tuned DistilBERT base model. It has been trained to recognize four types of entities: dates and times (DATE), locations (LOC), organizations (ORG), and persons (PER).
+ Specifically, this model is a *distilbert-base-multilingual-cased* model that was fine-tuned on an aggregation of African language datasets obtained from the Masakhane [MasakhaNER](https://github.com/masakhane-io/masakhane-ner) dataset.
+ ## Intended uses & limitations
+ #### How to use
+ You can use this model with the Transformers *pipeline* for NER.
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ from transformers import pipeline
+ tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-masakhaner")
+ model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-masakhaner")
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+ example = "Emir of Kano turban Zhang wey don spend 18 years for Nigeria"
+ ner_results = nlp(example)
+ print(ner_results)
+ ```
+ #### Limitations and bias
+ This model is limited by its training dataset of entity-annotated news articles from a specific span of time. It may not generalize well to use cases in other domains.
+ ## Training data
+ This model was fine-tuned on 9 African NER datasets (Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) from the Masakhane [MasakhaNER](https://github.com/masakhane-io/masakhane-ner) dataset.
+
+ The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
+ Abbreviation|Description
+ -|-
+ O|Outside of a named entity
+ B-DATE |Beginning of a DATE entity right after another DATE entity
+ I-DATE |DATE entity
+ B-PER |Beginning of a person’s name right after another person’s name
+ I-PER |Person’s name
+ B-ORG |Beginning of an organisation right after another organisation
+ I-ORG |Organisation
+ B-LOC |Beginning of a location right after another location
+ I-LOC |Location
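To make the tagging scheme in the table concrete, here is a minimal sketch of how per-token labels could be merged into entity spans. It is not part of the committed README: the function name `group_entities` is hypothetical, and the example tokens and tags are hand-written for illustration, not actual model output. The sketch accepts an entity that opens with either a `B-` or an `I-` tag, and starts a new span whenever a `B-` tag or a change of entity type appears.

```python
def group_entities(tokens, tags):
    """Merge per-token tags (O, B-XXX, I-XXX) into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A span continues only on an I- tag whose type matches the open span.
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            # Close the open span (if any) before starting something new.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
            if tag != "O":
                current_type = entity_type
        if tag != "O":
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written illustration: two back-to-back locations, then a third one.
tokens = ["Kano", "Lagos", "for", "Nigeria"]
tags = ["I-LOC", "B-LOC", "O", "I-LOC"]
print(group_entities(tokens, tags))
```

The `B-LOC` tag on the second token is what separates the two adjacent locations; without it they would be merged into a single span.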
+ ## Training procedure
+ This model was trained on a single NVIDIA V100 GPU with the hyperparameters recommended in the [original MasakhaNER paper](https://arxiv.org/abs/2103.11811), which trained and evaluated the model on the MasakhaNER corpus.
+ ## Eval results on Test set (F-score)
+ language|F1-score
+ -|-
+ hau |88.88
+ ibo |84.87
+ kin |74.19
+ lug |78.43
+ luo |73.32
+ pcm |87.98
+ swa |86.20
+ wol |64.67
+ yor |78.10
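For a quick summary of the table above, the following sketch computes an unweighted mean of the per-language F1 scores and picks out the strongest and weakest languages. The mean is not reported in the README; it is derived here only from the listed values and ignores differences in test-set size, so treat it as a rough indicator.

```python
# Per-language test F1 scores copied from the table above.
f1_by_language = {
    "hau": 88.88,
    "ibo": 84.87,
    "kin": 74.19,
    "lug": 78.43,
    "luo": 73.32,
    "pcm": 87.98,
    "swa": 86.20,
    "wol": 64.67,
    "yor": 78.10,
}

# Unweighted (macro) mean across the nine languages.
mean_f1 = sum(f1_by_language.values()) / len(f1_by_language)
best = max(f1_by_language, key=f1_by_language.get)
worst = min(f1_by_language, key=f1_by_language.get)

print(f"mean F1: {mean_f1:.2f}")
print(f"highest: {best} ({f1_by_language[best]})")
print(f"lowest:  {worst} ({f1_by_language[worst]})")
```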
+
+ ### BibTeX entry and citation info
+ ```
+ @article{adelani21tacl,
+ title = {Masakha{NER}: Named Entity Recognition for African Languages},
+ author = {David Ifeoluwa Adelani and Jade Abbott and Graham Neubig and Daniel D'souza and Julia Kreutzer and Constantine Lignos and Chester Palen-Michel and Happy Buzaaba and Shruti Rijhwani and Sebastian Ruder and Stephen Mayhew and Israel Abebe Azime and Shamsuddeen Muhammad and Chris Chinenye Emezue and Joyce Nakatumba-Nabende and Perez Ogayo and Anuoluwapo Aremu and Catherine Gitau and Derguene Mbaye and Jesujoba Alabi and Seid Muhie Yimam and Tajuddeen Gwadabe and Ignatius Ezeani and Rubungo Andre Niyongabo and Jonathan Mukiibi and Verrah Otiende and Iroro Orife and Davis David and Samba Ngom and Tosin Adewumi and Paul Rayson and Mofetoluwa Adeyemi and Gerald Muriuki and Emmanuel Anebi and Chiamaka Chukwuneke and Nkiruka Odu and Eric Peter Wairagala and Samuel Oyerinde and Clemencia Siro and Tobius Saul Bateesa and Temilola Oloyede and Yvonne Wambui and Victor Akinode and Deborah Nabagereka and Maurice Katusiime and Ayodele Awokoya and Mouhamadane MBOUP and Dibora Gebreyohannes and Henok Tilaye and Kelechi Nwaike and Degaga Wolde and Abdoulaye Faye and Blessing Sibanda and Orevaoghene Ahia and Bonaventure F. P. Dossou and Kelechi Ogueji and Thierno Ibrahima DIOP and Abdoulaye Diallo and Adewale Akinfaderin and Tendai Marengereke and Salomey Osei},
+ journal = {Transactions of the Association for Computational Linguistics (TACL)},
+ month = {},
+ url = {https://arxiv.org/abs/2103.11811},
+ year = {2021}
+ }
+ ```
+
+
config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "_name_or_path": "distilbert-base-multilingual-cased",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForTokenClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "O",
+     "1": "B-DATE",
+     "2": "I-DATE",
+     "3": "B-PER",
+     "4": "I-PER",
+     "5": "B-ORG",
+     "6": "I-ORG",
+     "7": "B-LOC",
+     "8": "I-LOC"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "B-DATE": 1,
+     "B-LOC": 7,
+     "B-ORG": 5,
+     "B-PER": 3,
+     "I-DATE": 2,
+     "I-LOC": 8,
+     "I-ORG": 6,
+     "I-PER": 4,
+     "O": 0
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "output_past": true,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "vocab_size": 119547
+ }
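The `id2label` and `label2id` entries in the config above are what turn the classifier's output indices into tag names. As a small sketch (not part of the commit), the snippet below loads just those two mappings from an inline JSON string, checks that they are inverses of each other, and lists the entity types. The variable names are illustrative; in practice the values would be read from the downloaded `config.json`.

```python
import json

# Only the two label mappings, copied from config.json above.
config_fragment = """
{
  "id2label": {"0": "O", "1": "B-DATE", "2": "I-DATE", "3": "B-PER", "4": "I-PER",
               "5": "B-ORG", "6": "I-ORG", "7": "B-LOC", "8": "I-LOC"},
  "label2id": {"B-DATE": 1, "B-LOC": 7, "B-ORG": 5, "B-PER": 3, "I-DATE": 2,
               "I-LOC": 8, "I-ORG": 6, "I-PER": 4, "O": 0}
}
"""

config = json.loads(config_fragment)

# JSON object keys are always strings, so convert the class indices to int.
id2label = {int(index): label for index, label in config["id2label"].items()}
label2id = config["label2id"]

# The two mappings should describe the same index <-> label pairing.
is_inverse = all(label2id[label] == index for index, label in id2label.items())

# Strip the B-/I- prefix to recover the distinct entity types.
entity_types = sorted({label.split("-", 1)[1] for label in label2id if label != "O"})

print(is_inverse, entity_types)
```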
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f37f6133a103503c6f1c6bd6033be48e2004973707f07d1577bc5902a114f4d4
+ size 539006405
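The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. The sketch below parses that pointer text into a dictionary and converts the size to MiB. The helper name `parse_lfs_pointer` is hypothetical and the parser assumes the simple `key value` layout shown in this diff.

```python
# Pointer contents copied from the pytorch_model.bin diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:f37f6133a103503c6f1c6bd6033be48e2004973707f07d1577bc5902a114f4d4
size 539006405
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(pointer_text)
size_mib = pointer["size"] / 1024**2
print(pointer["hash_algorithm"], f"{size_mib:.1f} MiB")
```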
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "name_or_path": "distilbert-base-multilingual-cased"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:70dcd6144bdd33c6a9050de64ab7b1bee52ee0573ae42bef3ff212b357000b2d
+ size 1519
vocab.txt ADDED
The diff for this file is too large to render. See raw diff