---
license: apache-2.0
tags:
- token-classification
datasets:
- conll2003
- conllpp
language:
- en
metrics:
- f1: 92.85
- f1(valid): 96.71
- f1(CoNLLpp(2023)): 92.35
- f1(CoNLLpp(CrossWeigh)): 94.26
---

# Roberta-Base-CoNLL2003

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) on the conll2003 dataset.

## Model Usage

We built our tokenizer with [BPE-Dropout](https://aclanthology.org/2020.acl-main.170/), so it cannot be loaded with `AutoTokenizer`. However, as long as subword regularization is not used, the standard `RobertaTokenizer` can be substituted for it.
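
For intuition, here is a minimal sketch of BPE-dropout using the Hugging Face `tokenizers` library. This only illustrates the technique; it is not the tokenizer actually shipped with this model:

```python
# Minimal BPE-dropout sketch (illustrative; not this model's actual tokenizer).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a toy BPE vocabulary, then enable merge dropout at encoding time.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["a tiny toy corpus for illustration"] * 100, trainer)

# With dropout > 0, some BPE merges are randomly skipped, so repeated
# encodings of the same word can yield different subword segmentations.
for _ in range(3):
    print(tokenizer.encode("illustration").tokens)
```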

Example and tokenizer repository: [github](https://github.com/4ldk/CoNLL2003_Choices)

```python
from transformers import RobertaTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the standard RobertaTokenizer (see the note above) and the fine-tuned model.
tokenizer = RobertaTokenizer.from_pretrained("4ldk/Roberta-Base-CoNLL2003")
model = AutoModelForTokenClassification.from_pretrained("4ldk/Roberta-Base-CoNLL2003")

# Group subword predictions into whole-entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
example = "My name is Philipp and I live in Germany"

nlp(example)
```
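
For the example above, the pipeline should return one `PER` group for "Philipp" and one `LOC` group for "Germany". Note that recent Transformers versions deprecate `grouped_entities=True` in favour of `aggregation_strategy="simple"`, which groups entities the same way.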

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-5
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: AdamW with betas=(0.9, 0.999), epsilon=1e-08, and weight_decay=0.01
- lr_scheduler_type: linear with warmup rate = 0.1
- num_epochs: 20
- subword regularization p = 0.0 (i.e., trained without subword regularization)

In addition, we append to each input sentence the sentences that follow it in the original dataset, so our training data cannot be reproduced from the dataset published on Hugging Face. For details, see [our github repository](https://github.com/4ldk/CoNLL2003_Choices/blob/develop/src/utils.py). A rough sketch of the optimizer and scheduler configuration is shown below.
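
As a sketch only (this is not the actual training script, and `num_training_steps` is a placeholder), the hyperparameters above correspond to:

```python
import torch
from transformers import AutoModelForTokenClassification, get_linear_schedule_with_warmup

# Placeholder model and step count; real values come from the training script.
model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=9)  # 9 BIO labels in CoNLL-2003
num_training_steps = 1000  # e.g. len(train_loader) * num_epochs

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # warmup rate = 0.1
    num_training_steps=num_training_steps,
)
```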

### Training results

#### CoNLL2003

It achieves the following results on the evaluation set:
- Precision: 0.9707
- Recall: 0.9636
- F1: 0.9671

It achieves the following results on the test set:
- Precision: 0.9352
- Recall: 0.9218
- F1: 0.9285
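
These are entity-level scores. As a minimal sketch of how such scores are computed with `seqeval` (a standard implementation of the CoNLL evaluation; not necessarily the exact script used here):

```python
# Entity-level precision/recall/F1 on toy BIO sequences (illustrative data only).
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-LOC"]]

print(precision_score(y_true, y_pred))  # 0.5: one of two predicted entities is correct
print(recall_score(y_true, y_pred))     # 0.5: one of two gold entities is found
print(f1_score(y_true, y_pred))         # 0.5
```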

#### CoNLLpp(2023)

[Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?](https://aclanthology.org/2023.acl-long.459.pdf) ([github](https://github.com/ShuhengL/acl2023_conllpp))

- Precision: 0.9244
- Recall: 0.9225
- F1: 0.9235

#### CoNLLpp(CrossWeigh)

[CrossWeigh: Training Named Entity Tagger from Imperfect Annotations](https://aclanthology.org/D19-1519/) ([github](https://github.com/ZihanWangKi/CrossWeigh))

- Precision: 0.9449
- Recall: 0.9403
- F1: 0.9426

### Framework versions

- Transformers 4.35.2
- Pytorch 2.0.1+cu117