emilys committed on
Commit
6b7b2a0
·
verified ·
1 Parent(s): 8f96db2

Update README.md

  1. README.md +88 -0
README.md CHANGED
@@ -1,3 +1,91 @@
1
  ---
2
  license: cc-by-2.0
3
  ---
1
  ---
2
  license: cc-by-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: token-classification
6
  ---
7
+
8
+ # Historical newspaper NER
9
+
10
+ ## Model description
11
+
12
+ **historical_newspaper_ner** is a fine-tuned RoBERTa-large model for named entity recognition (NER) on text that may contain OCR errors.
13
+
14
+ It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
15
+
16
+ It was trained on a custom historical newspaper dataset with highly accurate labels: all data were double-entered by two highly skilled Harvard undergraduates, and all discrepancies were resolved by hand.
17
+
18
+
19
+ ## Intended uses
20
+
21
+ You can use this model with the Transformers `pipeline` for NER.
22
+
23
+ ```python
24
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
25
+ from transformers import pipeline
26
+
27
+ tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/historical_newspaper_ner")
28
+ model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/historical_newspaper_ner")
29
+
30
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer)
31
+ example = "My name is Wolfgang and I live in Berlin"
32
+
33
+ ner_results = nlp(example)
34
+ print(ner_results)
35
+ ```
36
+
37
+ ## Limitations and bias
38
+
39
+ This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.
40
+ Additionally, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary to handle those cases.
41
+
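One way to handle the subword cases described above is to merge adjacent tagged pieces by character offset. The sketch below is a minimal, hypothetical post-processing helper (`merge_subword_entities` is not part of this model or of Transformers). It assumes the input is a list of dicts with `entity`, `start`, and `end` keys, which is the shape the `ner` pipeline returns when no aggregation is applied; the token list here is made up for illustration. The pipeline's built-in `aggregation_strategy` argument (for example `"simple"`) may already be enough for many cases.

```python
def merge_subword_entities(tokens, text):
    """Merge tagged pieces that touch each other (no gap in character offsets).

    Hypothetical helper: `tokens` is assumed to be a list of dicts with
    "entity" (e.g. "B-PER") and "start"/"end" character offsets into `text`.
    """
    merged = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        label = tok["entity"].split("-", 1)[-1]  # "B-PER" -> "PER"
        prev = merged[-1] if merged else None
        # A piece that starts exactly where the previous one ended belongs to
        # the same word, so extend the previous span instead of opening a new one.
        if prev is not None and tok["start"] == prev["end"]:
            prev["end"] = tok["end"]
        else:
            merged.append(
                {"entity_group": label, "start": tok["start"], "end": tok["end"]}
            )
    # Recover the surface text of each merged span from the original string.
    for span in merged:
        span["word"] = text[span["start"]:span["end"]]
    return merged


example = "My name is Wolfgang and I live in Berlin"
# Made-up pipeline-style output in which "Wolfgang" was split into two pieces.
tokens = [
    {"entity": "B-PER", "start": 11, "end": 15},
    {"entity": "I-PER", "start": 15, "end": 19},
    {"entity": "B-LOC", "start": 34, "end": 40},
]
print(merge_subword_entities(tokens, example))
```

The merged span keeps the label of its first piece. The helper does not extend a span to cover untagged pieces of the same word, so partially tagged words would still need separate handling.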
42
+ ## Training data
43
+
44
+ The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. Each token will be classified as one of the following classes:
45
+
46
+ Abbreviation|Description
47
+ -|-
48
+ O|Outside of a named entity
49
+ B-MISC|Beginning of a miscellaneous entity
50
+ I-MISC|Continuation of a miscellaneous entity
51
+ B-PER|Beginning of a person’s name
52
+ I-PER|Continuation of a person’s name
53
+ B-ORG|Beginning of an organization
54
+ I-ORG|Continuation of an organization
55
+ B-LOC|Beginning of a location
56
+ I-LOC|Continuation of a location
57
+
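To illustrate why the beginning/continuation distinction matters, the following sketch decodes a tag sequence into entity spans. `bio_to_spans` is a hypothetical helper written for this example, not part of the model or its training code, and the tag sequence is made up.

```python
def bio_to_spans(tags):
    """Convert a sequence of B-/I-/O tags into (entity_type, start, end) spans.

    `end` is exclusive. An I- tag with no matching open span starts a new one.
    """
    spans = []
    current = None  # [entity_type, start] of the span being built
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == etype:
            continue  # still inside the current entity
        if current is not None:
            # Any other tag closes the open span just before position i.
            spans.append((current[0], current[1], i))
            current = None
        if prefix in ("B", "I"):
            current = [etype, i]
    if current is not None:
        spans.append((current[0], current[1], len(tags)))
    return spans


# Two back-to-back locations: the second B-LOC marks where the new entity begins.
tags = ["O", "B-LOC", "I-LOC", "B-LOC", "O", "B-PER", "I-PER"]
print(bio_to_spans(tags))
```

Without the `B-` prefix, the two adjacent locations in this example would be indistinguishable from a single three-token location.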
58
+ This model was fine-tuned on historical English-language news articles from American newspapers, digitized with OCR.
59
+ Unlike other NER datasets, this data has highly accurate labels. All data were double entered by two highly skilled Harvard undergraduates and all discrepancies were resolved by hand.
60
+
61
+
62
+ #### Number of articles and entities per type in each split
63
+ Split|Articles|PER|ORG|LOC|MISC
64
+ -|-|-|-|-|-
65
+ Train|227|1345|450|1191|1037
66
+ Dev|48|231|59|192|149
67
+ Test|48|261|83|199|181
68
+
69
+
70
+ ## Training procedure
71
+
72
+ The data were used to fine-tune a RoBERTa-large model (Liu et al., 2020) with a learning rate of 4.7e-05 and a batch size of 128 for 184 epochs.
73
+
74
+
75
+ ## Eval results
76
+ Entity type|F1
77
+ -|-
78
+ PER | 94.3
79
+ ORG | 80.7
80
+ LOC | 90.8
81
+ MISC | 79.6
82
+ Overall (stringent) | 86.5
83
+ Overall (ignoring entity type) | 90.4
84
+
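The card does not define the two overall scores. The sketch below shows one plausible reading, stated here as an assumption: the stringent score counts a prediction as correct only if both the span boundaries and the entity type match, while the second score requires only the boundaries to match. `span_f1` is a hypothetical helper, and the gold and predicted spans are made up; this is not the authors' evaluation code.

```python
def span_f1(gold, pred, ignore_type=False):
    """Span-level F1 in percent. Spans are (entity_type, start, end) tuples."""
    if ignore_type:
        # Assumed "ignoring entity type" variant: compare boundaries only.
        gold = {(start, end) for _, start, end in gold}
        pred = {(start, end) for _, start, end in pred}
    else:
        # Assumed "stringent" variant: boundaries and type must both match.
        gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)


gold = [("PER", 0, 2), ("LOC", 5, 6), ("ORG", 8, 10)]
pred = [("PER", 0, 2), ("ORG", 5, 6), ("ORG", 8, 10)]  # one span has the wrong type
strict = span_f1(gold, pred)
relaxed = span_f1(gold, pred, ignore_type=True)
print(f"strict: {strict:.1f}, ignoring type: {relaxed:.1f}")
```

Under this reading, the gap between the two overall scores in the table would reflect spans that were located correctly but assigned the wrong type.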
85
+
86
+
87
+
88
+ ## Notes
89
+
90
+ This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).
91
+