Update README.md

Corrected some typos

README.md CHANGED
@@ -66,7 +66,7 @@ The model was developed by the Berlin State Library (SBB) in the [QURATOR](https
 ## Model Description
 
 <!-- Provide a longer summary of what this model is/does. -->
-A BERT model trained on three German corpora containing contemporary and historical texts for
+A BERT model trained on three German corpora containing contemporary and historical texts for Named Entity Recognition (NER) tasks.
 It predicts the classes `PER`, `LOC` and `ORG`.
 
 - **Developed by:** [Kai Labusch](https://huggingface.co/labusch), [Clemens Neudecker](https://huggingface.co/cneud), David Zellhöfer
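
A minimal usage sketch for a token-classification model of this kind, using the Hugging Face `transformers` pipeline; the model identifier below is a placeholder and the exact label names may differ from this model's configuration:

```python
# Illustrative only: querying a German NER model through the transformers
# token-classification pipeline. MODEL_ID is a placeholder, not this repo's id.
from transformers import pipeline

MODEL_ID = "path-or-repo-id-of-this-model"  # placeholder

ner = pipeline(
    "token-classification",
    model=MODEL_ID,
    aggregation_strategy="simple",  # merge sub-word pieces into whole entity spans
)

text = "Die Staatsbibliothek zu Berlin liegt an der Straße Unter den Linden."
for entity in ner(text):
    # With aggregation enabled, each hit reports the entity class (PER, LOC, ORG),
    # the matched surface string and a confidence score.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```
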
@@ -100,9 +100,6 @@ Supported entity types are `PER`, `LOC` and `ORG`.
 The model has been pre-trained on 2,333,647 pages of OCR-text from the digitized collections of the Berlin State Library.
 It is therefore adapted to OCR-error-prone historical German texts and might be used for particular applications that involve such text material.
 
-
-
-
 ## Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
@@ -159,7 +156,7 @@ Since it is an incarnation of the original BERT-model published by Google, all t
 <!-- This section describes the evaluation protocols and provides the results. -->
 
 The model has been evaluated by 5-fold cross-validation on several German historical OCR ground truth datasets.
-See publication for
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for details.
 
 ## Testing Data, Factors & Metrics
 
@@ -168,29 +165,29 @@ See publication for detail.
 <!-- This should link to a Data Card if possible. -->
 
 Two different test sets contained in the CoNLL 2003 German Named Entity Recognition Ground Truth, i.e. TEST-A and TEST-B, have been used for testing (DE-CoNLL-TEST).
-Additionally, historical OCR-based ground truth datasets have been used for testing - see publication for details and below.
+Additionally, historical OCR-based ground truth datasets have been used for testing - see [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for details and the section below.
 
 
 ### Factors
 
 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
-The evaluation focuses on NER in historical German documents, see publication for details.
+The evaluation focuses on NER in historical German documents; see [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for details.
 
 ### Metrics
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 The performance metrics used in the evaluation are precision, recall and F1-score.
-See
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for actual results in terms of these metrics.
 
 ## Results
 
-See publication.
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf).
 
 # Model Examination
 
-See publication.
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf).
 
 # Environmental Impact
 
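A minimal sketch of the entity-level evaluation described above, computing precision, recall and F1 with the `seqeval` package over BIO-tagged sequences; the CoNLL-style file names and tag column are assumptions for illustration only:

```python
# Illustrative evaluation sketch: entity-level precision/recall/F1 with seqeval.
# File names and the tag column are assumptions for the example only.
from seqeval.metrics import f1_score, precision_score, recall_score

def read_conll_tags(path, tag_column=-1):
    """Collect one BIO tag sequence per sentence from a CoNLL-style file
    (one token per line, blank lines separating sentences)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append(line.split()[tag_column])
    if current:
        sentences.append(current)
    return sentences

gold = read_conll_tags("test_a.gold.conll")       # hypothetical gold annotations
pred = read_conll_tags("test_a.predicted.conll")  # hypothetical model output

print("precision:", precision_score(gold, pred))
print("recall:   ", recall_score(gold, pred))
print("F1:       ", f1_score(gold, pred))
```
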
@@ -256,9 +253,9 @@ More information needed.
 In addition to what has been documented above, it should be noted that there are two NER Ground Truth datasets available:
 
 1) [Data provided for the 2020 HIPE campaign on named entity processing](https://impresso.github.io/CLEF-HIPE-2020/)
-2) [Data
+2) [Data provided for the 2022 HIPE shared task on named entity processing](https://hipe-eval.github.io/HIPE-2022/)
 
-Furthermore, two papers have been published on NER/
+Furthermore, two papers have been published on NER/EL using BERT:
 
 1) [Entity Linking in Multilingual Newspapers and Classical Commentaries with BERT](http://ceur-ws.org/Vol-3180/paper-85.pdf)
 2) [Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT](http://ceur-ws.org/Vol-2696/paper_163.pdf)