cneud committed · Commit 9663d01 · 1 Parent(s): a75dfa5

Update README.md


Corrected some typos

Files changed (1)
  1. README.md +9 -12
README.md CHANGED
@@ -66,7 +66,7 @@ The model was developed by the Berlin State Library (SBB) in the [QURATOR](https
 ## Model Description
 
 <!-- Provide a longer summary of what this model is/does. -->
-A BERT model trained on three German corpora containing contemporary and historical texts for named entity recognition tasks.
+A BERT model trained on three German corpora containing contemporary and historical texts for Named Entity Recognition (NER) tasks.
 It predicts the classes `PER`, `LOC` and `ORG`.
 
 - **Developed by:** [Kai Labusch](https://huggingface.co/labusch), [Clemens Neudecker](https://huggingface.co/cneud), David Zellhöfer
@@ -100,9 +100,6 @@ Supported entity types are `PER`, `LOC` and `ORG`.
 The model has been pre-trained on 2,333,647 pages of OCR-text of the digitized collections of Berlin State Library.
 Therefore it is adapted to OCR-error prone historical German texts and might be used for particular applications that involve such text material.
 
-
-
-
 ## Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
@@ -159,7 +156,7 @@ Since it is an incarnation of the original BERT-model published by Google, all t
 <!-- This section describes the evaluation protocols and provides the results. -->
 
 The model has been evaluated by 5-fold cross-validation on several German historical OCR ground truth datasets.
-See publication for detail.
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for details.
 
 ## Testing Data, Factors & Metrics
 
@@ -168,29 +165,29 @@ See publication for detail.
 <!-- This should link to a Data Card if possible. -->
 
 Two different test sets contained in the CoNLL 2003 German Named Entity Recognition Ground Truth, i.e. TEST-A and TEST-B, have been used for testing (DE-CoNLL-TEST).
-Additionally, historical OCR-based ground truth datasets have been used for testing - see publication for details and below.
+Additionally, historical OCR-based ground truth datasets have been used for testing - see [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for details and below.
 
 
 ### Factors
 
 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
-The evaluation focuses on NER in historical German documents, see publication for details.
+The evaluation focuses on NER in historical German documents, see [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for details.
 
 ### Metrics
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 Performance metrics used in evaluation is precision, recall and F1-score.
-See paper for actual results in terms of these metrics.
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf) for actual results in terms of these metrics.
 
 ## Results
 
-See publication.
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf).
 
 # Model Examination
 
-See publication.
+See [publication](https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf).
 
 # Environmental Impact
 
@@ -256,9 +253,9 @@ More information needed.
 In addition to what has been documented above, it should be noted that there are two NER Ground Truth datasets available:
 
 1) [Data provided for the 2020 HIPE campaign on named entity processing](https://impresso.github.io/CLEF-HIPE-2020/)
-2) [Data providided for the 2022 HIPE shared task on named entity processing](https://hipe-eval.github.io/HIPE-2022/)
+2) [Data provided for the 2022 HIPE shared task on named entity processing](https://hipe-eval.github.io/HIPE-2022/)
 
-Furthermore, two papers have been published on NER/NED, using BERT:
+Furthermore, two papers have been published on NER/EL, using BERT:
 
 1) [Entity Linking in Multilingual Newspapers and Classical Commentaries with BERT](http://ceur-ws.org/Vol-3180/paper-85.pdf)
 2) [Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT](http://ceur-ws.org/Vol-2696/paper_163.pdf)
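
For orientation, here is a minimal sketch of how a BERT token-classification model like the one described in this card is typically queried with the `transformers` pipeline. The repo id `SBB/sbb_ner` and the sample sentence are assumptions for illustration, not taken from this commit.

```python
# Minimal sketch (assumptions noted above): run German NER with a
# token-classification pipeline and print PER/LOC/ORG predictions.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="SBB/sbb_ner",            # assumed repo id, substitute the actual model
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "Wilhelm von Humboldt gründete 1810 die Universität zu Berlin."
for entity in ner(text):
    # entity_group is one of the classes named in the card: PER, LOC or ORG
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```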
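The precision, recall and F1-score named in the Metrics section are the usual entity-level quantities computed from true positives (TP), false positives (FP) and false negatives (FN):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```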
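Finally, a rough sketch of the 5-fold cross-validation protocol mentioned under Evaluation, scored at the entity level with `seqeval`. Here `train_and_predict` is a hypothetical placeholder for the fine-tuning and inference step, not a function from this repository.

```python
# Hedged sketch of 5-fold cross-validation with entity-level scoring.
# train_and_predict is a hypothetical placeholder (see lead-in above).
from sklearn.model_selection import KFold
from seqeval.metrics import precision_score, recall_score, f1_score

def cross_validate(sentences, labels, train_and_predict, n_splits=5):
    """sentences: list of token lists; labels: list of matching BIO tag lists."""
    fold_scores = []
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, test_idx in splitter.split(sentences):
        train_x = [sentences[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [sentences[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        # Fine-tune on the training folds, then tag the held-out fold.
        pred_y = train_and_predict(train_x, train_y, test_x)
        fold_scores.append({
            "precision": precision_score(test_y, pred_y),
            "recall": recall_score(test_y, pred_y),
            "f1": f1_score(test_y, pred_y),
        })
    return fold_scores
```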