Files changed (1)
  1. README.md +245 -0
README.md CHANGED
@@ -1,3 +1,248 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
+
5
+ # Model Card for sbb_ned-en
6
+
7
+ <!-- Provide a quick summary of what the model is/does. -->
8
+
9
+ This model is part of a named entity disambiguation and linking system (NED, NEL).
10
+ The system was developed by Berlin State Library (SBB) in the [QURATOR](https://staatsbibliothek-berlin.de/die-staatsbibliothek/projekte/project-id-1060-2018) project.
11
+ Questions and comments about the model can be directed to Kai Labusch at [email protected] or Clemens Neudecker at [email protected].
12
+
13
+
14
+ # Table of Contents
15
+
16
+ - [Model Card for sbb_ned-en](#model-card-for-sbb_ned-en)
17
+ - [Table of Contents](#table-of-contents)
18
+ - [Model Details](#model-details)
19
+ - [Model Description](#model-description)
20
+ - [Uses](#uses)
21
+ - [Direct Use](#direct-use)
22
+ - [Downstream Use](#downstream-use)
23
+ - [Out-of-Scope Use](#out-of-scope-use)
24
+ - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
25
+ - [Recommendations](#recommendations)
26
+ - [Training Details](#training-details)
27
+ - [Training Data](#training-data)
28
+ - [Training Procedure](#training-procedure)
29
+ - [Preprocessing](#preprocessing)
30
+ - [Speeds, Sizes, Times](#speeds-sizes-times)
31
+ - [Training Hyperparameters](#training-hyperparameters)
32
+ - [Training Results](#training-results)
33
+ - [Evaluation](#evaluation)
34
+ - [Testing Data, Factors and Metrics](#testing-data-factors-and-metrics)
35
+ - [Environmental Impact](#environmental-impact)
36
+ - [Technical Specifications](#technical-specifications)
37
+ - [Software](#software)
38
+ - [Citation](#citation)
39
+ - [More Information](#more-information)
40
+ - [Model Card Authors](#model-card-authors)
41
+ - [Model Card Contact](#model-card-contact)
42
+ - [How to Get Started with the Model](#how-to-get-started-with-the-model)
43
+
44
+
45
+ # Model Details
46
+
47
+ ## Model Description
48
+
49
+ <!-- Provide a longer summary of what this model is/does. -->
50
+ This model forms the core of a named entity disambiguation and linking system (NED, NEL) that consists of three components:
51
+ (i) Lookup of possible candidates in an approximative nearest neighbour (ANN) index that stores BERT embeddings.
52
+ (ii) Evaluation of each candidate by comparison of text passages of Wikipedia performed by a purpose-trained BERT model.
53
+ (iii) Final ranking of candidates on the basis of information gathered from previous steps.
54
+
55
+ This model is used in order to generate the BERT embeddings in step (i) and to perform the comparison of the text passages in step (ii).
56
+
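The three steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SBB implementation: an exact cosine search stands in for the ANN index, the `same_entity_prob` callable stands in for the purpose-trained BERT evaluation model, and averaging its outputs is a placeholder for the final ranking step. All function and parameter names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_candidates(mention_vec: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Step (i): exact nearest-neighbour search, standing in for the ANN index."""
    ranked = sorted(index, key=lambda qid: cosine(mention_vec, index[qid]), reverse=True)
    return ranked[:k]


def link_mention(
    mention_vec: Vector,
    mention_text: str,
    index: Dict[str, Vector],
    passages: Dict[str, List[str]],
    same_entity_prob: Callable[[str, str], float],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return candidate entity IDs with scores, best candidate first."""
    scores: Dict[str, float] = {}
    for qid in lookup_candidates(mention_vec, index, k):
        # Step (ii): compare the mention context with each passage of the candidate.
        probs = [same_entity_prob(mention_text, p) for p in passages.get(qid, [])]
        # Step (iii), simplified: the mean probability is a placeholder ranking score.
        scores[qid] = sum(probs) / len(probs) if probs else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In the real system the embeddings and the pairwise comparison both come from the BERT model published here; the sketch only shows how the pieces fit together.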
57
+
58
+ - **Developed by:** [Kai Labusch](https://huggingface.co/labusch)
59
+ - **Shared by:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
60
+ - **Model type:** Language models
61
+ - **Language(s) (NLP):** en
62
+ - **License:** apache-2.0
63
+ - **Parent Model:** The BERT base multilingual cased model as provided by [Google](https://huggingface.co/bert-base-multilingual-cased)
64
+ - **Resources for more information:**
65
+ - [GitHub Repo](https://github.com/qurator-spk/sbb_ned/tree/6a2a48a9054b3a187b117e490513de5c41638844)
66
+ - Associated Paper 1 [CLEF 2020 HIPE paper](http://ceur-ws.org/Vol-2696/paper_163.pdf)
67
+ - Associated Paper 2 [CLEF 2022 HIPE paper](http://ceur-ws.org/Vol-3180/paper-85.pdf)
68
+
69
+ # Uses
70
+
71
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
72
+
73
+ Disciplines such as the *digital humanities* create use cases for text and data mining or the semantic enrichment of full texts with named entity recognition and linking, e.g., for the reconstruction of historical social networks. NED/NEL opens up new possibilities for improved access to text, knowledge creation, and clustering of texts. Linking against Wikidata IDs makes it possible to join the linked texts with the world knowledge provided by Wikidata by means of arbitrary SPARQL queries.
74
+
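As an illustration of joining linked texts with Wikidata, the following sketch builds a SPARQL query for a list of Wikidata IDs produced by the linking step. It assumes the query is sent to an endpoint where the `wd:`, `wdt:` and `rdfs:` prefixes are predefined, as they are on the public Wikidata Query Service; the function name and the choice of the coordinate property (P625) are examples only.

```python
import re
from typing import Iterable

_QID = re.compile(r"Q[1-9]\d*")
_LANG = re.compile(r"[a-z]{2,3}")


def build_query(qids: Iterable[str], lang: str = "en") -> str:
    """Build a SPARQL query returning labels and coordinates for the given QIDs."""
    ids = list(dict.fromkeys(qids))  # remove duplicates, keep order
    bad = [q for q in ids if not _QID.fullmatch(q)]
    if bad:
        raise ValueError(f"not valid Wikidata IDs: {bad}")
    if not _LANG.fullmatch(lang):
        raise ValueError(f"not a valid language code: {lang!r}")
    values = " ".join(f"wd:{q}" for q in ids)
    return (
        "SELECT ?item ?itemLabel ?coord WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item rdfs:label ?itemLabel .\n"
        f'  FILTER(LANG(?itemLabel) = "{lang}")\n'
        "  OPTIONAL { ?item wdt:P625 ?coord . }\n"
        "}"
    )


# Example: Q64 (Berlin) and Q90 (Paris) as linked entities.
print(build_query(["Q64", "Q90"]))
```

The resulting query string can be pasted into a SPARQL endpoint or sent with any HTTP client; validating the IDs before interpolation keeps malformed input out of the query.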
75
+
76
+ ## Direct Use
77
+
78
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
79
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
80
+
81
+ The NED/NEL system was developed on the basis of the [digitised collections of the Staatsbibliothek zu Berlin -- Berlin State Library](https://digital.staatsbibliothek-berlin.de/). The emphasis of this system is therefore on recognition and disambiguation of entities in historical texts.
82
+
83
+ ## Downstream Use
84
+
85
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
86
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
87
+
88
+ Due to the historical nature of the documents being digitised in libraries, standard methods and procedures from the NLP domain typically require additional adaptation in order to deal successfully with historical spelling variation and the remaining noise resulting from OCR errors. For use on other textual material, e.g. with an emphasis on entities contained in Wikipedias other than the German, English and French ones, significant adaptations have to be performed. In such a case, the methodology used to develop the process, as described in the related papers, can serve as a guide.
89
+
90
+ ## Out-of-Scope Use
91
+
92
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
93
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
94
+
95
+ Though technically possible, named entity disambiguation and linking does not necessarily work well on contemporary data. This is because the disambiguation process relies on a subset of the entities available on Wikidata. In other words, in order to be reliably identified, the persons, places, or organizations in question have to be present in the extracted Wikidata subset.
96
+
97
+ # Bias, Risks, and Limitations
98
+
99
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
100
+
101
+ The identification and disambiguation of named entities in historical and contemporary texts is a task that contributes to knowledge creation, with the aim of enhancing scientific research and improving the discoverability of information in digitised historical texts. The aim of the development of these models was to improve this knowledge creation process, an endeavour that was not undertaken for profit. The results of the applied models are freely accessible for the users of the digitised collections of the Berlin State Library. Against this backdrop, ethical challenges cannot be identified; rather, improved access and semantic enrichment of the derived full texts with NER and NEL serves every human being with access to the digital collections of the Berlin State Library. As a limitation, it has to be noted that in historical texts the vast majority of identified and disambiguated persons are white, heterosexual and male, whereas other groups (e.g., those defeated in a war, colonial subjects, and others) are often not mentioned in such texts or are not addressed as identifiable entities with full names.
102
+
103
+ The knowledge base has been directly derived from Wikidata and Wikipedia in a two-step process. In the first step, relevant entities have been selected by use of appropriate SPARQL queries on the basis of Wikidata. In the second step, for all selected entities relevant text comparison material has been extracted from Wikipedia.
104
+
105
+ ## Recommendations
106
+
107
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
108
+
109
+ Disambiguation of named entities proves to be challenging beyond the task of automatically identifying entities. Broad variation in the spelling of person and place names, caused by non-normalized orthography and linguistic change, as well as changes in the naming of places depending on context, adds to this challenge. Historical texts, especially newspapers, contain narrative descriptions and visual representations of minorities and disadvantaged groups without naming them; de-anonymizing such persons and groups is a research task in itself, which only began to be tackled in the 2020s. The biggest potential for improvement of the NER / NEL / NED system is to be expected from improved OCR performance and improved NEL recall.
110
+
111
+ # Training Details
112
+
113
+ ## Training Data
114
+
115
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
116
+
117
+ Training data have been made available on Zenodo in the form of an SQLite database of English text snippets. A data card for this data set is available on Zenodo. The English database is available at [10.5281/zenodo.7773987](https://doi.org/10.5281/zenodo.7773987).
118
+
119
+ ## Training Procedure
120
+
121
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
122
+
123
+ Before entity disambiguation starts, the input text is run through a named entity recognition (NER) system that tags all person (PER), location (LOC) and organization (ORG) entities; [see the related NER model on Hugging Face](https://huggingface.co/models?other=doi:10.57967/hf/0403). The NER system used is BERT-based, was developed previously at SBB, and is described in [this paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
124
+
125
+ The entity linking and disambiguation works by comparison of continuous text snippets where the entities in question are mentioned. A purpose-trained BERT model (the evaluation model) performs that text comparison task. Therefore, a knowledge base that contains structured information like Wikidata is not sufficient. Rather, additional continuous text is needed where the entities that are part of the knowledge base are discussed, mentioned and referenced. Hence, the knowledge base is derived in such a way that each entity in it has a corresponding Wikipedia page, since the Wikipedia articles contain continuous texts that have been annotated by human authors with references that can serve as ground truth.
126
+
127
+ ### Preprocessing
128
+
129
+ See section above.
130
+
131
+ ### Speeds, Sizes, Times
132
+
133
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
134
+ Since the NED models are purpose-trained BERT derivatives, all the speed and performance properties of standard BERT models apply.
135
+
136
+ The models were trained on a two-class classification task. Given a pair of sentences, the models decide whether or not the two sentences refer to the same entity.
137
+
138
+ The construction of the training samples is implemented in the [data processor](https://github.com/qurator-spk/sbb_ned/blob/6a2a48a9054b3a187b117e490513de5c41638844/qurator/sbb_ned/ground_truth/data_processor.py) that can be found in the GitHub repo.
139
+
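To make the two-class task concrete, here is a minimal sketch of how labelled sentence pairs could be drawn from a mapping of entities to sentences that mention them. It is not the code of the linked data processor, whose sampling strategy may differ; the function name, the alternation of positive and negative pairs, and the label convention (1 = same entity, 0 = different entities) are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (sentence_a, sentence_b, label)


def sample_pairs(sentences_by_entity: Dict[str, List[str]], n: int, seed: int = 0) -> List[Pair]:
    """Draw n sentence pairs, alternating positive (1) and negative (0) samples."""
    rng = random.Random(seed)
    # Entities usable for negatives need one sentence, for positives at least two.
    entities = [e for e, s in sentences_by_entity.items() if s]
    multi = [e for e in entities if len(sentences_by_entity[e]) >= 2]
    if not multi or len(entities) < 2:
        raise ValueError("need two entities, one of them with two or more sentences")
    pairs: List[Pair] = []
    for i in range(n):
        if i % 2 == 0:
            # Positive sample: two different sentences about the same entity.
            entity = rng.choice(multi)
            a, b = rng.sample(sentences_by_entity[entity], 2)
            pairs.append((a, b, 1))
        else:
            # Negative sample: one sentence each from two different entities.
            e1, e2 = rng.sample(entities, 2)
            a = rng.choice(sentences_by_entity[e1])
            b = rng.choice(sentences_by_entity[e2])
            pairs.append((a, b, 0))
    return pairs
```

Each resulting triple corresponds to one input of the sentence-pair classifier: the two sentences are fed to the BERT model together, and the label is the training target.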
140
+ ### Training Hyperparameters
141
+
142
+ The training can be performed by the [ned-bert](https://github.com/qurator-spk/sbb_ned/blob/6a2a48a9054b3a187b117e490513de5c41638844/qurator/sbb_ned/models/bert.py) command line tool. After installation of the sbb_ned package, type "ned-bert --help" in order to get more information about its functionality.
143
+
144
+ The training hyperparameters used can be found in the [Makefile](https://github.com/qurator-spk/sbb_ned/blob/6a2a48a9054b3a187b117e490513de5c41638844/Makefile). Here, the **de-ned-train-2**, **en-ned-train-1**, and **fr-ned-train-0** targets have been used in order to train the published models.
145
+
146
+ ### Training Results
147
+
148
+ During training, the [data processor](https://github.com/qurator-spk/sbb_ned/blob/6a2a48a9054b3a187b117e490513de5c41638844/qurator/sbb_ned/ground_truth/data_processor.py) that feeds the training process continuously generates new sentence pairs without repetition over the entire training period. The models have been trained for roughly two weeks on a V100 GPU. During the entire training period, the cross-entropy training loss was evaluated and continued to decrease.
149
+
150
+ # Evaluation
151
+
152
+ <!-- This section describes the evaluation protocols and provides the results, or cites relevant papers. -->
153
+ A first version of the system was evaluated at [CLEF 2020 HIPE](http://ceur-ws.org/Vol-2696/paper_163.pdf). Several lessons learned from that first evaluation were applied to the system and a second evaluation was performed at [CLEF 2022 HIPE](http://ceur-ws.org/Vol-3180/paper-85.pdf). The models published here are the ones that have been evaluated in the CLEF 2022 HIPE competition.
154
+
155
+ ## Testing Data, Factors and Metrics
156
+
157
+ Please refer to the papers mentioned above. For a more complete overview of the evaluation methodology, see the [CLEF HIPE 2020 Overview Paper](https://ceur-ws.org/Vol-2696/paper_255.pdf) and the [CLEF HIPE 2022 Overview Paper](https://ceur-ws.org/Vol-3180/paper-83.pdf).
158
+
159
+ # Environmental Impact
160
+
161
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
162
+
163
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
164
+
165
+ - **Hardware Type:** V100.
166
+ - **Hours used:** Roughly 1-2 weeks (about 168-336 hours).
167
+ - **Cloud Provider:** No cloud.
168
+ - **Compute Region:** Germany.
169
+ - **Carbon Emitted:** More information needed.
170
+
171
+ # Technical Specifications
172
+
173
+ ## Software
174
+
175
+ See the information and source code published on [GitHub](https://github.com/qurator-spk/sbb_ned/tree/6a2a48a9054b3a187b117e490513de5c41638844).
176
+
177
+ # Citation
178
+
179
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
180
+
181
+
182
+ **BibTeX:**
183
+
184
+ ```bibtex
185
+ @inproceedings{labusch_named_2020,
186
+ title = {Named {Entity} {Disambiguation} and {Linking} on {Historic} {Newspaper} {OCR} with {BERT}},
187
+ url = {https://ceur-ws.org/Vol-2696/paper_163.pdf},
188
+ abstract = {In this paper, we propose a named entity disambiguation and linking (NED, NEL) system that consists of three components: (i) Lookup of possible candidates in an approximative nearest neighbour (ANN) index that stores BERT-embeddings. (ii) Evaluation of each candidate by comparison of text passages of Wikipedia performed by a purpose-trained BERT model. (iii) Final ranking of candidates on the basis of information gathered from previous steps. We participated in the CLEF 2020 HIPE NERC-COARSE and NEL-LIT tasks for German, French, and English. The CLEF HIPE 2020 results show that our NEL approach is competitive in terms of precision but has low recall performance due to insufficient knowledge base coverage of the test data.},
189
+ language = {en},
190
+ booktitle = {{CLEF}},
191
+ author = {Labusch, Kai and Neudecker, Clemens},
192
+ year = {2020},
193
+ pages = {14},
194
+ }
195
+ ```
196
+
197
+ **APA:**
198
+
199
+ (Labusch & Neudecker, 2020)
200
+
201
+
202
+ **BibTeX:**
203
+
204
+ ```bibtex
205
+ @inproceedings{labusch_entity_2022,
206
+ title = {Entity {Linking} in {Multilingual} {Newspapers} and {Classical} {Commentaries} with {BERT}},
207
+ url = {http://ceur-ws.org/Vol-3180/paper-85.pdf},
208
+ abstract = {Building on our BERT-based entity recognition and three stage entity linking (EL) system [1] that we evaluated in the CLEF HIPE 2020 challenge [2], we focused in the CLEF HIPE 2022 challenge [3] on the entity linking part by participation in the EL-only tasks. We submitted results for the multilingual newspaper challenge (MNC), the multilingual classical commentary challenge (MCC), and the global adaptation challenge (GAC). This working note presents the most important modifications of the entity linking system in comparison to the HIPE 2020 approach and the additional results that have been obtained on the ajmc, hipe2020, newseye, topres19th, and sonar datasets for German, French, and English. The results show that our entity linking approach can be applied to a broad range of text categories and qualities without heavy adaptation and reveals qualitative differences of the impact of hyperparameters on our system that need further investigation.},
209
+ language = {en},
210
+ booktitle = {{CLEF}},
211
+ author = {Labusch, Kai and Neudecker, Clemens},
212
+ year = {2022},
213
+ pages = {11},
214
+ }
215
+ ```
216
+
217
+ **APA:**
218
+
219
+ (Labusch & Neudecker, 2022)
220
+
221
+ # More Information
222
+
223
+ A demo of the named entity recognition and disambiguation tool can be found [here](https://ravius.sbb.berlin/sbb-tools/index.html?ppn=766355942&model_id=precomputed&el_model_id=precomputed&task=ner). Please note that the ppn (Pica Production Number) found in the link can be replaced by the ppn of any other work in the [digitised collections of the Staatsbibliothek zu Berlin / Berlin State Library](https://digital.staatsbibliothek-berlin.de/), provided that there is a fulltext of this work available.
224
+
225
+ **MD5 hash of the English pytorch_model.bin:**
226
+
227
+ 5e919636a824161a7f0a0a830c21577f
228
+
229
+ **SHA256 hash of the English pytorch_model.bin:**
230
+
231
+ 7e2f70f3520d18b06108200b5a6e268717c129e713ed741fece3e27df14ccaf4
232
+
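The published hashes can be used to check a downloaded copy of the weights. The following sketch uses only Python's standard `hashlib` module; the helper names and the local file path in the usage note are hypothetical.

```python
import hashlib
from typing import Tuple

# Hashes of the English pytorch_model.bin as published in this model card.
EXPECTED_MD5 = "5e919636a824161a7f0a0a830c21577f"
EXPECTED_SHA256 = "7e2f70f3520d18b06108200b5a6e268717c129e713ed741fece3e27df14ccaf4"


def file_digests(path: str, chunk_size: int = 1 << 20) -> Tuple[str, str]:
    """Return (md5, sha256) hex digests, reading the file in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def matches_published_hashes(path: str) -> bool:
    """True only if both digests equal the values listed in the model card."""
    return file_digests(path) == (EXPECTED_MD5, EXPECTED_SHA256)
```

For example, `matches_published_hashes("pytorch_model.bin")` should return `True` for an intact download placed in the current directory. Reading in chunks keeps memory use low for a weights file of this size.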
233
+ # Model Card Authors
234
+
235
+ <!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->
236
+
237
+ Kai Labusch and Jörg Lehmann
238
+
239
+ # Model Card Contact
240
+
241
+ Questions and comments about the model can be directed to Kai Labusch at [email protected], questions and comments about the model card can be directed to Jörg Lehmann at [email protected]
242
+
243
+ # How to Get Started with the Model
244
+
245
+ How to get started with this model is explained in the [README of the GitHub repository](https://github.com/qurator-spk/sbb_ned/tree/6a2a48a9054b3a187b117e490513de5c41638844#readme).
246
+
247
+ Model Card as of September 12th, 2023
248
+