Commit · 39e53a6
1 Parent(s): f18c243
Update README.md

README.md CHANGED
@@ -1,4 +1,7 @@
 ---
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +9,68 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
 metrics:
 - precision
 - recall
 - f1
-widget:
 pipeline_tag: token-classification
 ---
 
-# SpanMarker
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
-
-
-
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
 ## Uses
 
 ### Direct Use for Inference
@@ -43,35 +79,10 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("
 # Run inference
-entities = model.predict("
-```
-
-### Downstream Use
-You can finetune this model on your own dataset.
-
-<details><summary>Click to expand</summary>
-
-```python
-from span_marker import SpanMarkerModel, Trainer
-
-# Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
-
-# Specify a Dataset with "tokens" and "ner_tag" columns
-dataset = load_dataset("conll2003") # For example CoNLL2003
-
-# Initialize a Trainer using the pretrained model & dataset
-trainer = Trainer(
-    model=model,
-    train_dataset=dataset["train"],
-    eval_dataset=dataset["validation"],
-)
-trainer.train()
-trainer.save_model("span_marker_model_id-finetuned")
 ```
-</details>
 
 <!--
 ### Out-of-Scope Use
@@ -79,6 +90,12 @@ trainer.save_model("span_marker_model_id-finetuned")
 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
 <!--
 ## Bias, Risks and Limitations
 
@@ -93,6 +110,45 @@ trainer.save_model("span_marker_model_id-finetuned")
 
 ## Training Details
 
 ### Framework Versions
 
 - Python: 3.10.12
@@ -130,4 +186,4 @@ trainer.save_model("span_marker_model_id-finetuned")
 ## Model Card Contact
 
 *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->
 ---
+language:
+- es
+license: cc-by-4.0
 library_name: span-marker
 tags:
 - span-marker
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+datasets:
+- conll2002
 metrics:
 - precision
 - recall
 - f1
+widget:
+- text: George Washington estuvo en Washington.
 pipeline_tag: token-classification
+base_model: PlanTL-GOB-ES/roberta-base-bne
+model-index:
+- name: SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
+  results:
+  - task:
+      type: token-classification
+      name: Named Entity Recognition
+    dataset:
+      name: conll2002
+      type: conll2002
+      split: eval
+    metrics:
+    - type: f1
+      value: 0.871172868582195
+      name: F1
+    - type: precision
+      value: 0.888328530259366
+      name: Precision
+    - type: recall
+      value: 0.8546672828096118
+      name: Recall
 ---
43 |
|
44 |
+
# SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
|
45 |
|
46 |
+
This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
|
47 |
|
48 |
## Model Details
|
49 |
|
50 |
### Model Description
|
51 |
|
52 |
- **Model Type:** SpanMarker
|
53 |
+
- **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne)
|
54 |
- **Maximum Sequence Length:** 256 tokens
|
55 |
- **Maximum Entity Length:** 8 words
|
56 |
+
- **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
|
57 |
+
- **Languages:** es
|
58 |
+
- **License:** cc-by-4.0
|
59 |
|
60 |
### Model Sources
|
61 |
|
62 |
- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
|
63 |
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
|
64 |
|
65 |
+
### Model Labels
|
66 |
+
|
67 |
+
| Label | Examples |
|
68 |
+
|:------|:------------------------------------------------------------------|
|
69 |
+
| LOC | "Australia", "Victoria", "Melbourne" |
|
70 |
+
| MISC | "Ley", "Ciudad", "CrimeNet" |
|
71 |
+
| ORG | "Commonwealth", "EFE", "Tribunal Supremo" |
|
72 |
+
| PER | "Abogado General del Estado", "Daryl Williams", "Abogado General" |
|
73 |
+
|
74 |
## Uses
|
75 |
|
76 |
### Direct Use for Inference
|
|
|
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("alvarobartt/span-marker-roberta-base-bne-conll-2002-es")
 # Run inference
+entities = model.predict("George Washington estuvo en Washington.")
 ```
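The predictions from the snippet above can be post-processed before use. A minimal sketch, assuming `predict` returns a list of dicts with `"span"`, `"label"`, and `"score"` keys as in the SpanMarker documentation; the sample `entities` list below is illustrative, not actual model output:

```python
def filter_entities(entities, threshold=0.5):
    """Keep only predicted entities whose confidence score meets the threshold."""
    return [e for e in entities if e["score"] >= threshold]

# Illustrative sample in the shape SpanMarker predictions use
# (real output comes from SpanMarkerModel.predict(...)).
entities = [
    {"span": "George Washington", "label": "PER", "score": 0.98},
    {"span": "Washington", "label": "LOC", "score": 0.42},
]
print(filter_entities(entities, threshold=0.5))
```

Tuning the threshold trades recall for precision; the 0.5 default here is an arbitrary illustration.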
 
 <!--
 ### Out-of-Scope Use
 
 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
+### ⚠️ Tokenizer Warning
+
+The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model was only exposed to the latter style, i.e. all words separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
+
+In short, it is recommended to preprocess your inference text so that all words and punctuation are separated by a space. One approach is the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html), which separates all words and punctuation automatically. Alternatively, you can convert regular text into this format with NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or a spaCy [`Doc`](https://spacy.io/api/doc#iter), joining the resulting words with a space.
+
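As a rough illustration of the recommended preprocessing (a stdlib-only sketch, not the spaCy or NLTK routes mentioned above, which handle more edge cases), a simple regex can split off punctuation and rejoin everything with single spaces:

```python
import re


def space_separate(text: str) -> str:
    """Separate words and punctuation marks by single spaces.

    \\w+ grabs runs of word characters; [^\\w\\s] grabs each
    punctuation mark as its own token.
    """
    return " ".join(re.findall(r"\w+|[^\w\s]", text))


print(space_separate("George Washington estuvo en Washington."))
# -> George Washington estuvo en Washington .
```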
99 |
<!--
|
100 |
## Bias, Risks and Limitations
|
101 |
|
|
|
110 |
|
111 |
## Training Details
|
112 |
|
+### Training Set Metrics
+
+| Training set          | Min | Median  | Max  |
+|:----------------------|:----|:--------|:-----|
+| Sentence length       | 1   | 31.8052 | 1238 |
+| Entities per sentence | 0   | 2.2586  | 160  |
+
+### Training Hyperparameters
+
+- learning_rate: 5e-05
+- train_batch_size: 16
+- eval_batch_size: 8
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 2
+
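The `linear` scheduler with `lr_scheduler_warmup_ratio: 0.1` means the learning rate ramps from 0 up to 5e-05 over the first 10% of steps, then decays linearly back to 0. A stdlib sketch of that shape (the exact `transformers` implementation may differ in details such as step offsets):

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 5e-05,
                     warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp proportionally to the step count.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly to 0 at the final step.
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


print(linear_warmup_lr(100, 1000))   # -> 5e-05 (peak, end of warmup)
print(linear_warmup_lr(1000, 1000))  # -> 0.0
```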
+### Training Results
+
+| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+| 0.1188 | 100  | 0.0704          | 0.0                  | 0.0               | 0.0           | 0.8608              |
+| 0.2375 | 200  | 0.0279          | 0.8765               | 0.4034            | 0.5525        | 0.9025              |
+| 0.3563 | 300  | 0.0158          | 0.8381               | 0.7211            | 0.7752        | 0.9524              |
+| 0.4751 | 400  | 0.0134          | 0.8525               | 0.7463            | 0.7959        | 0.9576              |
+| 0.5938 | 500  | 0.0130          | 0.8844               | 0.7549            | 0.8145        | 0.9560              |
+| 0.7126 | 600  | 0.0119          | 0.8480               | 0.8006            | 0.8236        | 0.9650              |
+| 0.8314 | 700  | 0.0098          | 0.8794               | 0.8408            | 0.8597        | 0.9695              |
+| 0.9501 | 800  | 0.0091          | 0.8842               | 0.8360            | 0.8594        | 0.9722              |
+| 1.0689 | 900  | 0.0093          | 0.8976               | 0.8387            | 0.8672        | 0.9698              |
+| 1.1876 | 1000 | 0.0094          | 0.8880               | 0.8517            | 0.8694        | 0.9739              |
+| 1.3064 | 1100 | 0.0086          | 0.8920               | 0.8530            | 0.8721        | 0.9737              |
+| 1.4252 | 1200 | 0.0092          | 0.8896               | 0.8452            | 0.8668        | 0.9728              |
+| 1.5439 | 1300 | 0.0094          | 0.8765               | 0.8313            | 0.8533        | 0.9720              |
+| 1.6627 | 1400 | 0.0089          | 0.8805               | 0.8445            | 0.8621        | 0.9720              |
+| 1.7815 | 1500 | 0.0088          | 0.8834               | 0.8581            | 0.8706        | 0.9747              |
+| 1.9002 | 1600 | 0.0088          | 0.8883               | 0.8547            | 0.8712        | 0.9747              |
+
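The F1 reported in the results above (and in the front matter) is the harmonic mean of precision and recall, which can be checked directly against the card's own metric values:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Evaluation metrics reported in this model card's front matter.
precision = 0.888328530259366
recall = 0.8546672828096118
print(f1_score(precision, recall))  # ≈ 0.87117, matching the reported F1
```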
 ### Framework Versions
 
 - Python: 3.10.12
 
 ## Model Card Contact
 
 *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->