Commit 39e53a6 by alvarobartt · 1 Parent(s): f18c243

Update README.md

Files changed (1):
  1. README.md +91 -35

README.md CHANGED
@@ -1,4 +1,7 @@
 ---
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +9,68 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
 metrics:
 - precision
 - recall
 - f1
- widget: []
 pipeline_tag: token-classification
 ---
 
- # SpanMarker
 
- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
- <!-- - **Encoder:** [Unknown](https://huggingface.co/models/unknown) -->
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
 ## Uses
 
 ### Direct Use for Inference
@@ -43,35 +79,10 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Run inference
- entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
- ```
-
- ### Downstream Use
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- ```python
- from span_marker import SpanMarkerModel, Trainer
-
- # Download from the 🤗 Hub
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
-
- # Specify a Dataset with "tokens" and "ner_tag" columns
- dataset = load_dataset("conll2003") # For example CoNLL2003
-
- # Initialize a Trainer using the pretrained model & dataset
- trainer = Trainer(
-     model=model,
-     train_dataset=dataset["train"],
-     eval_dataset=dataset["validation"],
- )
- trainer.train()
- trainer.save_model("span_marker_model_id-finetuned")
 ```
- </details>
 
 <!--
 ### Out-of-Scope Use
@@ -79,6 +90,12 @@ trainer.save_model("span_marker_model_id-finetuned")
 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
 <!--
 ## Bias, Risks and Limitations
 
@@ -93,6 +110,45 @@ trainer.save_model("span_marker_model_id-finetuned")
 
 ## Training Details
 
 ### Framework Versions
 
 - Python: 3.10.12
@@ -130,4 +186,4 @@ trainer.save_model("span_marker_model_id-finetuned")
 ## Model Card Contact
 
 *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
 ---
+ language:
+ - es
+ license: cc-by-4.0
 library_name: span-marker
 tags:
 - span-marker

 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+ datasets:
+ - conll2002
 metrics:
 - precision
 - recall
 - f1
+ widget:
+ - text: George Washington estuvo en Washington.
 pipeline_tag: token-classification
+ base_model: PlanTL-GOB-ES/roberta-base-bne
+ model-index:
+ - name: SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       name: conll2002
+       type: conll2002
+       split: eval
+     metrics:
+     - type: f1
+       value: 0.871172868582195
+       name: F1
+     - type: precision
+       value: 0.888328530259366
+       name: Precision
+     - type: recall
+       value: 0.8546672828096118
+       name: Recall
 ---
 
+ # SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
 
+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
+ - **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
+ - **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
+ - **Languages:** es
+ - **License:** cc-by-4.0
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
+ ### Model Labels
+
+ | Label | Examples                                                          |
+ |:------|:------------------------------------------------------------------|
+ | LOC   | "Australia", "Victoria", "Melbourne"                              |
+ | MISC  | "Ley", "Ciudad", "CrimeNet"                                       |
+ | ORG   | "Commonwealth", "EFE", "Tribunal Supremo"                         |
+ | PER   | "Abogado General del Estado", "Daryl Williams", "Abogado General" |
+
 ## Uses
 
 ### Direct Use for Inference
 
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("alvarobartt/span-marker-roberta-base-bne-conll-2002-es")
 # Run inference
+ entities = model.predict("George Washington estuvo en Washington.")
 ```
 
 
 <!--
 ### Out-of-Scope Use

 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
+ ### ⚠️ Tokenizer Warning
+
+ The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model was only exposed to the latter style, i.e. text in which all words and punctuation are separated by a space. Consequently, the model may perform worse on inference text written in the former style.
+
+ In short, it is recommended to preprocess your inference text so that all words and punctuation are separated by a space. One approach is to use the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html), which separates words and punctuation automatically. Alternatively, you can tokenize the text with NLTK's [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or iterate over a spaCy [`Doc`](https://spacy.io/api/doc#iter), then join the resulting tokens with spaces.
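As a minimal illustration of this preprocessing (an addition, not part of the original card), punctuation can also be split off with a simple regex; NLTK and spaCy handle edge cases such as abbreviations more robustly, so treat this as a sketch:

```python
import re

def space_separate(text: str) -> str:
    """Separate every word and punctuation mark with a single space,
    matching the whitespace-tokenized style seen during training."""
    # \w+ matches word characters (including accented letters),
    # while [^\w\s] matches each punctuation mark as its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
    return " ".join(tokens)

print(space_separate("George Washington estuvo en Washington."))
# George Washington estuvo en Washington .
```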
+
 <!--
 ## Bias, Risks and Limitations
 
 
 
 ## Training Details
 
+ ### Training Set Metrics
+
+ | Training set          | Min | Median  | Max  |
+ |:----------------------|:----|:--------|:-----|
+ | Sentence length       | 1   | 31.8052 | 1238 |
+ | Entities per sentence | 0   | 2.2586  | 160  |
+
+ ### Training Hyperparameters
+
+ - learning_rate: 5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 2
+
+ ### Training Results
+
+ | Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+ |:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+ | 0.1188 | 100  | 0.0704          | 0.0                  | 0.0               | 0.0           | 0.8608              |
+ | 0.2375 | 200  | 0.0279          | 0.8765               | 0.4034            | 0.5525        | 0.9025              |
+ | 0.3563 | 300  | 0.0158          | 0.8381               | 0.7211            | 0.7752        | 0.9524              |
+ | 0.4751 | 400  | 0.0134          | 0.8525               | 0.7463            | 0.7959        | 0.9576              |
+ | 0.5938 | 500  | 0.0130          | 0.8844               | 0.7549            | 0.8145        | 0.9560              |
+ | 0.7126 | 600  | 0.0119          | 0.8480               | 0.8006            | 0.8236        | 0.9650              |
+ | 0.8314 | 700  | 0.0098          | 0.8794               | 0.8408            | 0.8597        | 0.9695              |
+ | 0.9501 | 800  | 0.0091          | 0.8842               | 0.8360            | 0.8594        | 0.9722              |
+ | 1.0689 | 900  | 0.0093          | 0.8976               | 0.8387            | 0.8672        | 0.9698              |
+ | 1.1876 | 1000 | 0.0094          | 0.8880               | 0.8517            | 0.8694        | 0.9739              |
+ | 1.3064 | 1100 | 0.0086          | 0.8920               | 0.8530            | 0.8721        | 0.9737              |
+ | 1.4252 | 1200 | 0.0092          | 0.8896               | 0.8452            | 0.8668        | 0.9728              |
+ | 1.5439 | 1300 | 0.0094          | 0.8765               | 0.8313            | 0.8533        | 0.9720              |
+ | 1.6627 | 1400 | 0.0089          | 0.8805               | 0.8445            | 0.8621        | 0.9720              |
+ | 1.7815 | 1500 | 0.0088          | 0.8834               | 0.8581            | 0.8706        | 0.9747              |
+ | 1.9002 | 1600 | 0.0088          | 0.8883               | 0.8547            | 0.8712        | 0.9747              |
+
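As a quick consistency check (an addition, not part of the original card), the evaluation F1 reported in the model-index metadata is the harmonic mean of the reported precision and recall, as expected for micro-averaged NER scores:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Evaluation-split precision and recall from the model-index metadata
f1 = f1_score(0.888328530259366, 0.8546672828096118)
print(f1)  # ≈ 0.8712, matching the reported F1 of 0.871172868582195
```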
 ### Framework Versions
 
 - Python: 3.10.12

 ## Model Card Contact
 
 *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->