Commit 39e53a6 by alvarobartt · 1 Parent(s): f18c243

Update README.md

Files changed (1):
  1. README.md +91 -35

README.md CHANGED
@@ -1,4 +1,7 @@
 ---
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +9,68 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
 metrics:
 - precision
 - recall
 - f1
- widget: []
 pipeline_tag: token-classification
 ---
 
- # SpanMarker
 
- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
- <!-- - **Encoder:** [Unknown](https://huggingface.co/models/unknown) -->
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
 ## Uses
 
 ### Direct Use for Inference
@@ -43,35 +79,10 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Run inference
- entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
- ```
-
- ### Downstream Use
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- ```python
- from span_marker import SpanMarkerModel, Trainer
-
- # Download from the 🤗 Hub
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
-
- # Specify a Dataset with "tokens" and "ner_tag" columns
- dataset = load_dataset("conll2003") # For example CoNLL2003
-
- # Initialize a Trainer using the pretrained model & dataset
- trainer = Trainer(
-     model=model,
-     train_dataset=dataset["train"],
-     eval_dataset=dataset["validation"],
- )
- trainer.train()
- trainer.save_model("span_marker_model_id-finetuned")
 ```
- </details>
 
 <!--
 ### Out-of-Scope Use
@@ -79,6 +90,12 @@ trainer.save_model("span_marker_model_id-finetuned")
 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
 <!--
 ## Bias, Risks and Limitations
 
@@ -93,6 +110,45 @@ trainer.save_model("span_marker_model_id-finetuned")
 
 ## Training Details
 
 ### Framework Versions
 
 - Python: 3.10.12
@@ -130,4 +186,4 @@ trainer.save_model("span_marker_model_id-finetuned")
 ## Model Card Contact
 
 *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
 ---
+ language:
+ - es
+ license: cc-by-4.0
 library_name: span-marker
 tags:
 - span-marker

 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+ datasets:
+ - conll2002
 metrics:
 - precision
 - recall
 - f1
+ widget:
+ - text: George Washington estuvo en Washington.
 pipeline_tag: token-classification
+ base_model: PlanTL-GOB-ES/roberta-base-bne
+ model-index:
+ - name: SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       name: conll2002
+       type: conll2002
+       split: eval
+     metrics:
+     - type: f1
+       value: 0.871172868582195
+       name: F1
+     - type: precision
+       value: 0.888328530259366
+       name: Precision
+     - type: recall
+       value: 0.8546672828096118
+       name: Recall
 ---
 
+ # SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
 
+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
+ - **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
+ - **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
+ - **Languages:** es
+ - **License:** cc-by-4.0
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
+ ### Model Labels
+
+ | Label | Examples                                                          |
+ |:------|:------------------------------------------------------------------|
+ | LOC   | "Australia", "Victoria", "Melbourne"                              |
+ | MISC  | "Ley", "Ciudad", "CrimeNet"                                       |
+ | ORG   | "Commonwealth", "EFE", "Tribunal Supremo"                         |
+ | PER   | "Abogado General del Estado", "Daryl Williams", "Abogado General" |
+
 ## Uses
 
 ### Direct Use for Inference
 
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("alvarobartt/span-marker-roberta-base-bne-conll-2002-es")
 # Run inference
+ entities = model.predict("George Washington estuvo en Washington.")
 ```
 
 
 <!--
 ### Out-of-Scope Use

 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
+ ### ⚠️ Tokenizer Warning
+
+ The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model was only exposed to the latter style, i.e. text in which all words and punctuation are separated by a space. Consequently, the model may perform worse on inference text written in the former style.
+
+ In short, it is recommended to preprocess your inference text so that all words and punctuation are separated by a space. One approach is to use the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html), which separates words and punctuation automatically. Alternatively, you can tokenize the text with NLTK's [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or iterate over a spaCy [`Doc`](https://spacy.io/api/doc#iter), then join the resulting tokens with spaces.
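As a minimal illustration of this preprocessing (an addition, not part of the original card), punctuation can also be split off with a simple regex; NLTK and spaCy handle edge cases such as abbreviations more robustly, so treat this as a sketch:

```python
import re

def space_separate(text: str) -> str:
    """Separate every word and punctuation mark with a single space,
    matching the whitespace-tokenized style seen during training."""
    # \w+ matches word characters (including accented letters),
    # while [^\w\s] matches each punctuation mark as its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
    return " ".join(tokens)

print(space_separate("George Washington estuvo en Washington."))
# George Washington estuvo en Washington .
```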
+
 <!--
 ## Bias, Risks and Limitations
 
 
 
 ## Training Details
 
+ ### Training Set Metrics
+
+ | Training set          | Min | Median  | Max  |
+ |:----------------------|:----|:--------|:-----|
+ | Sentence length       | 1   | 31.8052 | 1238 |
+ | Entities per sentence | 0   | 2.2586  | 160  |
+
+ ### Training Hyperparameters
+
+ - learning_rate: 5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 2
+
+ ### Training Results
+
+ | Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+ |:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+ | 0.1188 | 100  | 0.0704          | 0.0                  | 0.0               | 0.0           | 0.8608              |
+ | 0.2375 | 200  | 0.0279          | 0.8765               | 0.4034            | 0.5525        | 0.9025              |
+ | 0.3563 | 300  | 0.0158          | 0.8381               | 0.7211            | 0.7752        | 0.9524              |
+ | 0.4751 | 400  | 0.0134          | 0.8525               | 0.7463            | 0.7959        | 0.9576              |
+ | 0.5938 | 500  | 0.0130          | 0.8844               | 0.7549            | 0.8145        | 0.9560              |
+ | 0.7126 | 600  | 0.0119          | 0.8480               | 0.8006            | 0.8236        | 0.9650              |
+ | 0.8314 | 700  | 0.0098          | 0.8794               | 0.8408            | 0.8597        | 0.9695              |
+ | 0.9501 | 800  | 0.0091          | 0.8842               | 0.8360            | 0.8594        | 0.9722              |
+ | 1.0689 | 900  | 0.0093          | 0.8976               | 0.8387            | 0.8672        | 0.9698              |
+ | 1.1876 | 1000 | 0.0094          | 0.8880               | 0.8517            | 0.8694        | 0.9739              |
+ | 1.3064 | 1100 | 0.0086          | 0.8920               | 0.8530            | 0.8721        | 0.9737              |
+ | 1.4252 | 1200 | 0.0092          | 0.8896               | 0.8452            | 0.8668        | 0.9728              |
+ | 1.5439 | 1300 | 0.0094          | 0.8765               | 0.8313            | 0.8533        | 0.9720              |
+ | 1.6627 | 1400 | 0.0089          | 0.8805               | 0.8445            | 0.8621        | 0.9720              |
+ | 1.7815 | 1500 | 0.0088          | 0.8834               | 0.8581            | 0.8706        | 0.9747              |
+ | 1.9002 | 1600 | 0.0088          | 0.8883               | 0.8547            | 0.8712        | 0.9747              |
+
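As a quick consistency check (an addition, not part of the original card), the evaluation F1 reported in the model-index metadata is the harmonic mean of the reported precision and recall, as expected for micro-averaged NER scores:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Evaluation-split precision and recall from the model-index metadata
f1 = f1_score(0.888328530259366, 0.8546672828096118)
print(f1)  # ≈ 0.8712, matching the reported F1 of 0.871172868582195
```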
 ### Framework Versions
 
 - Python: 3.10.12

 ## Model Card Contact
 
 *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->