---
language:
- es
license: cc-by-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- conll2002
metrics:
- precision
- recall
- f1
widget:
- text: George Washington estuvo en Washington.
pipeline_tag: token-classification
base_model: PlanTL-GOB-ES/roberta-base-bne
model-index:
- name: SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: conll2002
      type: conll2002
      split: eval
    metrics:
    - type: f1
      value: 0.871172868582195
      name: F1
    - type: precision
      value: 0.888328530259366
      name: Precision
    - type: recall
      value: 0.8546672828096118
      name: Recall
---

# SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
- **Languages:** es
- **License:** cc-by-4.0

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label | Examples                                                           |
|:------|:-------------------------------------------------------------------|
| LOC   | "Australia", "Victoria", "Melbourne"                               |
| MISC  | "Ley", "Ciudad", "CrimeNet"                                        |
| ORG   | "Commonwealth", "EFE", "Tribunal Supremo"                          |
| PER   | "Abogado General del Estado", "Daryl Williams", "Abogado General"  |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("alvarobartt/span-marker-roberta-base-bne-conll-2002-es")
# Run inference
entities = model.predict("George Washington estuvo en Washington.")
```

### ⚠️ Tokenizer Warning

The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model was only exposed to the latter style, i.e. with all words separated by a space. Consequently, the model may perform worse when the inference text is in the former style.

In short, it is recommended to preprocess your inference text so that all words and punctuation are separated by a space. One approach is to use the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html), which automatically separates all words and punctuation. Alternatively, you can tokenize the text with NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or a spaCy [`Doc`](https://spacy.io/api/doc#iter) and join the resulting words with a space.
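For example, here is a minimal sketch of this preprocessing step using NLTK's `word_tokenize` (this assumes `nltk` and its `punkt` tokenizer data are available; the spaCy integration mentioned above works just as well):

```python
import nltk
from nltk.tokenize import word_tokenize
from span_marker import SpanMarkerModel

# The "punkt" tokenizer data is required by word_tokenize.
nltk.download("punkt", quiet=True)

model = SpanMarkerModel.from_pretrained("alvarobartt/span-marker-roberta-base-bne-conll-2002-es")

text = "George Washington estuvo en Washington."
# Split words and punctuation apart, then rejoin them with single spaces so the
# input matches the whitespace-separated style the model saw during training.
pretokenized = " ".join(word_tokenize(text, language="spanish"))
entities = model.predict(pretokenized)
```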
## Training Details

### Training Set Metrics

| Training set          | Min | Median  | Max  |
|:----------------------|:----|:--------|:-----|
| Sentence length       | 1   | 31.8052 | 1238 |
| Entities per sentence | 0   | 2.2586  | 160  |

### Training Hyperparameters

- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 2

See the training sketch at the end of this card for how these hyperparameters map onto a SpanMarker `Trainer`.

### Training Results

| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.1188 | 100  | 0.0704          | 0.0                  | 0.0               | 0.0           | 0.8608              |
| 0.2375 | 200  | 0.0279          | 0.8765               | 0.4034            | 0.5525        | 0.9025              |
| 0.3563 | 300  | 0.0158          | 0.8381               | 0.7211            | 0.7752        | 0.9524              |
| 0.4751 | 400  | 0.0134          | 0.8525               | 0.7463            | 0.7959        | 0.9576              |
| 0.5938 | 500  | 0.0130          | 0.8844               | 0.7549            | 0.8145        | 0.9560              |
| 0.7126 | 600  | 0.0119          | 0.8480               | 0.8006            | 0.8236        | 0.9650              |
| 0.8314 | 700  | 0.0098          | 0.8794               | 0.8408            | 0.8597        | 0.9695              |
| 0.9501 | 800  | 0.0091          | 0.8842               | 0.8360            | 0.8594        | 0.9722              |
| 1.0689 | 900  | 0.0093          | 0.8976               | 0.8387            | 0.8672        | 0.9698              |
| 1.1876 | 1000 | 0.0094          | 0.8880               | 0.8517            | 0.8694        | 0.9739              |
| 1.3064 | 1100 | 0.0086          | 0.8920               | 0.8530            | 0.8721        | 0.9737              |
| 1.4252 | 1200 | 0.0092          | 0.8896               | 0.8452            | 0.8668        | 0.9728              |
| 1.5439 | 1300 | 0.0094          | 0.8765               | 0.8313            | 0.8533        | 0.9720              |
| 1.6627 | 1400 | 0.0089          | 0.8805               | 0.8445            | 0.8621        | 0.9720              |
| 1.7815 | 1500 | 0.0088          | 0.8834               | 0.8581            | 0.8706        | 0.9747              |
| 1.9002 | 1600 | 0.0088          | 0.8883               | 0.8547            | 0.8712        | 0.9747              |

### Framework Versions

- Python: 3.10.12
- SpanMarker: 1.3.1.dev
- Transformers: 4.33.2
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.5
- Tokenizers: 0.13.3

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```
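## Training Sketch

For reference, the hyperparameters listed under Training Hyperparameters roughly correspond to the setup below. This is a minimal sketch assuming the standard SpanMarker training recipe (a `span_marker.Trainer` wrapping `transformers.TrainingArguments`); the dataset config name (`"es"`) and the output directory are assumptions and may need adjusting, and it is not the exact script used to train this model.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer

# Load the Spanish portion of CoNLL-2002 (the "es" config name is an assumption).
dataset = load_dataset("conll2002", "es")
labels = dataset["train"].features["ner_tags"].feature.names

# Initialize a SpanMarker model on top of the roberta-base-bne encoder,
# matching the maximum sequence and entity lengths reported above.
model = SpanMarkerModel.from_pretrained(
    "PlanTL-GOB-ES/roberta-base-bne",
    labels=labels,
    model_max_length=256,
    entity_max_length=8,
)

# Training arguments mirroring the hyperparameters listed above;
# the output directory is hypothetical.
args = TrainingArguments(
    output_dir="models/span-marker-roberta-base-bne-conll2002-es",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    warmup_ratio=0.1,
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("models/span-marker-roberta-base-bne-conll2002-es/final")
```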