---
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- conll2003
metrics:
- precision
- recall
- f1
widget:
- text: Atlanta Games silver medal winner Edwards has called on other leading athletes to take part in the Sarajevo meeting--a goodwill gesture towards Bosnia as it recovers from the war in the Balkans--two days after the grand prix final in Milan.
- text: Portsmouth:Middlesex 199 and 426 (J. Pooley 111,M. Ramprakash 108,M. Gatting 83), Hampshire 232 and 109-5.
- text: Poland's Foreign Minister Dariusz Rosati will visit Yugoslavia on September 3 and 4 to revive a dialogue between the two governments which was effectively frozen in 1992,PAP news agency reported on Friday.
- text: The authorities are apparently extremely afraid of any political and social discontent," said Xiao,in Manila to attend an Amnesty International conference on human rights in China.
- text: American Nate Miller successfully defended his WBA cruiserweight title when he knocked out compatriot James Heath in the seventh round of their bout on Saturday.
pipeline_tag: token-classification
model-index:
- name: SpanMarker
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Unknown
      type: conll2003
      split: eval
    metrics:
    - type: f1
      value: 0.9550004205568171
      name: F1
    - type: precision
      value: 0.9542780299209951
      name: Precision
    - type: recall
      value: 0.9557239057239058
      name: Recall
---

# SpanMarker

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2003](https://huggingface.co/datasets/conll2003) dataset that can be used for Named Entity Recognition.

## Model Details

**Important note:** I used the tokenizer from `roberta-base`. Load that tokenizer and attach it to the model before running inference, as the lines marked with `+` show:

```diff
from span_marker import SpanMarkerModel
from span_marker.tokenizer import SpanMarkerTokenizer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-base-conll2003")
+tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.tokenizer.config)
+model.set_tokenizer(tokenizer)

# Run inference
entities = model.predict("Portsmouth:Middlesex 199 and 426 (J. Pooley 111,M. Ramprakash 108,M. Gatting 83), Hampshire 232 and 109-5.")
```

### Model Description

- **Model Type:** SpanMarker
- **Maximum Sequence Length:** 512 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [conll2003](https://huggingface.co/datasets/conll2003)

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label | Examples                                                      |
|:------|:--------------------------------------------------------------|
| LOC   | "Germany", "BRUSSELS", "Britain"                              |
| MISC  | "German", "British", "EU-wide"                                |
| ORG   | "European Commission", "EU", "European Union"                 |
| PER   | "Werner Zwingmann", "Nikolaus van der Pas", "Peter Blackburn" |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel
from span_marker.tokenizer import SpanMarkerTokenizer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-base-conll2003")
tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.tokenizer.config)
model.set_tokenizer(tokenizer)

# Run inference
entities = model.predict("Portsmouth:Middlesex 199 and 426 (J. Pooley 111,M. Ramprakash 108,M. Gatting 83), Hampshire 232 and 109-5.")
```

### Downstream Use

You can finetune this model on your own dataset.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")
```

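When fine-tuning on your own data, each row needs a `tokens` list and an equally long `ner_tags` list of integer label ids. The sketch below shows one way to build such a row from string tags. It assumes IOB2 tags and the label order of the Hugging Face conll2003 dataset; `encode_row` is a hypothetical helper for illustration, not part of SpanMarker, and the sentence is hand-written rather than taken from the training data.

```python
# Label order of the Hugging Face conll2003 dataset (IOB2 scheme).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}


def encode_row(tokens, tags):
    """Hypothetical helper: build one row with "tokens" and "ner_tags" columns."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    # Map each string tag to its integer id; an unknown tag raises KeyError.
    return {"tokens": list(tokens), "ner_tags": [LABEL2ID[tag] for tag in tags]}


# Hand-written example sentence (an assumption, not real training data).
row = encode_row(
    ["Peter", "Blackburn", "reports", "from", "BRUSSELS", "."],
    ["B-PER", "I-PER", "O", "O", "B-LOC", "O"],
)
print(row["ner_tags"])  # [1, 2, 0, 0, 5, 0]
```

Rows of this shape can then be collected into a `datasets.Dataset` and passed to the `Trainer` as `train_dataset`.
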
## Training Details

### Training Set Metrics

| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 14.5019 | 113 |
| Entities per sentence | 0   | 1.6736  | 20  |

### Training Hyperparameters

- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 5

### Training Results

| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:-----:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 1.0   | 883  | 0.0123          | 0.9293               | 0.9274            | 0.9284        | 0.9848              |
| 2.0   | 1766 | 0.0089          | 0.9412               | 0.9456            | 0.9434        | 0.9882              |
| 3.0   | 2649 | 0.0077          | 0.9499               | 0.9505            | 0.9502        | 0.9893              |
| 4.0   | 3532 | 0.0070          | 0.9527               | 0.9537            | 0.9532        | 0.9900              |
| 5.0   | 4415 | 0.0068          | 0.9543               | 0.9557            | 0.9550        | 0.9902              |

### Framework Versions

- Python: 3.10.12
- SpanMarker: 1.5.0
- Transformers: 4.36.0
- PyTorch: 2.0.0
- Datasets: 2.16.1
- Tokenizers: 0.15.0

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```