Commit 929d13f
Parent(s): 39e53a6
Update README.md
README.md
CHANGED
@@ -43,14 +43,14 @@ model-index:
|
|
43 |
|
44 |
# SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
|
45 |
|
46 |
-
This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/
|
47 |
|
48 |
## Model Details
|
49 |
|
50 |
### Model Description
|
51 |
|
52 |
- **Model Type:** SpanMarker
|
53 |
-
- **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/
|
54 |
- **Maximum Sequence Length:** 256 tokens
|
55 |
- **Maximum Entity Length:** 8 words
|
56 |
- **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
|
@@ -90,12 +90,6 @@ entities = model.predict("George Washington estuvo en Washington.")
|
|
90 |
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
|
91 |
-->
|
92 |
|
93 |
-
### ⚠️ Tokenizer Warning
|
94 |
-
|
95 |
-
The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
|
96 |
-
|
97 |
-
In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. One approach is to use the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html) which automatically separates all words and punctuation. Alternatively, some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and joining the resulting words with a space.
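The preprocessing recommended in the removed warning can be sketched roughly as follows. This is a minimal illustration, not the model author's method: `separate_punctuation` is a hypothetical helper, and the regular expression is a crude stand-in for NLTK `word_tokenize` or spaCy tokenization (for example, it splits abbreviations such as `EE.UU.` into separate pieces).

```python
import re


def separate_punctuation(text: str) -> str:
    """Hypothetical helper: put a space between words and punctuation.

    Assumption: a "word" is a run of word characters and every other
    non-space character is its own token. Real tokenizers are smarter.
    """
    # \w+ matches runs of letters/digits (Unicode-aware in Python 3);
    # [^\w\s] matches a single punctuation or symbol character.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tokens)


# "Paris." becomes "Paris ." so it matches the space-separated style
# that the warning says the model was trained on.
print(separate_punctuation("George Washington estuvo en Washington."))
```

The resulting string could then be passed to `model.predict(...)` in place of the raw text; whether this actually improves predictions for a given input is not something the diff states.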
|
98 |
-
|
99 |
<!--
|
100 |
## Bias, Risks and Limitations
|
101 |
|
|
|
43 |
|
44 |
# SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
|
45 |
|
46 |
+
This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
|
47 |
|
48 |
## Model Details
|
49 |
|
50 |
### Model Description
|
51 |
|
52 |
- **Model Type:** SpanMarker
|
53 |
+
- **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)
|
54 |
- **Maximum Sequence Length:** 256 tokens
|
55 |
- **Maximum Entity Length:** 8 words
|
56 |
- **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
|
|
|
90 |
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
|
91 |
-->
|
92 |
|
93 |
<!--
|
94 |
## Bias, Risks and Limitations
|
95 |
|