alvarobartt HF staff committed on
Commit 929d13f · 1 Parent(s): 39e53a6

Update README.md

Files changed (1)
  1. README.md +2 -8
README.md CHANGED
@@ -43,14 +43,14 @@ model-index:
 
 # SpanMarker with PlanTL-GOB-ES/roberta-base-bne on conll2002
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2002](https://huggingface.co/datasets/conll2002) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) as the underlying encoder.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-- **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne)
+- **Encoder:** [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
 - **Training Dataset:** [conll2002](https://huggingface.co/datasets/conll2002)
@@ -90,12 +90,6 @@ entities = model.predict("George Washington estuvo en Washington.")
 *List how the model may foreseeably be misused and address what users ought not to do with the model.*
 -->
 
-### ⚠️ Tokenizer Warning
-
-The [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/models/PlanTL-GOB-ES/roberta-base-bne) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
-
-In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. One approach is to use the [spaCy integration](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html) which automatically separates all words and punctuation. Alternatively, some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and joining the resulting words with a space.
-
 <!--
 ## Bias, Risks and Limitations
 
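
For reference, the Tokenizer Warning removed by this commit recommended preprocessing inference text so that all words and punctuation are separated by a space (`Paris .` rather than `Paris.`). The sketch below illustrates that preprocessing with a standard-library regex as a rough stand-in for the NLTK `word_tokenize` or spaCy approaches named in the warning. The helper name `space_punctuation` and the regex are assumptions for illustration only; they are not part of SpanMarker, NLTK, or spaCy.

```python
import re

# Hypothetical helper, not part of SpanMarker, NLTK, or spaCy.
# The pattern matches either a run of word characters or a single
# non-space, non-word character (i.e. one punctuation mark).
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def space_punctuation(text: str) -> str:
    """Return `text` with words and punctuation marks joined by single spaces."""
    return " ".join(_TOKEN_PATTERN.findall(text))


if __name__ == "__main__":
    print(space_punctuation("George Washington estuvo en Washington."))
```

The returned string could then be passed to `model.predict` in place of the raw text. This regex is cruder than a real tokenizer: it also splits abbreviations and hyphenated words at every punctuation mark, so a proper tokenizer is likely the better choice for real inputs.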