GPT-2 Tokenizer with unmerged digits
A fork of the GPT-2 tokenizer that removes multi-digit tokens, so numbers are tokenized one digit at a time:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokenizer.encode('123.45')       # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer.encode('123.45')  # [10163, 13, 2231]
```
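As a quick sanity check (a hedged sketch, not part of the model card, assuming both tokenizers are loaded as above), each digit character should keep the same single-token ID it has in stock GPT-2, as the example IDs above suggest:

```python
# Sanity-check sketch (assumption, not from the model card): every digit and '.'
# should map to the same single-character token ID as in the stock GPT-2 vocabulary.
for ch in '0123456789.':
    assert tokenizer.encode(ch) == gpt2_tokenizer.encode(ch), ch
```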
Backward-compatible: all other token IDs are unchanged, and the removed multi-digit tokens become unused placeholders:

```python
tokenizer.decode([10163, 46387])       # '<unused123> pigeon'
gpt2_tokenizer.decode([10163, 46387])  # '123 pigeon'
```
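For illustration, the same compatibility can be checked in the encoding direction; this is a usage sketch, with the word and its ID taken from the decode example above:

```python
# Usage sketch: a non-numeric token keeps its original GPT-2 ID, since only the
# multi-digit vocabulary entries were replaced with unused placeholders.
print(tokenizer.encode(' pigeon'))       # [46387]
print(gpt2_tokenizer.encode(' pigeon'))  # [46387]
```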