IndicTransTokenizer
The goal of this repository is to provide a simple, modular, and extendable tokenizer for IndicTrans2 that is compatible with the released HuggingFace models.
Pre-requisites
- Python 3.8+
- Indic NLP Library
- Other requirements as listed in requirements.txt
Configuration
- Editable installation (note: this may take a while):

```bash
git clone https://github.com/VarunGumma/IndicTransTokenizer
cd IndicTransTokenizer
pip install --editable ./
```
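To confirm the editable install worked, a quick import check (a hypothetical smoke test, not part of the repository) is enough:

```python
# Both classes should import without error after a successful installation.
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

print(IndicProcessor, IndicTransTokenizer)
```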
Usage

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

tokenizer = IndicTransTokenizer(direction="en-indic")
ip = IndicProcessor(inference=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True
)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

# Preprocess the raw sentences, then tokenize them into model inputs.
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
batch = tokenizer(batch, src=True, return_tensors="pt")

# Generate translations and decode them back into text.
with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

outputs = tokenizer.batch_decode(outputs, src=False)
outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)
```

```
>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']
```
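Translation in the other direction follows the same pattern. The sketch below assumes the `indic-en` direction string and the `ai4bharat/indictrans2-indic-en-dist-200M` checkpoint; verify both names against the released models before relying on them.

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

# Assumed direction string and checkpoint name for the Indic-to-English models.
tokenizer = IndicTransTokenizer(direction="indic-en")
ip = IndicProcessor(inference=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-en-dist-200M", trust_remote_code=True
)

sentences = ["यह एक परीक्षण वाक्य है।"]

# Same pipeline as above, with source and target languages swapped.
batch = ip.preprocess_batch(sentences, src_lang="hin_Deva", tgt_lang="eng_Latn")
batch = tokenizer(batch, src=True, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

outputs = tokenizer.batch_decode(outputs, src=False)
outputs = ip.postprocess_batch(outputs, lang="eng_Latn")
print(outputs)
```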
To use the tokenizer for training or fine-tuning the model, simply set the `inference` argument of `IndicProcessor` to `False`, as sketched below.
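For illustration, here is a minimal sketch of how a training batch might be assembled under that setting. The target-side handling (tokenizing references with `src=False` and feeding the resulting `input_ids` as `labels`) is an assumption extrapolated from the inference example above, not a documented training API, so verify it against the repository before training.

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

ip = IndicProcessor(inference=False)  # training mode
tokenizer = IndicTransTokenizer(direction="en-indic")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True
)

src_sentences = ["This is a test sentence."]
tgt_sentences = ["यह एक परीक्षण वाक्य है।"]  # reference translations

# Source side: same preprocessing/tokenization as at inference time.
batch = ip.preprocess_batch(src_sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
batch = tokenizer(batch, src=True, return_tensors="pt")

# Target side (assumption): tokenize references with src=False and use their
# input_ids as labels. In a real loop you would also mask padding positions
# with -100, the ignore index of HuggingFace's seq2seq loss.
batch["labels"] = tokenizer(tgt_sentences, src=False, return_tensors="pt")["input_ids"]

loss = model(**batch).loss  # standard seq2seq cross-entropy
loss.backward()
```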
Authors
- Varun Gumma ([email protected])
- Jay Gala ([email protected])
- Pranjal Agadh Chitale ([email protected])
- Raj Dabre ([email protected])
Bugs and Contribution
Since this a bleeding-edge module, you may encounter broken stuff and import issues once in a while. In case you encounter any bugs or want additional functionalities, please feel free to raise Issues
/Pull Requests
or contact the authors.
Citation
If you use our codebase, models, or tokenizer, please cite the following paper:
```bibtex
@article{gala2023indictrans,
    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA},
    note={}
}
```
Note
This tokenizer module is currently not compatible with the PreTrainedTokenizer module from HuggingFace. Hence, we are actively looking for Pull Requests to port this tokenizer to HF. Any leads on that front are welcome!