---
license: apache-2.0
language:
- hi
- en
metrics:
- bleu
- chrf
datasets:
- ai4bharat/samanantar
- ai4bharat/indic-instruct-data-v0.1
base_model: LingoIITGN/ganga-1b
---
# Model Card for Ganga-en-hi-1b! 🌊
The model **`Ganga-en-hi-1b`** is a fine-tuned version of **`Ganga-1b`** for the English-to-Hindi translation task.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)
### Model Description πŸ“š
**Project Unity** is an initiative to address **India's linguistic diversity** and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in **Indian languages**.
To achieve this, we train models on the monolingual regional languages of India. Our first release is the *Ganga-1b* model, which has been trained on a large dataset of public-domain, web-crawled Hindi text, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset has additionally been curated by native speakers to ensure high quality.
Notably, the **Ganga-1b** model outperforms existing open-source models that support **Indian languages**, even those with up to **7 billion parameters**.
- **Developed by:** [Lingo Research Group at IIT Gandhinagar](https://labs.iitgn.ac.in/lingo/)
- **Model type:** Autoregressive Language Model
- **Language(s) (NLP):** Bilingual (Primary: *Hindi* [**hi**], Secondary: *English* [**en**])
- **License:** Apache 2.0
## How to Get Started with the Model πŸ‘¨πŸ»β€πŸ’»
Use the code below to get started with the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-1b", device_map="auto")

# Wrap the prompt in the instruction format expected by the model.
input_text = "<bos>[INST]How are you?[/INST]"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```
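For the English-to-Hindi translation use case this model is fine-tuned for, the same generation loop can be driven by a translation-style instruction. The prompt wording below is only an illustrative assumption; the exact instruction phrasing used during fine-tuning may differ.

```python
# Hypothetical translation prompt: the instruction wording is an assumption;
# only the <bos>[INST]...[/INST] wrapper comes from the example above.
input_text = "<bos>[INST]Translate the following sentence into Hindi: How are you?[/INST]"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)
# Print only the generated Hindi continuation.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```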
## Evaluation
[More Information Needed]
### Results πŸ†
<details open>
<summary>Metrics</summary>
<br>
| Model | chrF<sub>FLORES</sub> | chrF<sub>IN22</sub> |
|:-----------:|:---------:|:------:|
| ***Ganga-1b*** | ***40.75*** | ***37.54*** |
| Gemma-2b | 39.96 | 34.62 |
| Llama-8b | 35.04 | 30.03 |
</details>
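chrF can be computed with standard tooling. The sketch below uses the Hugging Face `evaluate` wrapper around sacreBLEU's chrF implementation; it is only an illustration, not the evaluation pipeline used to produce the table above, and the sentences are placeholders rather than FLORES or IN22 items.

```python
# Minimal sketch of chrF scoring with the `evaluate` library.
# The predictions/references are placeholder examples, not FLORES/IN22 data.
import evaluate

chrf = evaluate.load("chrf")

predictions = ["आप कैसे हैं?"]   # model translations
references = [["आप कैसे हैं?"]]  # one list of reference translations per prediction

result = chrf.compute(predictions=predictions, references=references)
print(round(result["score"], 2))
```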
## Summary
## Bias, Risks, and Limitations 🚨
### Recommendations ‼️
<span style="color:red">This model described is a research preview and is under ongoing iterative updations, and as such, it only provides limited safety measures. Additionally, it may generate offensive content. It is strictly prohibited to use the model for any illegal, harmful, violent, racist, or sexual purposes.</span>
## Model Card Contact βœ‰οΈ
[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br>
Mail at: [[email protected]](mailto:[email protected])