aamodthakur committed on Commit 1330fe5 · verified · 1 Parent(s): a3b20a4

Update README.md

Files changed (1): README.md (+122, −3)
---
license: apache-2.0
language:
- hi
- en
metrics:
- bleu
- chrf
datasets:
- ai4bharat/samanantar
- ai4bharat/indic-instruct-data-v0.1
base_model: LingoIITGN/ganga-1b
---

# Model Card for Ganga-1b! 🌊

The base model **`Ganga-1b`** was trained on a monolingual **Hindi** dataset as part of ***Project Unity***. We propose the name *Ganga* 🌊 in honor of the longest river flowing through the Hindi-speaking region of India 🇮🇳.

*(The first pre-trained Hindi model by any academic research lab in India 🇮🇳!)*

![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)

### Model Description 📚

**Project Unity** is an initiative to address **India's linguistic diversity** and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in **Indian languages**.
To achieve this, we train models on the monolingual regional languages of India. Our first release is the **Ganga-1b** model, *trained on a large dataset of public-domain, web-crawled Hindi text, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality)*. The dataset has additionally been curated by native Hindi speakers to ensure high quality.
Notably, the **Ganga-1b** model outperforms existing open-source models that support **Indian languages**, including models of up to **7 billion parameters**.

- **Developed by:** [Lingo Research Group at IIT Gandhinagar](https://labs.iitgn.ac.in/lingo/)
- **Model type:** Autoregressive Language Model
- **Language(s) (NLP):** Bilingual (Primary: *Hindi* [**hi**], Secondary: *English* [**en**])
- **License:** Apache 2.0

## How to Get Started with the Model 👨🏻‍💻

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-1b", device_map="auto")

# Wrap the prompt in the model's instruction format
input_text = "<bos>[INST]How are you?[/INST]"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```

## Technical Specifications 🤖

- **Precision**: *Float32*
- **Context Length**: *2,048*
- **Learning Rate**: *4e-4*
- **Optimizer**: *AdamW*
- **LR Scheduler**: *Cosine*
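
As a rough illustration, these hyperparameters could be expressed in a Hugging Face `TrainingArguments` configuration as sketched below; the batch size, warmup, and step counts are placeholder assumptions, not values from the actual training run.

```python
from transformers import TrainingArguments

# Hypothetical pre-training setup mirroring the listed hyperparameters.
# Only learning_rate, optim, and lr_scheduler_type come from this card;
# the remaining values are illustrative placeholders.
training_args = TrainingArguments(
    output_dir="ganga-1b-pretrain",
    learning_rate=4e-4,               # listed learning rate
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="cosine",       # cosine LR schedule
    warmup_steps=1_000,               # assumption
    per_device_train_batch_size=8,    # assumption
    max_steps=100_000,                # assumption
    bf16=False,                       # training used float32 precision
)
```
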
### Model Architecture and Objective

Ganga-1b is a decoder-only transformer model with the following specifications:

* Layers: 16
* Attention heads: 32
* Embedding dimension: 2,048
* Vocabulary size: 30,000
* Sliding window: 512
* Intermediate dimension: 7,168
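
These values can be checked against the published configuration. The sketch below assumes a Mistral-style config, so the attribute names are an assumption rather than confirmed by this card.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("LingoIITGN/ganga-1b")

# Attribute names assume a Mistral-style configuration.
print(config.num_hidden_layers)    # expected: 16
print(config.num_attention_heads)  # expected: 32
print(config.hidden_size)          # expected: 2048
print(config.vocab_size)           # expected: 30000
print(config.sliding_window)       # expected: 512
print(config.intermediate_size)    # expected: 7168
```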

## Evaluation
[More Information Needed]

### Results 🏆

<details open>
<summary>Metrics</summary>
<br>

| Model | PPL<sub>Flores Dataset</sub> | PPL<sub>IN22 Dataset</sub> |
|:-----------:|:---------:|:------:|
| ***Ganga-1b*** | ***40.75*** | ***37.54*** |
| Gemma-2b | 39.96 | 34.62 |
| Gemma-9b | 48.91 | 40.86 |
| Llama-8b | 35.04 | 30.03 |
| IndicTrans | 60.17 | 57.73 |
| Airavata-7b | 57.41 | 54.90 |

</details>
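
Perplexity (PPL) is the exponential of the mean negative log-likelihood the model assigns to held-out text. A minimal sketch of how such a number can be computed is shown below; the sample sentence is illustrative and this is not the exact evaluation protocol behind the table above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-1b")
model.eval()

# Illustrative Hindi sentence, not drawn from the Flores or IN22 sets.
text = "भारत एक विशाल देश है।"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # With labels=input_ids, the forward pass returns the mean cross-entropy loss.
    loss = model(input_ids, labels=input_ids).loss

print(torch.exp(loss).item())  # perplexity = exp(mean NLL)
```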

## Summary

## Bias, Risks, and Limitations 🚨

### Recommendations ‼️

<span style="color:red">This model is a research preview under ongoing iterative updates and, as such, provides only limited safety measures. It may generate offensive content. It is strictly prohibited to use the model for any illegal, harmful, violent, racist, or sexual purposes.</span>

## More Information

**DEMO:** [https://huggingface.co/spaces/Lingo-IITGN/ganga-1b](https://huggingface.co/spaces/Lingo-IITGN/ganga-1b)

## Model Card Contact ✉️

[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/)