Commit cee0227 by Smiral · 0 Parent(s)

Release of KooBERT

.gitattributes ADDED
@@ -0,0 +1,34 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
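As a rough illustration of what these patterns do, the sketch below checks which of the files added in this commit would match an LFS pattern. It uses Python's `fnmatch` as an approximation of Git's attribute matching (the two do not share exactly the same rules), and `is_lfs_tracked` is a hypothetical helper name, not part of Git or Git LFS.

```python
from fnmatch import fnmatch

# A subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.h5", "*.onnx", "*.zip", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the file name match any of the LFS patterns?"""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


# File names taken from this commit.
for name in ["pytorch_model.bin", "config.json", "tokenizer.json", "README.md"]:
    route = "Git LFS" if is_lfs_tracked(name) else "regular Git"
    print(f"{name}: {route}")
```

Under this approximation only `pytorch_model.bin` is routed to LFS, which matches the pointer file shown for it later in the diff.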
README.md ADDED
@@ -0,0 +1,186 @@
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ {}
+ ---
+
+ # Model Card for KooBERT
+
+ KooBERT is a masked language model trained on data from the multilingual micro-blogging social media platform [Koo India](https://www.kooapp.com/). <br>
+ This model was built as a collaboration between Koo India and AI4Bharat.
+
+ ## Model Details
+
+ ### Model Description
+
+ On the Koo platform, microblogs (Koos) are limited to 400 characters and are available in multiple languages.
+ The model was trained with a masked language modeling objective on a dataset of multilingual Koos posted between Jan 2020 and Nov 2022.
+
+ - **Model type:** BERT-based pretrained model
+ - **Language(s) (NLP):** Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nigerian English, Oriya, Punjabi, Tamil, Telugu
+ - **License:** KooBERT is released under the MIT License.
+
+ ## Uses
+
+ This model can be used for downstream tasks such as content classification and toxicity detection in the supported Indic languages. <br>
+ It can also be used with the sentence-transformers library to create multilingual vector embeddings for other uses.
+
+ ## Bias, Risks, and Limitations
+
+ As with any machine learning model, KooBERT may have limitations and biases. The model was trained on Koo social media data and may not generalize well to other domains. It may also reflect biases present in its training data, which can affect its predictions. We recommend evaluating the model on your specific use case and data to ensure it is appropriate for your needs.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model for general fine-tuning tasks. Note that this is only a sample fine-tuning script.
+
+ ```
+ import numpy as np
+ import torch
+ import evaluate
+ from datasets import load_dataset
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
+
+ # Accuracy metric used during evaluation
+ metric = evaluate.load("accuracy")
+
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     predictions = np.argmax(logits, axis=-1)
+     return metric.compute(predictions=predictions, references=labels)
+
+ # Load the CoLA dataset
+ cola_dataset = load_dataset("glue", "cola")
+
+ cola_dataset = cola_dataset.rename_column('label', 'labels')
+ cola_dataset = cola_dataset.rename_column('sentence', 'text')
+
+ # Load the tokenizer and model; a classification head is needed so the Trainer can compute a loss
+ tokenizer = AutoTokenizer.from_pretrained("Koodsml/KooBERT")
+ model = AutoModelForSequenceClassification.from_pretrained("Koodsml/KooBERT", num_labels=2)
+
+ def tokenize_function(examples):
+     return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
+
+ cola_dataset = cola_dataset.map(tokenize_function, batched=True)
+
+ # Set the device
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ # Define the training arguments
+ training_args = TrainingArguments(
+     output_dir='./results',
+     evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
+     # eval_steps=100,
+     save_total_limit=1,
+     learning_rate=2e-5,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     num_train_epochs=3,
+     weight_decay=0.01,
+     push_to_hub=False,
+ )
+
+ # Define the trainer
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=cola_dataset['train'],
+     eval_dataset=cola_dataset['validation'],
+     # tokenizer=tokenizer,
+     compute_metrics=compute_metrics
+ )
+
+ # Fine-tune on the CoLA dataset
+ trainer.train()
+
+ # Evaluate on the CoLA dataset
+ eval_results = trainer.evaluate(eval_dataset=cola_dataset['validation'])
+ print(eval_results)
+ ```
+
+ KooBERT can also be used with the sentence-transformers library to create multilingual vector embeddings. Here is an example:
+
+ ```
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Use a GPU when one is available, otherwise fall back to the CPU
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the KooBERT model
+ koo_model = SentenceTransformer('Koodsml/KooBERT', device=device)
+
+ # Define the text
+ text = "यह हमेशा से हमारी सोच है"
+
+ # Get the embedding
+ embedding = koo_model.encode(text)
+ print(embedding)
+ ```
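One common use of such embeddings is comparing two texts by cosine similarity. The sketch below is a minimal, dependency-free illustration; the three short vectors are stand-ins chosen for the example, not real KooBERT outputs, and `cosine_similarity` is a hypothetical helper rather than a sentence-transformers function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors; in practice these would come from koo_model.encode(...).
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.4, 0.2, 1.4]   # same direction as emb_a, different length
emb_c = [0.1, -0.2, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Because cosine similarity ignores vector length, it is a reasonable default when embedding norms vary between texts.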
+
+ ## Training Details
+
+ ### Training Data
+
+ The following table shows the distribution of Koos and tokens across languages:
+
+ | Language         | Koos        | Avg Tokens per Koo  | Total Tokens |
+ |------------------|-------------|---------------------|--------------|
+ | Assamese         | 562,050     | 16.4414198          | 9,240,900    |
+ | Bengali          | 2,110,380   | 12.08918773         | 25,512,780   |
+ | English          | 17,889,600  | 10.93732057         | 195,664,290  |
+ | Gujarati         | 1,825,770   | 14.33965395         | 26,180,910   |
+ | Hindi            | 35,948,760  | 16.2337502          | 583,583,190  |
+ | Kannada          | 2,653,860   | 12.04577107         | 31,967,790   |
+ | Malayalam        | 71,370      | 10.32744851         | 737,070      |
+ | Marathi          | 1,894,080   | 14.81544602         | 28,061,640   |
+ | Nigerian English | 255,330     | 17.11350018         | 4,369,590    |
+ | Oriya            | 87,930      | 14.1941317          | 1,248,090    |
+ | Punjabi          | 940,260     | 18.59961075         | 17,488,470   |
+ | Tamil            | 1,687,710   | 12.12147822         | 20,457,540   |
+ | Telugu           | 2,471,940   | 10.55735576         | 26,097,150   |
+
+ Total Koos = 68,399,040<br>
+ Total Tokens = 970,609,410
+
+ ### Training Procedure
+
+ #### Preprocessing
+
+ Personally Identifiable Information (PII) was removed from the microblog data before training.
+ Temperature sampling was used to upsample low-resource languages, with a temperature of 0.7 (see Sec. 3.1 of https://arxiv.org/pdf/1901.07291.pdf).
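The upsampling step can be sketched as follows, assuming the scheme from the cited Section 3.1, where each language's sampling probability is its data share raised to an exponent and then renormalized. Treating the stated temperature of 0.7 as that exponent is an assumption, since the model card does not spell out the formula, and `sampling_probs` is a hypothetical helper name. The token counts are copied from the training data table.

```python
# Total tokens per language, copied from the training data table.
token_counts = {
    "assamese": 9_240_900, "bengali": 25_512_780, "english": 195_664_290,
    "gujarati": 26_180_910, "hindi": 583_583_190, "kannada": 31_967_790,
    "malayalam": 737_070, "marathi": 28_061_640, "nigerian english": 4_369_590,
    "oriya": 1_248_090, "punjabi": 17_488_470, "tamil": 20_457_540,
    "telugu": 26_097_150,
}


def sampling_probs(counts, alpha):
    """Return q_i = p_i**alpha / sum_j p_j**alpha, where p_i is the raw share."""
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


ALPHA = 0.7  # assumption: the stated temperature is used as the exponent
raw = sampling_probs(token_counts, 1.0)    # alpha = 1 reproduces the raw shares
probs = sampling_probs(token_counts, ALPHA)

for lang in ("hindi", "english", "malayalam"):
    print(f"{lang:10s} raw share {raw[lang]:.4f} -> sampling share {probs[lang]:.4f}")
```

With an exponent below 1, the largest language (Hindi) is sampled less often than its raw share and the smallest (Malayalam) more often, which is the intended upsampling effect.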
+
+ #### Training Hyperparameters
+
+ - **Training regime**
+   - Training steps - 1M steps
+   - Warmup - 10k steps
+   - Learning rate - 5e-4
+   - Scheduler - Linear decay
+   - Optimizer - Adam
+   - Batch size - 4096 sequences
+   - Precision - fp32
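To make the schedule concrete, here is a minimal sketch of a learning-rate curve built from the listed values. It assumes a linear warmup from zero to the peak rate followed by a linear decay to zero at the final step; the model card only says "Linear Decay", so the warmup shape and the zero end point are assumptions, and `learning_rate` is a hypothetical helper.

```python
PEAK_LR = 5e-4            # learning rate from the list above
WARMUP_STEPS = 10_000     # warmup steps from the list above
TOTAL_STEPS = 1_000_000   # training steps from the list above


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

For scale, with a batch of 4096 sequences and the stated maximum sequence length of 128 tokens, each step covers at most 4096 × 128 = 524,288 tokens.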
+
+ ## Evaluation
+
+ The model has not been benchmarked yet. We will release the benchmark data in a future update.
+
+ ## Model Examination
+
+ ### Model Architecture and Objective
+
+ KooBERT uses the BERT architecture and was pretrained with a masked language modeling objective, with a vocabulary of 128k tokens and a maximum sequence length of 128 tokens.
+
+ ### Compute Infrastructure
+
+ KooBERT was trained on a TPU v3 with 128 cores; training took over 5 days.
+
+ ## Contributors
+
+ Mitesh Khapra ([email protected]) - IITM AI4Bharat<br>
+ Sumanth Doddapaneni ([email protected]) - IITM AI4Bharat<br>
+ Smiral Rashinkar ([email protected]) - Koo India
+
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "/data/sumanth/conversion/koo/",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "embedding_size": 768,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 128,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.21.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 128000
+ }
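The values in this config are enough for a back-of-the-envelope parameter count. The sketch below assumes the usual BERT masked-LM layout (embeddings, 12 identical encoder layers, an MLM head whose decoder weights are tied to the input embeddings, and no pooler); that layout is an assumption about the checkpoint, not something the commit states.

```python
# Values copied from config.json.
VOCAB_SIZE = 128_000
HIDDEN = 768
INTERMEDIATE = 3072
LAYERS = 12
MAX_POSITIONS = 128
TYPE_VOCAB = 2


def linear(n_in: int, n_out: int) -> int:
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + LAYER_NORM

per_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
    + LAYER_NORM
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward down-projection
    + LAYER_NORM
)

# MLM head: dense transform, LayerNorm, and a per-token output bias
# (decoder weights assumed tied to the word embeddings, so not counted again).
mlm_head = linear(HIDDEN, HIDDEN) + LAYER_NORM + VOCAB_SIZE

total = embeddings + LAYERS * per_layer + mlm_head
print(f"{total:,} parameters, about {total * 4 / 1e6:.1f} MB in float32")
```

Under these assumptions, more than half of the parameters sit in the 128k-entry word embedding table, and the float32 size comes out slightly below the 736,785,579 bytes recorded for `pytorch_model.bin`, which would be consistent with a small amount of serialization overhead.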
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1e528f093db26d5c350a82ab3a6d2797f672ad7f2738d9de1268f45d47f09548
+ size 736785579
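What is committed here is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the content hash, and the size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for this example, not part of any Git LFS tooling.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1e528f093db26d5c350a82ab3a6d2797f672ad7f2738d9de1268f45d47f09548
size 736785579
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["digest"][:12], f"{pointer['size_bytes'] / 2**20:.1f} MiB")
```

The digest and size can be used to verify a downloaded copy of the file, for example by comparing against a locally computed SHA-256 hash.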
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "tokenizer_class": "PreTrainedTokenizerFast"
+ }