---
language:
- en
license: mit
tags:
- generated_from_trainer
metrics:
- accuracy
- code_eval
pipeline_tag: fill-mask
base_model: xlm-roberta-base
model-index:
- name: xlm-roberta-base-finetuned-wikitext2
  results: []
---

# xlm-roberta-base-finetuned-wikitext2

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the Telugu_books dataset from Kaggle.
It achieves the following results on the evaluation set:
- Loss: 2.0384

## Model description

We developed a language model for Telugu using the Telugu_books dataset from the Kaggle platform.
Only a few language models exist for regional languages such as Telugu, Hindi, and Kannada,
so we built a dedicated language model for Telugu.
The model's aim is to predict a Telugu word that has been masked in a given Telugu sentence, using the masked language modeling (MLM) objective of BERT (Bidirectional Encoder Representations from Transformers),
and we achieved state-of-the-art performance on this task.


## Intended uses & limitations

Using this model, we can predict the exact, contextually appropriate word that has been masked in a given Telugu sentence; we achieved state-of-the-art performance on this task.
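
As a rough usage sketch (the repository id below is a placeholder for wherever this model is hosted on the Hub), the fill-mask pipeline from `transformers` returns the top predictions for the masked position:

```python
from transformers import pipeline

# Placeholder model id; replace with the actual Hub path of this model.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base-finetuned-wikitext2")

# XLM-RoBERTa models use "<mask>" as the mask token.
# Illustrative sentence: "నేను పుస్తకం <mask>." ("I <mask> a book.")
predictions = fill_mask("నేను పుస్తకం <mask>.")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```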

## Training and evaluation data

Training data:
The required classes, such as `Trainer` and `TrainingArguments`, are imported from the `transformers` library. After the training arguments and our data are passed in, we train the model with the `train()` method, which takes 1 to 1.5 hours depending on the size of the input data (see the code sketch under Step-5 below).

Evaluation data:
On the Hugging Face model page there is a hosted inference API: we enter a Telugu sentence containing the `<mask>` token and click the Compute button, and the predicted words are displayed with their probabilities. We then compare these predictions with the actual words to evaluate the model (the same check is sketched in code under Step-7 below).

## Training procedure

Step-1: Collecting data
The Telugu dataset is collected from Kaggle. It contains Telugu paragraphs from different books.

Step-2: Pre-processing data
The collected data is pre-processed with several techniques, including splitting long Telugu paragraphs into short sentences.
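
A minimal pre-processing sketch, assuming the Kaggle Telugu_books data is a CSV with a hypothetical `text` column holding the paragraphs (the actual file name, column name, and splitting rules may differ):

```python
import re

import pandas as pd

# Hypothetical file and column names for the Kaggle Telugu_books data.
books = pd.read_csv("telugu_books.csv")

def split_into_sentences(paragraph):
    # Split long paragraphs on sentence-ending punctuation and drop tiny fragments.
    parts = re.split(r"[.?!।]\s*", str(paragraph))
    return [p.strip() for p in parts if len(p.strip()) > 3]

sentences = [
    sentence
    for paragraph in books["text"].dropna()
    for sentence in split_into_sentences(paragraph)
]
```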

Step-3: Connecting to Hugging Face
Hugging Face provides an access token with which we can log in from a notebook, and everything we train afterwards is exported to the platform automatically.
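
In a notebook this is typically done with `notebook_login` from `huggingface_hub` (a sketch; a token created at https://huggingface.co/settings/tokens is pasted into the prompt):

```python
from huggingface_hub import notebook_login

# Prompts for the write-access token generated on the Hugging Face settings page.
notebook_login()
```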

Step-4: Loading the pre-trained model and tokenizer
The pre-trained model and tokenizer from xlm-roberta-base are loaded for training on our Telugu data.
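
A sketch of this step with the `transformers` Auto classes:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
```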

Step-5: Training the model
The required classes, such as `Trainer` and `TrainingArguments`, are imported from the `transformers` library. After the training arguments and our data are passed in, we train the model with the `train()` method, which takes 1 to 1.5 hours depending on the size of the input data.
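
A condensed sketch of this step, assuming `sentences` is the list produced in Step-2 and `tokenizer`/`model` come from Step-4 (the exact argument values are listed under "Training hyperparameters" below; this is not the exact script used):

```python
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# `sentences`, `tokenizer` and `model` are assumed from the earlier steps.
dataset = Dataset.from_dict({"text": sentences}).train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM setup: 15% of tokens are masked dynamically at each step.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-wikitext2",
    evaluation_strategy="epoch",
    push_to_hub=True,  # needed for Step-6
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=data_collator,
)

trainer.train()
```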

Step-6: Pushing the model and tokenizer
The `trainer.push_to_hub()` and `tokenizer.push_to_hub()` methods are used to export the trained model and its tokenizer, which is needed to map words during prediction.
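
Roughly (the repository name is shown as a placeholder):

```python
# Uploads the trained weights, training arguments and an auto-generated model card,
# followed by the tokenizer files, to the Hub repository for this run.
trainer.push_to_hub()
tokenizer.push_to_hub("xlm-roberta-base-finetuned-wikitext2")
```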

Step-7: Testing
On the Hugging Face model page there is a hosted inference API: we enter a Telugu sentence containing the `<mask>` token and click the Compute button, and the predicted words are displayed with their probabilities. We then compare these predictions with the actual words to evaluate the model.
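
The same check can also be scripted; a sketch that hides a known word behind `<mask>` and tests whether it appears among the top predictions (sentence and repository id are placeholders):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base-finetuned-wikitext2")

original_word = "చదువుతున్నాను"          # the word that was masked out
masked_sentence = "నేను పుస్తకం <mask>."

predictions = fill_mask(masked_sentence, top_k=5)
predicted_words = [p["token_str"].strip() for p in predictions]
print("hit" if original_word in predicted_words else "miss", predicted_words)
```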

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
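
These settings map onto a `TrainingArguments` configuration roughly like the following sketch; the optimizer and scheduler entries listed above are the library defaults, so they appear only as comments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-wikitext2",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    num_train_epochs=3.0,
    # lr_scheduler_type="linear" and Adam with betas=(0.9, 0.999), epsilon=1e-08
    # are the TrainingArguments defaults.
)
```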

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 2.4192        | 1.0   | 1250 | 2.1557          |
| 2.2859        | 2.0   | 2500 | 2.0632          |
| 2.2311        | 3.0   | 3750 | 2.0083          |


### Framework versions

- Transformers 4.24.0
- Pytorch 1.12.1+cu113
- Datasets 2.7.1
- Tokenizers 0.13.2