ravikumar1478 committed
Commit 991f574 · 1 Parent(s): 447708c

Update README.md


We developed a language model for Telugu using the Telugu_books dataset from the Kaggle platform.
Only a few language models have been developed for regional languages such as Telugu, Hindi, and Kannada,
so we built a dedicated language model for Telugu.
The model's aim is to predict a masked word in a given Telugu sentence using the Masked Language Modeling
objective of BERT (Bidirectional Encoder Representations from Transformers), and we achieved state-of-the-art performance on this task.

Files changed (1)
  1. README.md +48 -3
README.md CHANGED
@@ -5,6 +5,12 @@ tags:
  model-index:
  - name: xlm-roberta-base-finetuned-wikitext2
    results: []
+ language:
+ - en
+ metrics:
+ - accuracy
+ - code_eval
+ pipeline_tag: text-generation
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -18,11 +24,16 @@ It achieves the following results on the evaluation set:

  ## Model description

- More information needed
+ We developed a language model for Telugu using the Telugu_books dataset from
+ the Kaggle platform. Only a few language models have been developed for
+ regional languages such as Telugu, Hindi, and Kannada, so we built a model
+ dedicated to the Telugu language. The model's aim is to predict a masked word
+ in a given Telugu sentence using the Masked Language Modeling objective of
+ BERT (Bidirectional Encoder Representations from Transformers), and we
+ achieved state-of-the-art performance on this task.
+

  ## Intended uses & limitations

- More information needed
+ Using this model, we can predict the exact, contextually appropriate word
+ that has been masked in a given Telugu sentence.

  ## Training and evaluation data

@@ -30,6 +41,40 @@ More information needed

  ## Training procedure

+ Step-1: Collecting data
+ The Telugu dataset is collected from Kaggle; it contains Telugu paragraphs
+ drawn from different books.
+
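+ A minimal loading sketch (the file name "telugu_books.csv" and the "text"
+ column are assumptions; adjust them to the actual files in the Telugu_books
+ dataset):
+
+ ```python
+ # Load the raw Telugu paragraphs collected from Kaggle.
+ import pandas as pd
+
+ df = pd.read_csv("telugu_books.csv")  # hypothetical file name
+ paragraphs = df["text"].dropna().tolist()  # hypothetical column name
+ print(f"Loaded {len(paragraphs)} paragraphs")
+ ```
+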
+ Step-2: Pre-processing data
+ The collected data is pre-processed with several techniques, including
+ splitting long Telugu sentences into shorter ones.
+
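+ One illustrative way to do the sentence splitting (the exact pre-processing
+ used for this model is not documented, so this is an assumption):
+
+ ```python
+ # Split long Telugu paragraphs into shorter sentences on common
+ # sentence-ending punctuation, including the danda ("।").
+ import re
+
+ def split_sentences(paragraph: str) -> list[str]:
+     parts = re.split(r"[.!?।]+", paragraph)
+     return [p.strip() for p in parts if p.strip()]
+
+ # `paragraphs` is the list loaded in the Step-1 sketch above.
+ sentences = [s for p in paragraphs for s in split_sentences(p)]
+ ```
+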
+ Step-3: Connecting to Hugging Face
+ Hugging Face provides an access token with which we can log in from a
+ notebook, after which the artifacts we produce are exported to the platform
+ automatically.
+
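+ This uses the standard Hub login helper:
+
+ ```python
+ # Prompts for an access token created at https://huggingface.co/settings/tokens.
+ from huggingface_hub import notebook_login
+
+ notebook_login()
+ ```
+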
+ Step-4: Loading the pre-trained model and tokenizer
+ The pre-trained model and tokenizer from xlm-roberta-base are loaded for
+ training on our Telugu data.
+
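+ For example:
+
+ ```python
+ # Load the xlm-roberta-base checkpoint and its tokenizer for masked LM training.
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ checkpoint = "xlm-roberta-base"
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForMaskedLM.from_pretrained(checkpoint)
+ ```
+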
+ Step-5: Training the model
+ The required classes, Trainer and TrainingArguments, are imported from the
+ Transformers library. After passing the training arguments together with our
+ data, we train the model using the train() method, which takes 1 to 1½ hours
+ depending on the size of the input data.
+
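+ A minimal sketch of this setup (`tokenized_train` and `tokenized_eval` stand
+ in for the tokenized Telugu splits; the argument values shown are assumptions,
+ and the actual hyperparameters are listed below):
+
+ ```python
+ from transformers import (
+     DataCollatorForLanguageModeling,
+     Trainer,
+     TrainingArguments,
+ )
+
+ # Randomly masks 15% of tokens in each batch for masked language modeling.
+ data_collator = DataCollatorForLanguageModeling(
+     tokenizer=tokenizer, mlm=True, mlm_probability=0.15
+ )
+
+ training_args = TrainingArguments(
+     output_dir="xlm-roberta-base-finetuned-wikitext2",
+     evaluation_strategy="epoch",
+     push_to_hub=True,
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=tokenized_train,  # assumed tokenized training split
+     eval_dataset=tokenized_eval,    # assumed tokenized evaluation split
+     data_collator=data_collator,
+ )
+
+ trainer.train()
+ ```
+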
+ Step-6: Pushing the model and tokenizer
+ The trainer.push_to_hub() and tokenizer.push_to_hub() methods are then used
+ to export the trained model and its tokenizer, which is needed to map words
+ during prediction.
+
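+ For example (the repository name is assumed from this model card):
+
+ ```python
+ # Upload the fine-tuned model and the tokenizer to the Hub.
+ trainer.push_to_hub()
+ tokenizer.push_to_hub("xlm-roberta-base-finetuned-wikitext2")
+ ```
+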
+ Step-7: Testing
+ On the Hugging Face model page there is a hosted inference API: we give a
+ Telugu sentence containing the <mask> token as input and click the Compute
+ button, and the predicted words are displayed with their probabilities. We
+ then compare the predicted words with the actual words to evaluate the model.
+
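+ The same check can be run locally with a fill-mask pipeline (the repository
+ id and the example sentence here are placeholders):
+
+ ```python
+ from transformers import pipeline
+
+ fill_mask = pipeline(
+     "fill-mask",
+     model="ravikumar1478/xlm-roberta-base-finetuned-wikitext2",
+ )
+
+ # Masked Telugu sentence; the model proposes candidates for the missing word.
+ for pred in fill_mask("నేను పుస్తకం <mask>"):
+     print(pred["token_str"], round(pred["score"], 3))
+ ```
+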
  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -55,4 +100,4 @@ The following hyperparameters were used during training:
  - Transformers 4.24.0
  - Pytorch 1.12.1+cu113
  - Datasets 2.7.1
- - Tokenizers 0.13.2
+ - Tokenizers 0.13.2