rahular committed on
Commit
11ef4a6
·
1 Parent(s): be9d155

Update README.md

Files changed (1)
  1. README.md +23 -34
README.md CHANGED
@@ -4,24 +4,16 @@
4
  {}
5
  ---
6
 
7
- # Varta T5 model
8
-
9
- <!-- Provide a quick summary of what the model is/does. -->
10
 
11
  ## Model Description
12
- Varta T5 is a model pre-trained on full training set from Varta on English and 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) using span corruption and gap-sentence generation as objectives from scratch.
13
- Varta is a large-scale headline-generation dataset for Indic languages, including 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources.
14
-
15
-
16
- The dataset and the model were introduced in [this paper](https://arxiv.org/pdf/2305.05858.pdf). The code was released in [this repository](https://github.com/rahular/varta). The data was released in [this bucket](https://console.cloud.google.com/storage/browser/varta-eu/data-release).
17
 
 
 
18
 
19
  ## Uses
20
-
21
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
22
-
23
- You can use the raw model for language modelling, but it's mostly intended to be fine-tuned on a downstream task. <br>
24
-
25
 
26
  Note that the text-to-text framework allows us to use the same model on any NLP task, including text generation tasks (e.g., machine translation, document summarization, question answering), and classification tasks (e.g., sentiment analysis).
27
 
@@ -39,7 +31,7 @@ This work is mainly dedicated to the curation of a new multilingual dataset for
39
 
40
  ## How to Get Started with the Model
41
 
42
- You can use this model directly with a pipeline for language modeling.
43
 
44
  ```python
45
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
@@ -54,37 +46,34 @@ model = AutoModelForSeq2SeqLM.from_pretrained("rahular/varta-t5")
54
 
55
  ### Training Data
56
  Varta contains 41.8 million high-quality news articles in 14 Indic languages and English.
57
- With 34.5 million non-English article-headline pairs, it is the largest headline-generation dataset of its kind.
58
 
59
  ### Pretraining
60
- We use span corruption and gap-sentence generation as the pretraining objectives.
61
- Both objectives are sampled uniformly during pretraining.
62
- Span corruption is similar to masked language modeling except that instead of masking random tokens, we mask spans of tokens with an average length of 3.
63
- In gap-sentence prediction, whole sentences are masked instead of spans. We follow the original work, and select sentences based on their `importance'.
64
- Rouge-1 F1-score between the sentence and the document is used as a proxy for importance.
65
- We use 0.15 and 0.2 as the masking ratios for span corruption and gap-sentence generation, respectively.
66
 
67
  Since data sizes across languages in Varta vary from 1.5K (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
68
 
69
- We pretrain Varta-T5 using the T5 1.1 base architecture with 12 encoder and decoder layers.
70
- We train with maximum sequence lengths of 512 and 256 for the encoder and decoder respectively.
71
- We use 12 attention heads with an embedding dimension of 768 and a feed-forward width of 2048.
72
- We use a 128K sentencepiece vocabulary.
73
- In total, the model has 395M parameters.
74
- The model is trained with Adafactor optimizer with a warm-up of 10K steps.
75
- We use an initial learning rate of 1e-3 and use square root decay till we reach 2M steps.
76
- We use an effective batch size of 256 and train the model on TPU v3-8 chips.
77
- The model takes 11 days to train.
78
 
79
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
80
 
81
  ### Evaluation Results
82
- To come.
83
 
84
  ## Citation
85
-
86
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
87
-
88
  ```
89
  @misc{aralikatte2023varta,
90
  title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},
 
4
  {}
5
  ---
6
 
7
+ # Varta-T5
 
 
8
 
9
  ## Model Description
10
+ Varta-T5 is a model pre-trained on the `full` training set of Varta in 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) and English, using span corruption and gap-sentence generation as objectives.
 
 
 
 
11
 
12
+ Varta is a large-scale news corpus for Indic languages, including 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources.
13
+ The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2305.05858). The code is released in [this repository](https://github.com/rahular/varta). The data is released in [this bucket](https://console.cloud.google.com/storage/browser/varta-eu/data-release).
14
 
15
  ## Uses
16
+ You can use this model for span in-filling, but it's mostly intended to be fine-tuned on a downstream task.
 
 
 
 
17
 
18
  Note that the text-to-text framework allows us to use the same model on any NLP task, including text generation tasks (e.g., machine translation, document summarization, question answering), and classification tasks (e.g., sentiment analysis).
19
 
 
31
 
32
  ## How to Get Started with the Model
33
 
34
+ You can use this model directly for span in-filling.
35
 
36
  ```python
37
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 
46
 
47
  ### Training Data
48
  Varta contains 41.8 million high-quality news articles in 14 Indic languages and English.
49
+ With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind.
50
 
51
  ### Pretraining
52
+ - We use span corruption and gap-sentence generation as the pretraining objectives.
53
+ - Both objectives are sampled uniformly during pretraining.
54
+ - Span corruption is similar to masked language modeling except that instead of masking random tokens, we mask spans of tokens with an average length of 3.
55
+ - In gap-sentence generation, whole sentences are masked instead of spans. We follow [the original work](https://arxiv.org/abs/1912.08777) and select sentences based on their "importance".
56
+ - The ROUGE-1 F1 score between the sentence and the document is used as a proxy for importance.
57
+ - We use 0.15 and 0.2 as the masking ratios for span corruption and gap-sentence generation, respectively.
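
To make the span-corruption objective above concrete, here is a minimal, illustrative sketch in plain Python. It is not the preprocessing code used for Varta-T5 (see the linked repository for that). The function names `random_partition` and `span_corrupt` are hypothetical, the input is a pre-split token list rather than SentencePiece ids, and the `<extra_id_N>` sentinel naming is assumed from the usual T5 convention. The defaults mirror the masking ratio (0.15) and average span length (3) stated above.

```python
import random


def random_partition(total, parts, rng):
    """Split `total` into `parts` positive integers chosen at random."""
    cuts = sorted(rng.sample(range(1, total), parts - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption; returns (inputs, targets)."""
    rng = random.Random(seed)
    num_noise = max(1, round(len(tokens) * noise_density))
    num_spans = max(1, round(num_noise / mean_span_len))
    span_lens = random_partition(num_noise, num_spans, rng)

    # Unmasked tokens fill the gaps around the spans. Interior gaps get at
    # least one token so that two masked spans never touch.
    gaps = [0] + [1] * (num_spans - 1) + [0]
    spare = len(tokens) - num_noise - (num_spans - 1)
    if spare < 0:
        raise ValueError("sequence too short for the requested masking")
    for _ in range(spare):
        gaps[rng.randrange(len(gaps))] += 1

    inputs, targets, pos = [], [], 0
    for i, span_len in enumerate(span_lens):
        # Copy the unmasked gap, then replace the span with one sentinel.
        inputs.extend(tokens[pos:pos + gaps[i]])
        pos += gaps[i]
        sentinel = f"<extra_id_{i}>"
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[pos:pos + span_len])
        pos += span_len
    inputs.extend(tokens[pos:])  # trailing unmasked tokens
    return inputs, targets


if __name__ == "__main__":
    words = [f"w{i}" for i in range(40)]
    corrupted, target = span_corrupt(words, seed=1)
    print(" ".join(corrupted))
    print(" ".join(target))
```

The encoder sees the corrupted sequence and the decoder is trained to emit the target sequence, so each sentinel in the input is paired with the span it replaced. The final closing sentinel that T5 appends to targets is left out here for brevity.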
58
 
59
  Since data sizes across languages in Varta vary from 1.5K (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
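
As a rough illustration of temperature-based sampling, the sketch below uses one common formulation in which each language is sampled with probability proportional to its share of the data raised to the power `1 / temperature`. The function name `temperature_sampling_probs` is hypothetical, and the temperature value used in the example is an assumption: the README does not state which value was used. Only the two article counts quoted above are included.

```python
def temperature_sampling_probs(sizes, temperature):
    """Return per-language sampling probabilities p_i proportional to (n_i / N) ** (1 / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()
    }
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}


if __name__ == "__main__":
    # Article counts quoted in the text: Hindi 14.4M, Bhojpuri 1.5K.
    sizes = {"hi": 14_400_000, "bho": 1_500}
    for temperature in (1.0, 5.0):  # 5.0 is an illustrative value only
        print(temperature, temperature_sampling_probs(sizes, temperature))
```

With `temperature=1.0` the probabilities match the raw data proportions. Larger temperatures flatten the distribution, which upsamples low-resource languages such as Bhojpuri.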
60
 
61
+ - We pretrain Varta-T5 using the T5 1.1 base architecture with 12 encoder and decoder layers.
62
+ - We train with maximum sequence lengths of 512 and 256 for the encoder and decoder, respectively.
63
+ - We use 12 attention heads with an embedding dimension of 768 and a feed-forward width of 2048.
64
+ - We use a 128K SentencePiece vocabulary.
65
+ - In total, the model has 395M parameters.
66
+ - The model is trained with the Adafactor optimizer with a warm-up of 10K steps.
67
+ - We use an initial learning rate of 1e-3 with square-root decay until we reach 2M steps.
68
+ - We use an effective batch size of 256 and train the model on TPU v3-8 chips.
69
+ - The model takes 11 days to train.
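
The learning-rate schedule described in the list above could look roughly like the following sketch. The exact schedule is not spelled out here, so two points are assumptions: the warm-up is taken to be linear, and the decay is taken to be the usual inverse-square-root form anchored at the end of warm-up. The function name `learning_rate` is hypothetical. The peak rate (1e-3) and warm-up length (10K steps) come from the text.

```python
import math


def learning_rate(step, peak_lr=1e-3, warmup_steps=10_000):
    """Linear warm-up to `peak_lr`, then inverse-square-root decay."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # After warm-up the rate shrinks proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    for step in (1_000, 10_000, 100_000, 2_000_000):
        print(step, learning_rate(step))
```

Under these assumptions the rate peaks at the end of warm-up and keeps decreasing until training stops at 2M steps.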
70
 
71
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
72
 
73
  ### Evaluation Results
74
+ Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).
75
 
76
  ## Citation
 
 
 
77
  ```
78
  @misc{aralikatte2023varta,
79
  title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},