Update README.md
README.md (changed)
{}
---

# Varta-T5

## Model Description

Varta-T5 is a model pre-trained on the `full` training set of Varta in 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) and English, using span corruption and gap-sentence generation as the pre-training objectives.

Varta is a large-scale news corpus for Indic languages, including 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources.

The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2305.05858). The code is released in [this repository](https://github.com/rahular/varta). The data is released in [this bucket](https://console.cloud.google.com/storage/browser/varta-eu/data-release).

## Uses

You can use this model for causal language modeling, but it's mostly intended to be fine-tuned on a downstream task.

Note that the text-to-text framework allows us to use the same model on any NLP task, including text generation tasks (e.g., machine translation, document summarization, question answering), and classification tasks (e.g., sentiment analysis).
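
As a purely illustrative sketch of what "text-to-text" means here (the prefixes and label strings below are hypothetical, not prompts the model was trained with), every task reduces to mapping an input string to an output string:

```python
# Hypothetical text-to-text framings; the prefixes and label strings are
# illustrative, and Varta-T5 needs fine-tuning before such mappings are reliable.
examples = [
    ("summarize: <full news article>", "<headline>"),        # generation-style task
    ("sentiment: The new phone is excellent.", "positive"),  # classification-style task
]
for source_text, target_text in examples:
    print(source_text, "->", target_text)
```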

## How to Get Started with the Model

You can use this model directly for span in-filling.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rahular/varta-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("rahular/varta-t5")
```
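
A minimal span in-filling sketch building on the snippet above. It assumes the tokenizer exposes the usual T5 `<extra_id_N>` sentinel tokens; the input sentence and generation settings are illustrative, not recommended defaults:

```python
# Illustrative span in-filling: <extra_id_0> marks the span the model should fill.
text = "The new railway line will connect <extra_id_0> with the capital city."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```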

### Training Data

Varta contains 41.8 million high-quality news articles in 14 Indic languages and English. With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind.

### Pretraining

- We use span corruption and gap-sentence generation as the pretraining objectives; the two objectives are sampled uniformly during pretraining.
- Span corruption is similar to masked language modeling, except that instead of masking random tokens we mask spans of tokens with an average length of 3.
- In gap-sentence generation, whole sentences are masked instead of spans. We follow [the original work](https://arxiv.org/abs/1912.08777) and select sentences based on their "importance": the Rouge-1 F1 score between a sentence and the rest of the document is used as a proxy for importance (a toy sketch of both objectives follows this list).
- We use masking ratios of 0.15 and 0.2 for span corruption and gap-sentence generation, respectively.
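
The following self-contained toy sketch illustrates the two objectives described above. The sentence splitter, the set-based Rouge-1 approximation, and the `<extra_id_N>` / `<mask_sent>` markers are simplifications for illustration, not the released pretraining implementation:

```python
import random
import re

def rouge1_f1(candidate, reference):
    """Set-based unigram-overlap F1, a simplified stand-in for Rouge-1 F1."""
    cand, ref = set(candidate), set(reference)
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(document, masking_ratio=0.2):
    """Gap-sentence generation: mask the ~20% of sentences that score highest
    against the rest of the document and use them as the generation target."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((rouge1_f1(sentence.lower().split(), rest.lower().split()), i))
    n_mask = max(1, round(masking_ratio * len(sentences)))
    masked = {i for _, i in sorted(scores, reverse=True)[:n_mask]}
    source = " ".join("<mask_sent>" if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target

def corrupt_spans(tokens, masking_ratio=0.15, mean_span_length=3, seed=0):
    """Span corruption: replace ~15% of the tokens with sentinel markers,
    masking contiguous spans whose lengths average about 3 tokens."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(masking_ratio * len(tokens)))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        length = max(1, round(rng.gauss(mean_span_length, 1)))
        masked.update(range(start, min(start + length, len(tokens))))
    source, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    return " ".join(source), " ".join(target)

# Toy usage on whitespace tokens.
src, tgt = corrupt_spans("the prime minister announced a new policy on education today".split())
```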

Since data sizes across languages in Varta vary from 1.5K (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
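
A sketch of the standard temperature-based sampling rule, which raises the empirical language proportions to the power 1/T and renormalizes; the temperature value below is illustrative, since the card does not state the value used for Varta-T5:

```python
def temperature_sampling_weights(articles_per_language, temperature=5.0):
    """Exponentiate empirical language proportions by 1/T and renormalize.
    T > 1 flattens the distribution, i.e. upsamples low-resource languages.
    The temperature here is illustrative, not the one used for Varta-T5."""
    total = sum(articles_per_language.values())
    scaled = {lang: (count / total) ** (1.0 / temperature)
              for lang, count in articles_per_language.items()}
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}

# Toy example with the two extremes mentioned above (counts are approximate).
print(temperature_sampling_weights({"bho": 1_500, "hi": 14_400_000}))
```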

- We pretrain Varta-T5 using the T5 1.1 base architecture with 12 encoder and 12 decoder layers (a configuration sketch follows this list).
- We train with maximum sequence lengths of 512 and 256 for the encoder and decoder, respectively.
- We use 12 attention heads with an embedding dimension of 768 and a feed-forward width of 2048.
- We use a 128K SentencePiece vocabulary; in total, the model has 395M parameters.
- The model is trained with the Adafactor optimizer with a warm-up of 10K steps.
- We use an initial learning rate of 1e-3 with square-root decay until we reach 2M steps.
- We use an effective batch size of 256 and train the model on TPU v3-8 chips; training takes 11 days.
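
For orientation, the hyperparameters above roughly correspond to a Hugging Face `T5Config` like the sketch below. The configuration shipped with `rahular/varta-t5` is authoritative: the exact vocabulary size is only given as "128K" above, and the gated-GELU activation is assumed from the T5 1.1 recipe:

```python
from transformers import T5Config

# Sketch of the architecture described above; values are taken from the card,
# except vocab_size, which is rounded (the card only says "128K").
config = T5Config(
    vocab_size=128_000,              # ~128K SentencePiece vocabulary
    d_model=768,                     # embedding dimension
    d_ff=2048,                       # feed-forward width
    num_layers=12,                   # encoder layers
    num_decoder_layers=12,           # decoder layers
    num_heads=12,                    # attention heads
    feed_forward_proj="gated-gelu",  # assumed from the T5 1.1 recipe
)
```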

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

### Evaluation Results

Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).

## Citation

```
@misc{aralikatte2023varta,
      title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},