---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/ccmatrix
tags:
- t5
- translation
- seq2seq
pipeline_tag: translation
widget:
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity. It is used--and I wish to emphasize the fact again--without any
    moral significance: and this is so far true that the rottenness I speak of
    is most apparent to me precisely in those quarters where there has been
    most aspiration, hitherto, toward 'virtue' and 'godliness.'
- text: >-
    For once Fletcher’s sedate features showed a certain lightness. 'I believe
    I will linger awhile longer.' He indicated a holoscreen which was
    displaying the image from an external camera. Cloud-splattered landscape
    was rolling past, pastel greens, browns, and blues illuminated by Duke’s
    radiance. 'It is not often a mortal man is permitted to view a world over
    the shoulder of angels.'
license: apache-2.0
---
# t5-base-36L-ccmatrix-multi

A t5-base-36L-dutch-english-cased model fine-tuned on Dutch-to-English and English-to-Dutch translation with the CCMatrix dataset. Evaluation metrics for this model are listed in the Translation models section below.

This t5-eff model has 728M parameters.
It was pre-trained on the mc4_nl_cleaned dataset (config large_en_nl)
for 1 epoch with a duration of 17d15h,
with a sequence length of 512, a batch size of 512 and 212963 total steps.
Pre-training evaluation loss and accuracy are 1,05 and 0,76.
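
For reference, a minimal usage sketch is shown below. The repository id `yhavinga/t5-base-36L-ccmatrix-multi` is an assumption based on this card; the task prefix matches the `source_prefix` values listed in the translation table further down.

```python
# Minimal translation sketch. The repository id below is an assumption based on
# this card; adjust it to the actual checkpoint location.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yhavinga/t5-base-36L-ccmatrix-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The *-multi checkpoints expect a task prefix in front of the source sentence,
# e.g. "translate English to Dutch: " or "translate Dutch to English: ".
text = "translate English to Dutch: It is not often a mortal man is permitted to view a world over the shoulder of angels."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```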
## Tokenizer

The model uses a cased SentencePiece tokenizer configured with the Nmt, NFKC and "replace multi-space with single-space" normalizers
and has a vocabulary of 32003 tokens.
It was trained on Dutch and English with scripts from the Hugging Face Transformers Flax examples.
See ./raw/main/tokenizer.json for details.
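
A quick way to verify the vocabulary size and normalizer chain is to load the tokenizer with 🤗 transformers; a minimal sketch, assuming the same repository id as above:

```python
# Sketch: inspect the tokenizer described above (repository id is an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-36L-ccmatrix-multi")
print(len(tokenizer))                          # vocabulary size, expected 32003
print(tokenizer.tokenize("Een voorbeeldzin in het Nederlands."))
print(tokenizer.backend_tokenizer.normalizer)  # Nmt, NFKC, multi-space replace chain
```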
## Dataset

All models listed below were trained on cleaned Dutch mC4, which is the original mC4 with the following filters applied (illustrated by the sketch below):
- Documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences containing a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
The Dutch and English models are trained on a 50/50% mix of Dutch mC4 and English C4.
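
The sketch below restates the document-level cleaning rules in code; the helper names and the naive sentence splitting are illustrative only and are not the actual cleaning scripts.

```python
# Illustrative re-statement of the mC4 cleaning rules listed above; not the
# actual cleaning code used for mc4_nl_cleaned.
BAD_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]
BAD_WORDS = set()  # fill with the Dutch and English "bad words" lists


def keep_sentence(sentence: str) -> bool:
    words = sentence.split()
    if len(words) < 3:                          # drop sentences with fewer than 3 words
        return False
    return all(len(w) <= 1000 for w in words)   # drop sentences with 1000+ character words


def keep_document(text: str) -> bool:
    lower = text.lower()
    if any(phrase in lower for phrase in BAD_PHRASES):
        return False
    if any(word in BAD_WORDS for word in lower.split()):
        return False
    sentences = [s for s in text.split(".") if keep_sentence(s)]
    return len(sentences) >= 5                  # drop documents with fewer than 5 sentences
```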
## Models

Three types of models have been trained. t5-base-dutch is the only model with an original T5 config.
The other model types, t5-v1.1 and t5-eff, use gated-gelu instead of relu as the activation function,
and were trained with a dropout of 0.0 unless training would diverge (t5-v1.1-large-dutch-cased).
The t5-eff models differ mainly in their number of layers. The table below lists
the dimensions of these models. Note that "efficient" is a misnomer for models with few layers,
e.g. t5-xl-4L-dutch-english-cased, which is not efficient and is one of the worst models on downstream summarization.
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|---|---|---|---|---|---|---|---|---|---|---|---|
type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
total steps | 527500 | 1014525 | 1210154 | 2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d 19h | 3d 23h |
optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
warmup | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 5000.0 | 20000.0 | 2500.0 | 1000.0 | 1500.0 | 1500.0 |
eval loss | 1,38 | 1,20 | 0,96 | 1,07 | 1,11 | 1,13 | 1,18 | 1,27 | 1,05 | 1,3019 | 1,15 |
eval acc | 0,70 | 0,73 | 0,78 | 0,76 | 0,75 | 0,74 | 0,74 | 0,72 | 0,76 | 0,71 | 0,74 |
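
The architecture fields above can also be read directly from each checkpoint's config; a short sketch, assuming the checkpoints are published under the yhavinga namespace:

```python
# Sketch: read a few architecture fields from the configs of two checkpoints.
# The repository ids are assumptions based on the model names in the table.
from transformers import AutoConfig

for name in ["yhavinga/t5-base-dutch", "yhavinga/t5-base-36L-dutch-english-cased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.d_model, cfg.d_ff, cfg.num_layers, cfg.num_heads, cfg.feed_forward_proj)
```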
## Evaluation on summarization
The models below have been evaluated on the summarization downstream task on 50K samples from the CNN Dailymail dataset. All models were fine-tuned with the AdamW optimizer with a batch size of 128 and constant learning rate of 1e-3 after a warmup of 64 steps, with a label smoothing factor of 0.05. Article and summary token lengths were set to 1024 and 142.
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-large-8l-dutch-english-cased | mt5-base |
|---|---|---|---|---|---|---|---|---|---|---|
rouge1 | 33.0313 | 33.8432 | 34.0906 | 33.1116 | 34.6465 | 34.376 | 30.8983 | 35.0931 | 33.9293 | 33.6466 |
rouge2 | 12.9452 | 13.7706 | 13.6203 | 13.275 | 13.8525 | 13.8939 | 11.6005 | 14.3823 | 13.6274 | 13.1085 |
rougeL | 23.7204 | 24.5642 | 24.7304 | 24.3561 | 24.721 | 25.2496 | 22.6536 | 25.3213 | 24.5595 | 23.909 |
rougeLsum | 29.842 | 30.7783 | 31.1438 | 30.0548 | 31.6104 | 31.3838 | 27.8467 | 32.3526 | 30.952 | 30.5054 |
gen_len | 90.488 | 91.832 | 92.122 | 89.583 | 98.333 | 90.442 | 92.342 | 96.832 | 95.057 | 96.312 |
num parameters | 223M | 248M | 248M | 248M | 248M | 250M | 585M | 729M | 335M | 582M |
samples_per_second | 3.195 | 3.039 | 3.0 | 3.216 | 2.974 | 1.594 | 2.47 | 0.623 | 3.087 | 1.201 |
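
For orientation, the fine-tuning setup described above maps roughly onto 🤗 Seq2SeqTrainingArguments as sketched below; this is my mapping of the description (AdamW is the default optimizer), not the exact script that produced the table.

```python
# Sketch of the summarization fine-tuning settings described above, expressed as
# Seq2SeqTrainingArguments. Not the exact training script used for the table.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="cnn-dailymail-finetune",
    per_device_train_batch_size=128,
    learning_rate=1e-3,                        # constant learning rate of 1e-3 ...
    lr_scheduler_type="constant_with_warmup",  # ... after a warmup of 64 steps
    warmup_steps=64,
    label_smoothing_factor=0.05,
    predict_with_generate=True,
    generation_max_length=142,                 # summary token length
)
# Article inputs are truncated to 1024 tokens in the tokenization/preprocessing step.
```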
## Translation models

The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset.
The models whose names end in *-multi support both translation directions. The models are trained on CCMatrix only. Since this is
a very large dataset, with over 100M Dutch-English sentence pairs, the models were trained on only a fraction of it;
refer to the table below for the number of total steps and duration. Evaluation is performed on a CCMatrix section that was not trained on, and also
on Tatoeba and Opus Books. The _bp rows list the brevity penalty. The avg_bleu score is the BLEU score
averaged over all three evaluation datasets.
The translation metrics are listed in the table below:
| | t5-base-36L-ccmatrix-en-nl | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
|---|---|---|---|---|---|
id | 0 | 14 | 15 | 16 | 20 |
source_lang | en | en | nl | en | nl |
target_lang | nl | nl | en | nl | en |
source_prefix | translate English to Dutch: | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
tatoeba_bp | 0.9897614370103832 | 0.9736173618072754 | 0.943521164106552 | 0.9760983304454847 | 0.9406676405486575 |
ccmatrix_bp | 0.9590750786190209 | 0.9536276245543676 | 0.9635673583308255 | 0.9517934939463099 | 0.9585648049711814 |
opus_books_bp | 0.7478011343203491 | 0.7950194726093107 | 0.9362852511299413 | 0.770498474692027 | 0.8870675076932444 |
tatoeba_score | 50.63006965176505 | 46.580601850286214 | 52.82030981131822 | 46.419809813946046 | 51.67887417355214 |
ccmatrix_score | 60.33227938980884 | 56.81297258845844 | 62.836646082246254 | 57.404319674892406 | 63.08633155239932 |
opus_books_score | 10.405013868050663 | 13.477997378535864 | 24.93113308798125 | 12.927244801365507 | 23.418552148252047 |
avg_bleu | 40.455787636541515 | 38.95719060576017 | 46.86269632718191 | 38.91712476340132 | 46.0612526247345 |
total steps | 78125 | 390625 | 390625 | 390625 | 390625 |
duration | 14h | 101h | 101h | 74h | 74h |
num_parameters | 728928000 | 728928000 | 728928000 | 249991680 | 249991680 |
label_smoothing_factor | 0.09 | 0.15 | 0.15 | 0.1 | 0.1 |
learning_rate | 0.0001 | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
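
The `_bp` and `_score` rows above can be reproduced with sacrebleu-style corpus BLEU; a minimal sketch with placeholder data (not the actual evaluation script):

```python
# Sketch: corpus BLEU and brevity penalty per evaluation set with sacrebleu,
# plus the average over the three sets (avg_bleu). Placeholder data only.
import sacrebleu

eval_sets = {
    "tatoeba": (["model output ..."], [["reference translation ..."]]),
    "ccmatrix": (["model output ..."], [["reference translation ..."]]),
    "opus_books": (["model output ..."], [["reference translation ..."]]),
}

scores = {}
for name, (hypotheses, references) in eval_sets.items():
    result = sacrebleu.corpus_bleu(hypotheses, references)
    scores[name] = result.score
    print(f"{name}_score = {result.score:.2f}, {name}_bp = {result.bp:.4f}")

print("avg_bleu =", sum(scores.values()) / len(scores))
```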
## Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The Hugging Face 🤗 ecosystem was also instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would have completed this project otherwise. The following repositories were helpful in setting up the TPU-VM and getting an idea of sensible hyper-parameters for training gpt2 from scratch.
Created by Yeb Havinga