yhavinga
/

t5-v1.1-base-dutch-cnn-test

@@ -1,25 +1,77 @@
 ---
 tags:
-- pytorch
-- t5
 - seq2seq
-- summarization
-language:
-- dutch
-datasets:
-- cnn_dm_nl
-widget:
-- text: "(CNN) Skywatchers in West-Noord-Amerika zijn in voor een traktatie: een bijna vijf minuten totale maansverduistering vanmorgen. Hier is hoe het zich ontvouwt:. Het begon om 3:16 a.m. Pacific Daylight Tijd, toen de maan begon te bewegen in de schaduw van de Aarde. Voor het volgende uur en 45 minuten, die schaduw zal bewegen over de maan en verzwolgen het om 4:58 a.m. Pacific Time. De totale verduistering zal slechts vier minuten en 43 seconden duren, en NASA zegt dat maakt het de kortste van de eeuw. Kijken live op NASA TV. Terwijl mensen ten westen van de Mississippi River zal het beste uitzicht hebben, ten minste een gedeeltelijke verduistering zal zichtbaar zijn over de hele natie. Maar zonsopgang zal de show te onderbreken op de Oostkust. Delen van Zuid-Amerika, India, China en China Een maansverduistering gebeurt wanneer de zon, de aarde en de maan een rechte lijn vormen in de ruimte, met de aarde in het midden. De zon schijnt op de Aarde en creëert een schaduw. Als de maan dieper in die schaduw beweegt, lijkt het donker te worden en lijkt zelfs een roodachtige kleur te zijn. Waarom rood? Omdat de atmosfeer van de Aarde het grootste deel van het blauwe licht filtert. Sommige mensen hebben het effect van de \"bloedmaan\" bijgenaamd. NASA zegt dat maansverduisteringen meestal ten minste twee keer per jaar plaatsvinden, maar deze verduistering is de derde in een reeks van vier op een rij, bekend als een \"tetrad.\" De eerste was op 15 april 2014. De tweede was in september 2014, de volgende is zaterdag en er zal er een meer zijn, op 28 september. Als je meer wilt weten over de verduistering, NASA astronoom Mitzi Adam. Deel uw foto's met CNN iReport."
-- text: "(CNN) Filipino's worden gewaarschuwd om op wacht te staan voor flash overstromingen en aardverschuivingen als tropische storm Maysak benaderde de Aziatische eiland natie zaterdag. Slechts een paar dagen geleden, Maysak kreeg super tyfoon status dankzij zijn aanhoudende 150 km/h winden. Het heeft sindsdien verloren veel stoom als het naar het westen in de Stille Oceaan heeft gedraaid. Het is nu geclassificeerd als een tropische storm, volgens de Filipijnse nationale weerdienst, die noemt het een andere naam, Chedeng. Het heeft stabiele winden van meer dan 70 km/h (115 km/h) en gusts tot 90 km/h vanaf 17.00 uur (5 uur ET) Zaterdag. Toch, dat betekent niet dat Maysak zal geen pak een wallop. Autoriteiten nam preventieve stappen om mensen veilig te houden zoals barring outdoor activiteiten zoals zwemmen, surfen, di. Gabriel Llave, een ramp ambtenaar, vertelde PNA dat toeristen die aankomen zaterdag in en rond de kustplaats van Aurora \"zal niet worden geaccepteerd door de eigenaren van hotels, resorts, herbergen en dergelijke... en zal worden geadviseerd om terug te keren naar hun respectievelijke plaatsen.\" Aldczar Aurelio, een meteoroloog met de Filippijnse Atmosferische, Geofysische en Astronomische Diensten Administratie (PAGASA), zei dat de storm was gecentreerd 200 mijl ten zuidwesten van de provincie Aurora vanaf 5 uur (5 uur ET) en richting het westen op een 12.5 mph clip. Het is verwacht dat landval zondagochtend maken op de zuidoostelijke kust van de provincie Isabela en zijn uit de Filippijnen tegen maandag. Ahead van de storm. Isabela Gov. Faustino Dry III waarschuwde zaterdag dat bewoners moet handelen als deze zal maken landfall zondagochtend op de zuidoostelijke kust van de provincie Isabela en zijn uit de Filippijnen voor maandag."
 ---
-# Work in progress. Jan 2022.
-# A collection of Dutch T5 models
-* Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
-* Continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
-* Using improved training script - no more exceptions during training, so no restarting required.
-* All models trained with tensorboard metrics.
-* Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!

 ---
+language:
+- nl
+datasets:
+- yhavinga/mc4_nl_cleaned
+- ml6team/cnn_dailymail_nl
 tags:
 - seq2seq
+- lm-head
+license: apache-2.0
+inference: false
 ---
+# T5-base finetune for CNN news summarization in Dutch 🇳🇱
+This model is [t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) finetuned on [CNN Dailymail NL](https://huggingface.co/datasets/ml6team/cnn_dailymail_nl)
+The inference widget on the right has been turned off. For a demo of the Dutch CNN summarization models, head over to the
+Hugging Face Spaces for the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
+## Tokenizer
+* Tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface
+  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
+## Dataset
+All models listed below are trained on of the `full` configuration (39B tokens) of
+[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+which is the original mC4, except
+  * Documents that contained words from a selection of the Dutch and English [List of Dirty Naught Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+  * Sentences with less than 3 words are removed
+  * Sentences with a word of more than 1000 characters are removed
+  * Documents with less than 5 sentences are removed
+  * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+    "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
+## Models
+* The first model, `t5-base-dutch` is a re-training of the Dutch T5 base v1.0 model trained during the Flax/Jax community
+  week. With training complete, accuracy was improved from 0,64 to 0,70.
+* The second two models are a uncased and cased version of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
+  with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the
+  base models are trained with a dropout of 0.0. For fine-tuning it is intended to set this back to 0.1.
+* The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
+  Without dropout regularization, the training would diverge at a certain point. With dropout training went better,
+  be it much slower than training the t5-model. At some point convergance was too slow to warrant further training.
+  The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning the cased
+  base model is probably the better choice.
+|                            | model   | train seq len | acc      | loss     | batch size | epochs | steps   | dropout | optim     | lr   | duration |
+|----------------------------|---------|---------------|----------|----------|------------|--------|---------|---------|-----------|------|----------|
+| t5-base-dutch              | T5      | 512           | 0,70     | 1,38     | 128        | 1      | 528481  | 0.1     | adafactor | 5e-3 | 2d 9h    |
+| t5-v1.1-base-dutch-uncased | t5-v1.1 | 1024          | 0,73     | 1,20     | 64         | 2      | 1014525 | 0.0     | adafactor | 5e-3 | 5d 5h    |
+| t5-v1.1-base-dutch-cased   | t5-v1.1 | 1024          | **0,78** | **0,96** | 64         | 2      | 1210000 | 0.0     | adafactor | 5e-3 | 6d 6h    |
+| t5-v1.1-large-dutch-cased  | t5-v1.1 | 512           | 0,76     | 1,07     | 64         | 1      | 1120000 | 0.1     | adafactor | 5e-3 | 86 13h   |
+The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.
+|                              | model   | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
+|------------------------------|---------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
+| t5-v1.1-base-dutch-cnn-test  | t5-v1.1 | 1024      | 96         | 34,8   | 13,6   | 25,2   | 32,1      | 79           | 6      | 64         | 26916 | 2h 40m   |
+| t5-v1.1-large-dutch-cnn-test | t5-v1.1 | 1024      | 96         | 34,4   | 13,6   | 25,3   | 31,7      | 81           | 5      | 16         | 89720 | 11h      |
+## Acknowledgements
+This project would not have been possible without compute generously provided by Google through the
+[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+instrumental in many, if not all parts of the training. The following repositories where helpful in setting up the TPU-VM,
+and getting an idea what sensible hyper-parameters are for training gpt2 from scratch.
+* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)