Continued pretraining
Hi,
commenting on your message on the model card.
We have had continued pretraining on our backlog for quite a long time already, and it will be our next project.
I have already done continued pretraining trials using Unsloth based on Llama3.1-8B but haven't had time to test those. We want to do it properly, and I have been building a Finnish "Fineweb-edu" dataset for a while, but it will still take time as I have only one GPU atm (GPU-poor as I am). We have read the paper from Japan which you are probably referring to (https://arxiv.org/pdf/2404.17790v1) and have been interested in this space ever since we saw https://huggingface.co/danish-foundation-models/munin-7b-alpha/tree/main.
Have you planned on trying this out with more data from HPLT/CulturaX/etc.?
Also, the Japanese researchers used a mix of English and Japanese data.
Did you try any tokenizer modifications?
This was more of a test to see if this works at all. Also, the Wikipedia dataset isn't several hundred gigabytes, so it's easier to work with. Training on Runpod cost only about 10 dollars; training on more data will obviously cost more. I didn't really deal with tokenization. I probably should, because Llama 3's tokenizer is pretty bad for Finnish (it almost doubles the token count compared to English).
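For anyone who wants to check the tokenizer overhead on their own text, here is a minimal sketch that computes average tokens per whitespace-separated word. Whether the "almost doubling" above was measured per word or per document isn't stated, so treat this as one possible way to measure it. `tokens_per_word` and `toy_tokenize` are hypothetical names made up for illustration; the toy tokenizer just chops words into 4-character pieces so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (an assumption, not a real model tokenizer):
    splits each word into pieces of at most 4 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


# With a real tokenizer you could pass something like this instead
# (assumes the `transformers` package and access to the gated Llama 3 repo):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
#   tokenize = lambda t: tok.encode(t, add_special_tokens=False)
```

Running it on a Finnish sample and a comparable English sample and comparing the two ratios gives a rough idea of how much more expensive Finnish text is in tokens.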
I might try to work with that huge 162 GB HPLT dataset, but I'll see if my PC can handle it. Otherwise I'll probably look for some smaller datasets I can actually work with.
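One way to work with a dataset that is too big for the machine is to read it line by line and stop after a fixed text budget. A rough sketch, assuming the data is available as a plain JSON Lines file with the document text under a `text` field (both are assumptions, not something confirmed about the HPLT release format); `take_text_budget` is a hypothetical helper name.

```python
import json
from typing import Iterator


def take_text_budget(path: str, max_bytes: int, field: str = "text") -> Iterator[str]:
    """Yield documents from a JSON Lines file until the UTF-8 size of the
    yielded text would exceed max_bytes. Reads one line at a time, so the
    whole file never has to fit in memory."""
    used = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            text = json.loads(line).get(field, "")
            size = len(text.encode("utf-8"))
            if used + size > max_bytes:
                break  # budget reached, stop reading
            used += size
            yield text
```

This takes the first documents in file order, so if the file is sorted in some way the subset will not be a random sample.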
Yeah thanks for commenting.
Great trial, and hopefully we can try this out too and see how much it boosts the models.
Hopefully I can get the 7B fine-tune out of the way soon and focus more on this kind of stuff. I'll be following your trials with interest; it would be cool if you could run some Finnish benchmarks on these continued-pretraining models.