Schaapje-2B-Pretrained
Model description
This continually pretrained model was trained on roughly 2.4 billion tokens of Dutch language data from Wikipedia and MC4.
The primary objective of the continual pretraining on Dutch was to make the model more 'fluent' when using the Dutch language. It has also gained some additional Dutch knowledge.
The IBM Granite 3.0 2B Instruct model was used as the base model.
See ibm-granite/granite-3.0-2b-instruct for all information about the IBM Granite foundation model.
Model usage
A basic example of how to use this continually pretrained model:
!! IMPORTANT NOTE !! As this is an instruct model that was continually pretrained on Dutch data, there is some degradation in instruction-following performance. This continually pretrained model should be further finetuned with SFT in which the embedding and lm_head layers are also trained. Given a proper Dutch SFT dataset this will restore the instruction-following/EOS token functionality. See the SFT training notebook for Schaapje for one of the ways to do this.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda'
model_name = 'robinsmits/Schaapje-2B-Pretrained'

# Load the model in bfloat16 together with the matching tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map = "auto",
                                             torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Dutch prompt: "Hi, how is it going?"
messages = [{"role": "user", "content": "Hoi hoe gaat het ermee?"}]

# Apply the chat template and move the input tokens to the GPU.
chat = tokenizer.apply_chat_template(messages,
                                     tokenize = False,
                                     add_generation_prompt = True)
input_tokens = tokenizer(chat, return_tensors = "pt").to(device)

# Generate and decode the response.
output = model.generate(**input_tokens,
                        max_new_tokens = 512,
                        do_sample = True)
output = tokenizer.decode(output[0], skip_special_tokens = False)
print(output)
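As noted above, instruction following can be restored by further SFT in which the embedding and lm_head layers are also trained. Below is a minimal sketch of one way to make those layers trainable with PEFT; the use of LoRA, the hyperparameters and the module names are illustrative assumptions and not necessarily what the Schaapje SFT training notebook does.

from peft import LoraConfig, get_peft_model

# LoRA adapters on the attention projections; the embedding and lm_head layers
# are trained fully via 'modules_to_save'. Module names assume a Llama-style
# architecture and should be verified for Granite 3.0.
peft_config = LoraConfig(task_type = "CAUSAL_LM",
                         r = 16,
                         lora_alpha = 32,
                         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
                         modules_to_save = ["embed_tokens", "lm_head"])

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

The wrapped model can then be finetuned on a proper Dutch SFT dataset; training embed_tokens and lm_head alongside the adapters is what restores the instruction-following/EOS token behaviour mentioned in the note above.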
Intended uses & limitations
As with all LLMs, this model can also exhibit bias and hallucinations. Regardless of how you use this model, always perform the necessary testing and validation.
Datasets and Licenses
The datasets used for the continual pretraining had different licenses (a minimal loading sketch follows the list below):
- wikimedia/wikipedia: cc-by-sa-3.0
- yhavinga/mc4_nl_cleaned: ODC-BY
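A minimal sketch of loading these datasets with the Hugging Face datasets library; the configuration names '20231101.nl' and 'tiny' are assumptions, check the dataset cards for the configurations that were actually used.

from datasets import load_dataset

# Dutch Wikipedia dump (configuration name is an assumption).
wiki_nl = load_dataset("wikimedia/wikipedia", "20231101.nl", split = "train")

# Cleaned Dutch MC4 subset (configuration name is an assumption).
mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "tiny", split = "train")

print(wiki_nl)
print(mc4_nl)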
Model Training
The continual pretraining notebook is available at the following link: Schaapje_2B_Pretrained
Training was performed with Google Colab PRO on an A100 (40GB) in multiple sessions. As the amount of data was more than could be processed within the maximum 24-hour session that Google Colab PRO allows, the dataset was split into 5 roughly equal parts. Training for each part took around 18 to 24 hours. The 'resume_from_checkpoint' option was used to properly continue pretraining across sessions.
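Below is a minimal sketch of how 'resume_from_checkpoint' can be used to continue pretraining across sessions; the output directory, the hyperparameters and the 'train_dataset_part' variable are illustrative assumptions, not the exact settings from the training notebook.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir = "schaapje-2b-pretrained",
                                  per_device_train_batch_size = 4,
                                  gradient_accumulation_steps = 8,
                                  num_train_epochs = 1,
                                  save_steps = 500,
                                  bf16 = True)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = train_dataset_part)

# When a new session is started, this picks up the latest checkpoint
# found in 'output_dir' instead of starting from scratch.
trainer.train(resume_from_checkpoint = True)

In the very first session trainer.train() would be called without the resume_from_checkpoint argument, as no checkpoint exists yet.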
The continual pretraining dataset was created with the script: prepare_pretraining_datasets
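The actual preparation is done by the prepare_pretraining_datasets script linked above; the sketch below only illustrates the common approach of tokenizing the raw text and packing it into fixed-length blocks, using the wiki_nl dataset and the tokenizer from the earlier snippets. The block size and column handling are assumptions.

from itertools import chain

block_size = 2048  # assumed sequence length

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all tokenized texts and split them into fixed-length blocks.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
              for k, v in concatenated.items()}
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = wiki_nl.map(tokenize_function, batched = True, remove_columns = wiki_nl.column_names)
packed = tokenized.map(group_texts, batched = True)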