Schaapje-2B-Pretrained
Model description
This continually pretrained model was trained on roughly 2.4 billion tokens of Dutch language data from Wikipedia and MC4.
The primary objective of the continual pretraining on Dutch was to make the model more 'fluent' when using the Dutch language. It has also gained some additional Dutch knowledge.
The IBM Granite 3.0 2B Instruct model was used as the base model.
See ibm-granite/granite-3.0-2b-instruct for all information about the IBM Granite foundation model.
Model usage
A basic example of how to use this continually pretrained model:
!! IMPORTANT NOTE !! As this is an instruct model that was continually pretrained on Dutch data, there is some degradation in instruction-following performance. This continually pretrained model should be further finetuned with SFT in which the embedding and lm_head layers are also trained. Given a proper Dutch SFT dataset this will restore the instruction-following/EOS token functionality. See the SFT training notebook for Schaapje for one of the ways to do this.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda'
model_name = 'robinsmits/Schaapje-2B-Pretrained'

# Load the model in bfloat16 together with the matching tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map = "auto",
                                             torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Dutch prompt: "Hi, how is it going?"
messages = [{"role": "user", "content": "Hoi hoe gaat het ermee?"}]

# Apply the chat template and move the input tokens to the GPU.
chat = tokenizer.apply_chat_template(messages,
                                     tokenize = False,
                                     add_generation_prompt = True)
input_tokens = tokenizer(chat, return_tensors = "pt").to(device)

# Generate and decode the response.
output = model.generate(**input_tokens,
                        max_new_tokens = 512,
                        do_sample = True)
output = tokenizer.decode(output[0], skip_special_tokens = False)
print(output)
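As noted above, instruction following can be restored by further SFT in which the embedding and lm_head layers are also trained. Below is a minimal sketch of one way to make those layers trainable with PEFT; the use of LoRA, the hyperparameters and the module names are illustrative assumptions and not necessarily what the Schaapje SFT training notebook does.

from peft import LoraConfig, get_peft_model

# LoRA adapters on the attention projections; the embedding and lm_head layers
# are trained fully via 'modules_to_save'. Module names assume a Llama-style
# architecture and should be verified for Granite 3.0.
peft_config = LoraConfig(task_type = "CAUSAL_LM",
                         r = 16,
                         lora_alpha = 32,
                         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
                         modules_to_save = ["embed_tokens", "lm_head"])

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

The wrapped model can then be finetuned on a proper Dutch SFT dataset; training embed_tokens and lm_head alongside the adapters is what restores the instruction-following/EOS token behaviour mentioned in the note above.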
Intended uses & limitations
As with all LLMs, this model can also exhibit bias and hallucinations. Regardless of how you use this model, always perform the necessary testing and validation.
Datasets and Licenses
The datasets used for the continual pretraining had different licenses (a minimal loading sketch follows the list below):
- wikimedia/wikipedia: cc-by-sa-3.0
- yhavinga/mc4_nl_cleaned: ODC-BY
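A minimal sketch of loading these datasets with the Hugging Face datasets library; the configuration names '20231101.nl' and 'tiny' are assumptions, check the dataset cards for the configurations that were actually used.

from datasets import load_dataset

# Dutch Wikipedia dump (configuration name is an assumption).
wiki_nl = load_dataset("wikimedia/wikipedia", "20231101.nl", split = "train")

# Cleaned Dutch MC4 subset (configuration name is an assumption).
mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "tiny", split = "train")

print(wiki_nl)
print(mc4_nl)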
Model Training
The continual pretraining notebook is available at the following link: Schaapje_2B_Pretrained
Training was performed with Google Colab PRO on an A100 (40GB) in multiple sessions. As the amount of data was more than could be processed within the maximum 24-hour session that Google Colab PRO allows, the dataset was split into 5 roughly equal parts. Training for each part took around 18 to 24 hours. The 'resume_from_checkpoint' option was used to properly continue pretraining across sessions.
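Below is a minimal sketch of how 'resume_from_checkpoint' can be used to continue pretraining across sessions; the output directory, the hyperparameters and the 'train_dataset_part' variable are illustrative assumptions, not the exact settings from the training notebook.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir = "schaapje-2b-pretrained",
                                  per_device_train_batch_size = 4,
                                  gradient_accumulation_steps = 8,
                                  num_train_epochs = 1,
                                  save_steps = 500,
                                  bf16 = True)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = train_dataset_part)

# When a new session is started, this picks up the latest checkpoint
# found in 'output_dir' instead of starting from scratch.
trainer.train(resume_from_checkpoint = True)

In the very first session trainer.train() would be called without the resume_from_checkpoint argument, as no checkpoint exists yet.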
The continual pretraining dataset was created with the script: prepare_pretraining_datasets
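The actual preparation is done by the prepare_pretraining_datasets script linked above; the sketch below only illustrates the common approach of tokenizing the raw text and packing it into fixed-length blocks, using the wiki_nl dataset and the tokenizer from the earlier snippets. The block size and column handling are assumptions.

from itertools import chain

block_size = 2048  # assumed sequence length

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all tokenized texts and split them into fixed-length blocks.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
              for k, v in concatenated.items()}
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = wiki_nl.map(tokenize_function, batched = True, remove_columns = wiki_nl.column_names)
packed = tokenized.map(group_texts, batched = True)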