---
datasets:
- wikimedia/wikipedia
- yhavinga/mc4_nl_cleaned
language:
- nl
base_model:
- ibm-granite/granite-3.0-2b-instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- granite
- granite 3.0
- schaapje
inference: false
license: apache-2.0
---

Schaapje logo

# Schaapje-2B-Pretrained

## Model description

This continual pretrained model was trained on roughly 2.4 billion tokens of Dutch language data from Wikipedia and MC4. The primary objective of the continual pretraining on Dutch was to make the model more fluent in the Dutch language; it has also gained some additional Dutch knowledge.

The IBM Granite 3.0 2B Instruct model was used as the base model. See [ibm-granite/granite-3.0-2b-instruct](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct) for all information about the IBM Granite foundation model.

## Model usage

A basic example of how to use this continual pretrained model.

!! IMPORTANT NOTE !!

As this is an instruct model that was continual pretrained on Dutch data, there is some degradation in its instruction-following performance. This custom pretrained model should be further finetuned with SFT in which the embedding and lm_head layers are also trained. Given a proper SFT dataset in Dutch, this will restore the instruction-following/EOS-token functionality. See the SFT training notebook for Schaapje for one way to do this; a brief configuration sketch is also included at the end of this model card.

```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda'
model_name = 'robinsmits/Schaapje-2B-Pretrained'

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, device_map = "auto", torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat prompt with the model's chat template
messages = [{"role": "user", "content": "Hoi hoe gaat het ermee?"}]
chat = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
input_tokens = tokenizer(chat, return_tensors = "pt").to(device)

# Generate and decode the response
output = model.generate(**input_tokens, max_new_tokens = 512, do_sample = True)
output = tokenizer.decode(output[0], skip_special_tokens = False)
print(output)
```

## Intended uses & limitations

As with all LLMs, this model can exhibit bias and hallucinations. Regardless of how you use this model, always perform the necessary testing and validation.

## Datasets and Licenses

The datasets used for the continual pretraining have different licenses:

- [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia): cc-by-sa-3.0
- [yhavinga/mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned): ODC-BY

## Model Training

The continual pretraining notebook is available at the following link: [Schaapje_2B_Pretrained](https://github.com/RobinSmits/Schaapje/blob/main/Schaapje_2B_Pretrained.ipynb)

Training was performed with Google Colab PRO on an A100 40GB GPU across multiple sessions. As the amount of data was more than would fit within the maximum 24-hour session that Google Colab PRO allows, the dataset was split into 5 roughly equal parts. Training each part took around 18 to 24 hours, and 'resume_from_checkpoint' was used to properly continue pretraining across sessions.

The continual pretraining dataset was created with the script: [prepare_pretraining_datasets](https://github.com/RobinSmits/Schaapje/blob/main/prepare_pretraining_datasets.py)
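
Below is a minimal sketch of the multi-session pattern described above: training on one dataset part per Colab session and resuming from the last saved checkpoint in the next session. It is not the actual training configuration from the notebook; the dataset variable, output directory and hyperparameters are placeholders.

```
# Minimal sketch of resuming continual pretraining across sessions.
# Assumes 'model', 'tokenizer' and a tokenized dataset part are already prepared;
# all names and hyperparameters below are illustrative, not the values used for Schaapje.
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir = "schaapje_2b_pretraining_checkpoints",  # persist checkpoints to durable storage
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,
    learning_rate = 2e-5,
    num_train_epochs = 1,
    bf16 = True,
    save_strategy = "steps",
    save_steps = 500,
    logging_steps = 50,
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset_part,  # one of the 5 dataset parts
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False),
)

# First session: start fresh. Subsequent sessions: resume from the most recent
# checkpoint in output_dir.
trainer.train()                                  # session 1
# trainer.train(resume_from_checkpoint = True)   # sessions 2..5
```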
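
As mentioned in the usage note above, further SFT should also train the embedding and lm_head layers. The sketch below shows one way this could be configured with a PEFT LoraConfig via `modules_to_save`; the module names follow the Granite (Llama-style) naming, and the rank and target modules are placeholder assumptions rather than the setup used in the actual Schaapje SFT notebook.

```
# Hedged sketch (not the actual Schaapje SFT configuration): a LoRA setup in which
# the embedding and lm_head layers are trained in full via modules_to_save, so that
# Dutch instruction-following / EOS-token behaviour can be restored during SFT.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "robinsmits/Schaapje-2B-Pretrained", device_map = "auto", torch_dtype = torch.bfloat16)

peft_config = LoraConfig(
    r = 16,                                           # placeholder rank
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save = ["embed_tokens", "lm_head"],    # train these layers fully
    task_type = "CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# The resulting model can then be trained on a Dutch SFT dataset, for example with
# TRL's SFTTrainer or a standard transformers Trainer loop.
```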