---
datasets:
- wikimedia/wikipedia
- yhavinga/mc4_nl_cleaned
language:
- nl
base_model:
- ibm-granite/granite-3.0-2b-instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- granite
- granite 3.0
- schaapje
inference: false
license: apache-2.0
---

Schaapje logo

# Schaapje-2B-Pretrained

## Model description

This continual pretrained model was trained on roughly 2.4 billion tokens of Dutch language data from Wikipedia and MC4. The primary objective of the continual pretraining on Dutch was to make the model more fluent in the Dutch language; it has also gained some additional Dutch knowledge.

The IBM Granite 3.0 2B Instruct model was used as the base model. See [ibm-granite/granite-3.0-2b-instruct](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct) for all information about the IBM Granite foundation model.

## Model usage

A basic example of how to use this continual pretrained model.

!! IMPORTANT NOTE !!

As this is an instruct model that was continual pretrained on Dutch data, there is some degradation in its instruction-following performance. This custom pretrained model should be further finetuned with SFT in which the embedding and lm_head layers are also trained. Given a proper SFT dataset in Dutch, this will restore the instruction-following/EOS-token functionality. See the SFT training notebook for Schaapje for one way to do this; a brief configuration sketch is also included at the end of this model card.

```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda'
model_name = 'robinsmits/Schaapje-2B-Pretrained'

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, device_map = "auto", torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat prompt with the model's chat template
messages = [{"role": "user", "content": "Hoi hoe gaat het ermee?"}]
chat = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
input_tokens = tokenizer(chat, return_tensors = "pt").to(device)

# Generate and decode the response
output = model.generate(**input_tokens, max_new_tokens = 512, do_sample = True)
output = tokenizer.decode(output[0], skip_special_tokens = False)
print(output)
```

## Intended uses & limitations

As with all LLMs, this model can exhibit bias and hallucinations. Regardless of how you use this model, always perform the necessary testing and validation.

## Datasets and Licenses

The datasets used for the continual pretraining have different licenses:

- [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia): cc-by-sa-3.0
- [yhavinga/mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned): ODC-BY

## Model Training

The continual pretraining notebook is available at the following link: [Schaapje_2B_Pretrained](https://github.com/RobinSmits/Schaapje/blob/main/Schaapje_2B_Pretrained.ipynb)

Training was performed with Google Colab PRO on an A100 40GB GPU across multiple sessions. As the amount of data was more than would fit within the maximum 24-hour session that Google Colab PRO allows, the dataset was split into 5 roughly equal parts. Training each part took around 18 to 24 hours, and 'resume_from_checkpoint' was used to properly continue pretraining across sessions.

The continual pretraining dataset was created with the script: [prepare_pretraining_datasets](https://github.com/RobinSmits/Schaapje/blob/main/prepare_pretraining_datasets.py)
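
Below is a minimal sketch of the multi-session pattern described above: training on one dataset part per Colab session and resuming from the last saved checkpoint in the next session. It is not the actual training configuration from the notebook; the dataset variable, output directory and hyperparameters are placeholders.

```
# Minimal sketch of resuming continual pretraining across sessions.
# Assumes 'model', 'tokenizer' and a tokenized dataset part are already prepared;
# all names and hyperparameters below are illustrative, not the values used for Schaapje.
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir = "schaapje_2b_pretraining_checkpoints",  # persist checkpoints to durable storage
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,
    learning_rate = 2e-5,
    num_train_epochs = 1,
    bf16 = True,
    save_strategy = "steps",
    save_steps = 500,
    logging_steps = 50,
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset_part,  # one of the 5 dataset parts
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False),
)

# First session: start fresh. Subsequent sessions: resume from the most recent
# checkpoint in output_dir.
trainer.train()                                  # session 1
# trainer.train(resume_from_checkpoint = True)   # sessions 2..5
```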
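
As mentioned in the usage note above, further SFT should also train the embedding and lm_head layers. The sketch below shows one way this could be configured with a PEFT LoraConfig via `modules_to_save`; the module names follow the Granite (Llama-style) naming, and the rank and target modules are placeholder assumptions rather than the setup used in the actual Schaapje SFT notebook.

```
# Hedged sketch (not the actual Schaapje SFT configuration): a LoRA setup in which
# the embedding and lm_head layers are trained in full via modules_to_save, so that
# Dutch instruction-following / EOS-token behaviour can be restored during SFT.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "robinsmits/Schaapje-2B-Pretrained", device_map = "auto", torch_dtype = torch.bfloat16)

peft_config = LoraConfig(
    r = 16,                                           # placeholder rank
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save = ["embed_tokens", "lm_head"],    # train these layers fully
    task_type = "CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# The resulting model can then be trained on a Dutch SFT dataset, for example with
# TRL's SFTTrainer or a standard transformers Trainer loop.
```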