---
license: gemma
language:
  - it
  - en
base_model:
  - VAGOsolutions/SauerkrautLM-gemma-2-9b-it
pipeline_tag: text-generation
library_name: transformers
datasets:
  - mii-llm/argilla-math-preferences-it
  - ruggsea/wsdm2024-cot-dataset
  - anakin87/evol-dpo-ita-reranked
  - mlabonne/orpo-dpo-mix-40k
---

# Gemma 2 9B Neogenesis ITA

Fine-tuned version of VAGOsolutions/SauerkrautLM-gemma-2-9b-it optimized for better performance in Italian.

- Good model with 9.24 billion parameters
- Supports an 8K context length

Need a smaller model? Try gemma-2-2b-neogenesis-ita.

## 🎮 Usage

💬🇮🇹 Try the model on Hugging Face Spaces

Text generation with Transformers

```python
import torch
from transformers import pipeline

model_id = "anakin87/gemma-2-9b-neogenesis-ita"

# Load the model in bfloat16 on the GPU
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [{"role": "user", "content": "Cos'è l'interesse composto? Spiega in maniera semplice e chiara."}]
outputs = pipe(messages, max_new_tokens=500)

# The pipeline returns the whole chat; the assistant's reply is the last message
print(outputs[0]["generated_text"][-1]["content"])
```
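
If GPU memory is tight, the model can also be loaded with 4-bit quantization. This is a minimal sketch, not part of the original card, and it assumes the `bitsandbytes` and `accelerate` packages are installed:

```python
import torch
from transformers import BitsAndBytesConfig, pipeline

# 4-bit NF4 quantization so the 9B model fits on smaller GPUs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipe = pipeline(
    "text-generation",
    model="anakin87/gemma-2-9b-neogenesis-ita",
    model_kwargs={"quantization_config": bnb_config},
    device_map="auto",
)
```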

๐Ÿ† Evaluation Results

The model was submitted to and evaluated on the Open Ita LLM Leaderboard, the most popular leaderboard for Italian language models.

| Model | MMLU_IT | ARC_IT | HELLASWAG_IT | Average |
|---|---|---|---|---|
| google/gemma-2-9b-it | 65.67 | 55.6 | 68.95 | 63.41 |
| VAGOsolutions/SauerkrautLM-gemma-2-9b-it | 65.76 | 61.25 | 72.10 | 66.37 |
| anakin87/gemma-2-9b-neogenesis-ita | 65.82 | 61.25 | 73.29 | 66.79 |

These results establish this model as a strong 9B model for Italian, outperforming 13-14B models and even surpassing some in the 30-70B range.

## 🔧 Training details

The model was fine-tuned using Hugging Face TRL and applying Direct Preference Optimization.

I adopted a relatively new technique for parameter-efficient learning: Spectrum. The idea is to train only the layers of the model with high Signal-to-Noise Ratio (SNR) and ❄️ freeze the rest. Specifically, training focused on the top 20% most informative layers.
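
To illustrate the idea, here is a rough sketch of Spectrum-style selective freezing; the module name patterns are hypothetical placeholders, while the real list of high-SNR modules is produced by Spectrum's SNR scan (typically as a YAML file):

```python
import re

# Hypothetical name patterns for the high-SNR modules selected by the Spectrum scan
# (the actual list comes from Spectrum's analysis, not from this hard-coded example)
UNFROZEN_PATTERNS = [
    r"layers\.\d+\.mlp\.down_proj",
    r"layers\.\d+\.self_attn\.o_proj",
]

def apply_spectrum_freeze(model, patterns=UNFROZEN_PATTERNS):
    """Freeze every parameter, then unfreeze only the selected high-SNR modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(re.search(p, name) for p in patterns)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable / total:.1%} of {total:,}")
```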

Batch size: 16; learning rate: 1e-6; epochs: 1.
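
A minimal sketch of what the corresponding TRL setup could look like, using the hyperparameters above and one of the preference datasets listed in the Training data section; the Kaggle notebook linked below contains the actual training code, and argument names here follow recent TRL releases:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model_id = "VAGOsolutions/SauerkrautLM-gemma-2-9b-it"
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# One of the preference datasets listed in this card (prompt / chosen / rejected format)
train_dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

# Hyperparameters from the card: effective batch size 16, learning rate 1e-6, 1 epoch
training_args = DPOConfig(
    output_dir="gemma-2-9b-neogenesis-ita-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # 2 x 8 = effective batch size of 16
    learning_rate=1e-6,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,  # optionally frozen first with the Spectrum-style helper above
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```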

The training process took approximately 12 hours on a single NVIDIA A100 GPU (80GB VRAM).

For the training code, see the DPO section in this 📓 Kaggle notebook, modified to use a different base model, hyperparameters, and no on-policy data.

## 🗃️ Training data

The model was trained primarily on Italian data, with a small portion of English data included.

For Direct Preference Optimization:
- mii-llm/argilla-math-preferences-it
- ruggsea/wsdm2024-cot-dataset
- anakin87/evol-dpo-ita-reranked
- mlabonne/orpo-dpo-mix-40k

🙏 Thanks to the authors for providing these datasets.

## 🛡️ Safety

While this model was not specifically fine-tuned for safety, its selective training with the Spectrum technique helps preserve certain safety features from the original model.