---
license: gemma
language:
  - it
  - en
base_model:
  - VAGOsolutions/SauerkrautLM-gemma-2-9b-it
pipeline_tag: text-generation
library_name: transformers
datasets:
  - mii-llm/argilla-math-preferences-it
  - ruggsea/wsdm2024-cot-dataset
  - anakin87/evol-dpo-ita-reranked
  - mlabonne/orpo-dpo-mix-40k
---

# Gemma 2 9B Neogenesis ITA

Fine-tuned version of VAGOsolutions/SauerkrautLM-gemma-2-9b-it optimized for better performance in Italian.

- Good model with 9.24 billion parameters
- Supports an 8K context length

Need a smaller model? Try gemma-2-2b-neogenesis-ita.

## 🎮 Usage

💬🇮🇹 Try the model on Hugging Face Spaces

Text generation with Transformers

```python
import torch
from transformers import pipeline

model_id = "anakin87/gemma-2-9b-neogenesis-ita"

# Load the model in bfloat16 on the GPU
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [{"role": "user", "content": "Cos'è l'interesse composto? Spiega in maniera semplice e chiara."}]
outputs = pipe(messages, max_new_tokens=500)

# The pipeline returns the whole chat; the assistant's reply is the last message
print(outputs[0]["generated_text"][-1]["content"])
```
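
If GPU memory is tight, the model can also be loaded with 4-bit quantization. This is a minimal sketch, not part of the original card, and it assumes the `bitsandbytes` and `accelerate` packages are installed:

```python
import torch
from transformers import BitsAndBytesConfig, pipeline

# 4-bit NF4 quantization so the 9B model fits on smaller GPUs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipe = pipeline(
    "text-generation",
    model="anakin87/gemma-2-9b-neogenesis-ita",
    model_kwargs={"quantization_config": bnb_config},
    device_map="auto",
)
```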

๐Ÿ† Evaluation Results

The model was submitted to and evaluated on the Open Ita LLM Leaderboard, the most popular leaderboard for Italian language models.

| Model | MMLU_IT | ARC_IT | HELLASWAG_IT | Average |
|---|---|---|---|---|
| google/gemma-2-9b-it | 65.67 | 55.6 | 68.95 | 63.41 |
| VAGOsolutions/SauerkrautLM-gemma-2-9b-it | 65.76 | 61.25 | 72.10 | 66.37 |
| anakin87/gemma-2-9b-neogenesis-ita | 65.82 | 61.25 | 73.29 | 66.79 |

These results establish this model as a strong 9B model for Italian, outperforming 13-14B models and even surpassing some in the 30-70B range.

## 🔧 Training details

The model was fine-tuned using Hugging Face TRL and applying Direct Preference Optimization.

I adopted a relatively new technique for parameter-efficient learning: Spectrum. The idea is to train only the layers of the model with high Signal-to-Noise Ratio (SNR) and ❄️ freeze the rest. Specifically, training focused on the top 20% most informative layers.
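
To illustrate the idea, here is a rough sketch of Spectrum-style selective freezing; the module name patterns are hypothetical placeholders, while the real list of high-SNR modules is produced by Spectrum's SNR scan (typically as a YAML file):

```python
import re

# Hypothetical name patterns for the high-SNR modules selected by the Spectrum scan
# (the actual list comes from Spectrum's analysis, not from this hard-coded example)
UNFROZEN_PATTERNS = [
    r"layers\.\d+\.mlp\.down_proj",
    r"layers\.\d+\.self_attn\.o_proj",
]

def apply_spectrum_freeze(model, patterns=UNFROZEN_PATTERNS):
    """Freeze every parameter, then unfreeze only the selected high-SNR modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(re.search(p, name) for p in patterns)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable / total:.1%} of {total:,}")
```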

Batch size: 16; learning rate: 1e-6; epochs: 1.
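
A minimal sketch of what the corresponding TRL setup could look like, using the hyperparameters above and one of the preference datasets listed in the Training data section; the Kaggle notebook linked below contains the actual training code, and argument names here follow recent TRL releases:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model_id = "VAGOsolutions/SauerkrautLM-gemma-2-9b-it"
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# One of the preference datasets listed in this card (prompt / chosen / rejected format)
train_dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

# Hyperparameters from the card: effective batch size 16, learning rate 1e-6, 1 epoch
training_args = DPOConfig(
    output_dir="gemma-2-9b-neogenesis-ita-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # 2 x 8 = effective batch size of 16
    learning_rate=1e-6,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,  # optionally frozen first with the Spectrum-style helper above
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```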

The training process took approximately 12 hours on a single NVIDIA A100 GPU (80GB VRAM).

For the training code, see the DPO section in this 📓 Kaggle notebook, modified to use a different base model, hyperparameters, and no on-policy data.

## 🗃️ Training data

The model was trained primarily on Italian data, with a small portion of English data included.

For Direct Preference Optimization:
- mii-llm/argilla-math-preferences-it
- ruggsea/wsdm2024-cot-dataset
- anakin87/evol-dpo-ita-reranked
- mlabonne/orpo-dpo-mix-40k

🙏 Thanks to the authors for providing these datasets.

## 🛡️ Safety

While this model was not specifically fine-tuned for safety, its selective training with the Spectrum technique helps preserve certain safety features from the original model.