distily_experiments_loss_reverse_kl

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct. The training dataset is not specified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 2760.3779
  • eval_frwikippl: 28158.2578
  • eval_zhwikippl: 441247.4688
  • eval_loss: 3.1654
  • eval_runtime: 90.8509
  • eval_samples_per_second: 11.007
  • eval_steps_per_second: 2.752
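
For readers unfamiliar with the `*ppl` metrics, the sketch below shows the standard definition of perplexity: the exponential of the mean negative log-likelihood per token. The card does not state how Distily computes `eval_enwikippl` and the related values (tokenization, context length, and stride are all unknown), so the function name and example inputs here are illustrative assumptions, not the library's code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model
    assigned to each reference token. Hypothetical helper, not part
    of Distily.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical example: a model that gives each of 3 tokens probability 1/4
# is as uncertain as a uniform choice among 4 options.
example_log_probs = [math.log(0.25)] * 3
print(perplexity(example_log_probs))
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among four tokens.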

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_strategy: logits_activations
  • loss_fn: reverse_kl
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
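
As a rough illustration of the `loss_fn: reverse_kl` setting, the sketch below computes the reverse KL divergence, KL(student ‖ teacher), for a single token position in pure Python. This is the textbook definition only. The card does not show Distily's implementation, nor how the loss is combined with activation terms under `logits_activations`, so the function names and example logits are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student."""
    log_q = log_softmax(student_logits)  # student distribution
    log_p = log_softmax(teacher_logits)  # teacher distribution
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student): shown only for comparison."""
    return reverse_kl(teacher_logits, student_logits)

# Hypothetical logits over a 3-token vocabulary.
student = [3.0, 0.0, 0.0]
teacher = [1.0, 1.0, 0.0]
print("reverse KL:", reverse_kl(student, teacher))
print("forward KL:", forward_kl(student, teacher))
```

The two directions generally give different values. Reverse KL weights the log-ratio by the student's own probabilities, so it penalizes the student most for placing mass where the teacher assigns little.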

Resource Usage

Peak GPU Memory: 19.8832 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0697 | 11.6518 | | | | | 21.6262 |
| 0 | 0 | 180187.8438 | 182062.6875 | 131.8108 | 90.6539 | 11.031 | 2.758 | 181762.375 |
| 500 | 0.0808 | 14699.2041 | 52797.9922 | 6.0418 | 90.8884 | 11.003 | 2.751 | 371252.0312 |
| 1000 | 0.1616 | 8812.4561 | 47709.9297 | 4.9882 | 90.8533 | 11.007 | 2.752 | 384212.3438 |
| 1500 | 0.2424 | 7321.3081 | 44922.375 | 4.6195 | 90.7179 | 11.023 | 2.756 | 400192.5625 |
| 2000 | 0.3232 | 6277.4165 | 42254.6719 | 4.2012 | 90.8257 | 11.01 | 2.753 | 423631.0938 |
| 2500 | 0.4040 | 5452.0264 | 39927.7812 | 3.9955 | 90.7803 | 11.016 | 2.754 | 445022.5938 |
| 3000 | 0.4848 | 4708.5049 | 37660.8359 | 3.7784 | 90.8232 | 11.01 | 2.753 | 447453.4375 |
| 3500 | 0.5657 | 4329.6147 | 35350.4805 | 3.6816 | 90.8654 | 11.005 | 2.751 | 455292.8125 |
| 4000 | 0.6465 | 3840.0864 | 33493.6836 | 3.5800 | 90.7858 | 11.015 | 2.754 | 446474.3125 |
| 4500 | 0.7273 | 3495.4482 | 31764.3340 | 3.4447 | 90.8083 | 11.012 | 2.753 | 447611.3438 |
| 5000 | 0.8081 | 3245.5376 | 30812.8379 | 3.3323 | 90.7976 | 11.014 | 2.753 | 448982.8438 |
| 5500 | 0.8889 | 3057.9595 | 29516.0742 | 3.2926 | 90.7385 | 11.021 | 2.755 | 459842.8125 |
| 6000 | 0.9697 | 2831.3643 | 28517.0625 | 3.1956 | 90.7677 | 11.017 | 2.754 | 441979.4375 |
| 6187 | 0.9999 | 2760.3779 | 28158.2578 | 3.1654 | 90.8509 | 11.007 | 2.752 | 441247.4688 |
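
To put the final row in context, the short script below computes how far the student's perplexities sit from the teacher's, using only the numbers reported above. The dictionary names are illustrative, not part of the card or of Distily.

```python
# Values copied from the "teacher eval" row and the final step (6187) row.
teacher = {"enwikippl": 13.0697, "frwikippl": 11.6518, "zhwikippl": 21.6262}
student_final = {
    "enwikippl": 2760.3779,
    "frwikippl": 28158.2578,
    "zhwikippl": 441247.4688,
}

# Ratio of student perplexity to teacher perplexity (1.0 would mean parity).
ratios = {name: student_final[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the teacher's perplexity")
```

After one epoch the student's perplexity is roughly 211 times the teacher's on English, about 2,400 times on French, and about 20,000 times on Chinese.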

Framework versions

  • Distily 0.1.0
  • Transformers 4.43.3
  • PyTorch 2.3.0
  • Datasets 2.20.0

Safetensors model size: 494M params
Tensor type: BF16

Model tree for lapp0/distily_experiments_loss_reverse_kl

Base model: Qwen/Qwen2-0.5B