sft_dpo_p

This model is a fine-tuned version of mistralai/Mistral-Nemo-Instruct-2407 on the heat_transfer_dpo_p dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1569
  • Rewards/chosen: 0.3090
  • Rewards/rejected: -5.2240
  • Rewards/accuracies: 0.9520
  • Rewards/margins: 5.5331
  • Logps/chosen: -1.4012
  • Logps/rejected: -57.0955
  • Logits/chosen: -0.1708
  • Logits/rejected: -0.2166
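
The reward and margin columns follow the convention of TRL's DPOTrainer (an assumption; the card does not name the trainer): each "reward" is the implicit DPO reward, i.e. the policy-vs-reference log-probability ratio scaled by the DPO temperature β, and the margin is simply chosen minus rejected. In LaTeX:

```latex
% Implicit DPO reward (Rafailov et al., 2023); \beta is the DPO
% temperature, which this card does not report.
r_\theta(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]

% The margin is the difference of the two reward columns; with the
% evaluation numbers above: 0.3090 - (-5.2240) = 5.5330 \approx 5.5331.
\text{margins} = r_\theta(x, y_{\mathrm{chosen}}) - r_\theta(x, y_{\mathrm{rejected}})
```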

Model description

Judging from the metrics and framework list on this card, this repository contains a PEFT adapter for mistralai/Mistral-Nemo-Instruct-2407, trained with DPO-style preference optimization on the heat_transfer_dpo_p dataset. No further details have been provided.

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • total_train_batch_size: 8
  • total_eval_batch_size: 8
  • optimizer: adamw_torch (AdamW) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
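
These values map one-to-one onto Hugging Face training arguments. Below is a minimal sketch of the run, assuming TRL's DPOTrainer (consistent with the metric names above, but not stated on the card); the dataset ID, split names, and LoRA settings are placeholders:

```python
# Sketch only: trainer choice, dataset splits, and LoRA settings are
# assumptions; everything in DPOConfig is copied from the list above.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-Nemo-Instruct-2407"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Per-device batch size 4 across 2 GPUs yields the reported
# total train/eval batch size of 8.
args = DPOConfig(
    output_dir="sft_dpo_p",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=42,
    optim="adamw_torch",         # AdamW, betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)

# Placeholder: heat_transfer_dpo_p is not a public dataset ID.
dataset = load_dataset("heat_transfer_dpo_p")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,  # `tokenizer=` on TRL < 0.12
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # rank etc. unknown
)
trainer.train()
```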

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.3669 | 0.0533 | 60 | 0.3126 | 0.3606 | -0.9629 | 0.9150 | 1.3235 | -0.8857 | -14.4843 | -0.5259 | -0.5415 |
| 0.2995 | 0.1067 | 120 | 0.2095 | 0.2729 | -3.2809 | 0.9320 | 3.5538 | -1.7626 | -37.6640 | -0.2224 | -0.2795 |
| 0.0686 | 0.16 | 180 | 0.2650 | 0.2280 | -4.0377 | 0.9220 | 4.2657 | -2.2109 | -45.2318 | -0.1560 | -0.2160 |
| 0.1007 | 0.2133 | 240 | 0.2294 | 0.2211 | -4.3632 | 0.9340 | 4.5843 | -2.2807 | -48.4872 | -0.1604 | -0.2090 |
| 0.2146 | 0.2667 | 300 | 0.1389 | 0.3621 | -3.4515 | 0.9390 | 3.8136 | -0.8700 | -39.3696 | -0.2215 | -0.2535 |
| 0.0175 | 0.32 | 360 | 0.1924 | 0.2508 | -4.5680 | 0.9430 | 4.8188 | -1.9836 | -50.5354 | -0.1839 | -0.2427 |
| 0.2375 | 0.3733 | 420 | 0.2330 | 0.2380 | -4.5576 | 0.9310 | 4.7956 | -2.1114 | -50.4313 | -0.1628 | -0.2199 |
| 0.2265 | 0.4267 | 480 | 0.2988 | 0.1994 | -4.5453 | 0.9190 | 4.7447 | -2.4975 | -50.3082 | -0.1496 | -0.2141 |
| 0.0854 | 0.48 | 540 | 0.1945 | 0.2575 | -4.3099 | 0.9370 | 4.5674 | -1.9162 | -47.9538 | -0.1301 | -0.1829 |
| 0.2707 | 0.5333 | 600 | 0.1508 | 0.3076 | -4.9413 | 0.9500 | 5.2489 | -1.4153 | -54.2679 | -0.1536 | -0.2036 |
| 0.161 | 0.5867 | 660 | 0.1841 | 0.2792 | -5.1292 | 0.9470 | 5.4084 | -1.6994 | -56.1473 | -0.1543 | -0.2038 |
| 0.4007 | 0.64 | 720 | 0.1888 | 0.2476 | -5.0702 | 0.9480 | 5.3178 | -2.0148 | -55.5571 | -0.1643 | -0.2078 |
| 0.1186 | 0.6933 | 780 | 0.2090 | 0.2271 | -5.1242 | 0.9450 | 5.3513 | -2.2203 | -56.0969 | -0.1519 | -0.1959 |
| 0.148 | 0.7467 | 840 | 0.1778 | 0.2731 | -5.1445 | 0.9470 | 5.4176 | -1.7601 | -56.3004 | -0.1673 | -0.2100 |
| 0.12 | 0.8 | 900 | 0.1519 | 0.3056 | -5.1776 | 0.9520 | 5.4832 | -1.4355 | -56.6311 | -0.1742 | -0.2169 |
| 0.1522 | 0.8533 | 960 | 0.1528 | 0.3085 | -5.2151 | 0.9520 | 5.5236 | -1.4062 | -57.0058 | -0.1666 | -0.2108 |
| 0.1224 | 0.9067 | 1020 | 0.1497 | 0.3084 | -5.2228 | 0.9550 | 5.5312 | -1.4068 | -57.0827 | -0.1706 | -0.2145 |
| 0.0707 | 0.96 | 1080 | 0.1587 | 0.3037 | -5.2156 | 0.9510 | 5.5192 | -1.4542 | -57.0105 | -0.1721 | -0.2193 |

Framework versions

  • PEFT 0.12.0
  • Transformers 4.46.0
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.20.1
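
Because this repository ships a PEFT adapter rather than full model weights, inference means loading the adapter on top of the base model. A minimal sketch, using the repository ID from this card (dtype, device placement, and the prompt are illustrative):

```python
# Sketch: attach the PEFT adapter to its base model for inference.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "Howard881010/heat_transfer_sft_dpo_p"

# Reads the adapter config, downloads the base model
# (mistralai/Mistral-Nemo-Instruct-2407), and attaches the adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# If the tokenizer is not bundled with the adapter, load it from the base model.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

prompt = "Explain convective heat transfer in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```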