dpo

This model is a version of mistralai/Mistral-Nemo-Instruct-2407 fine-tuned with direct preference optimization (DPO) on the heat_transfer_dpo_fs dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1941
  • Rewards/chosen: -0.0331
  • Rewards/rejected: -3.1999
  • Rewards/accuracies: 0.9226
  • Rewards/margins: 3.1668
  • Logps/chosen: -1.8895
  • Logps/rejected: -33.5195
  • Logits/chosen: -1.2198
  • Logits/rejected: -1.2315
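For context, the reward metrics above are DPO's implicit rewards: β times the log-probability ratio between the policy and the reference model, and the margin is simply the chosen reward minus the rejected reward. A minimal sketch in plain Python; the log-prob values are illustrative, and β = 0.1 (TRL's default) is an assumption, since the card does not state the value used for this run:

```python
import math

def dpo_stats(policy_logp_chosen, ref_logp_chosen,
              policy_logp_rejected, ref_logp_rejected, beta=0.1):
    """Implicit DPO rewards and the sigmoid loss from summed log-probs.

    beta=0.1 is TRL's default; the beta actually used for this run is
    not stated in the card.
    """
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # DPO loss for one pair: -log sigmoid(margin)
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return reward_chosen, reward_rejected, margin, loss

# Illustrative numbers only (not taken from this run); a large gap in the
# rejected log-probs yields a margin of about 3.0 and a small loss.
rc, rr, m, loss = dpo_stats(-1.9, -1.6, -33.5, -3.2)
```

As training pushes the rejected completions' log-probs down much faster than the chosen ones', the margin grows and the loss shrinks, which is the pattern visible in the results table below.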

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 7
  • eval_batch_size: 7
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • total_train_batch_size: 14
  • total_eval_batch_size: 14
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 2
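These settings imply an effective batch of 7 × 2 = 14 sequences per optimizer step, and a learning rate that warms up linearly over the first 10% of steps and then follows a cosine decay. A minimal sketch of that schedule (the function name and the step counts are my own illustration, not taken from the training code):

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.1):
    """Linear warmup for the first `warmup_ratio` of training,
    then cosine decay to zero, as named in the hyperparameters."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Effective batch size from the settings above:
total_train_batch_size = 7 * 2  # train_batch_size * num_devices = 14
```

The peak learning rate (5e-06) is reached at the end of warmup and the rate falls back to zero by the final step.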

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---------------|--------|------|-----------------|----------------|------------------|--------------------|-----------------|--------------|----------------|---------------|-----------------|
| 0.692         | 0.0933 | 60   | 0.6930          | -0.0035        | -0.0037          | 0.4831             | 0.0003          | -1.5933      | -1.5578        | -1.3225       | -1.3224         |
| 0.6852        | 0.1866 | 120  | 0.6736          | 0.0133         | -0.0301          | 0.6339             | 0.0433          | -1.4261      | -1.8214        | -1.3307       | -1.3305         |
| 0.6513        | 0.2799 | 180  | 0.6289          | -0.0874        | -0.3014          | 0.6796             | 0.2140          | -2.4330      | -4.5348        | -1.3347       | -1.3351         |
| 0.5901        | 0.3733 | 240  | 0.5247          | -0.1472        | -0.7974          | 0.7470             | 0.6502          | -3.0306      | -9.4947        | -1.3616       | -1.3634         |
| 0.4131        | 0.4666 | 300  | 0.5557          | -0.2727        | -1.2844          | 0.7173             | 1.0117          | -4.2856      | -14.3649       | -1.3547       | -1.3596         |
| 0.3288        | 0.5599 | 360  | 0.3651          | -0.1389        | -1.6263          | 0.8562             | 1.4874          | -2.9477      | -17.7834       | -1.3326       | -1.3381         |
| 0.3723        | 0.6532 | 420  | 0.4056          | -0.1975        | -1.9240          | 0.8125             | 1.7265          | -3.5336      | -20.7607       | -1.3157       | -1.3211         |
| 0.2432        | 0.7465 | 480  | 0.3918          | -0.1403        | -1.8206          | 0.8095             | 1.6803          | -2.9622      | -19.7268       | -1.2997       | -1.3060         |
| 0.3456        | 0.8398 | 540  | 0.3036          | -0.0659        | -1.9517          | 0.8671             | 1.8858          | -2.2175      | -21.0373       | -1.2860       | -1.2914         |
| 0.3651        | 0.9331 | 600  | 0.2770          | -0.0762        | -2.3462          | 0.8879             | 2.2700          | -2.3211      | -24.9826       | -1.2661       | -1.2733         |
| 0.2788        | 1.0264 | 660  | 0.2802          | -0.1009        | -2.6298          | 0.8829             | 2.5289          | -2.5679      | -27.8189       | -1.2552       | -1.2633         |
| 0.2522        | 1.1198 | 720  | 0.2631          | -0.0485        | -2.3300          | 0.8938             | 2.2815          | -2.0434      | -24.8206       | -1.2537       | -1.2607         |
| 0.2458        | 1.2131 | 780  | 0.2431          | -0.0498        | -2.5135          | 0.9117             | 2.4637          | -2.0572      | -26.6558       | -1.2477       | -1.2548         |
| 0.193         | 1.3064 | 840  | 0.2387          | -0.0474        | -2.6414          | 0.9038             | 2.5939          | -2.0333      | -27.9347       | -1.2430       | -1.2504         |
| 0.2013        | 1.3997 | 900  | 0.2212          | -0.0433        | -2.7423          | 0.9157             | 2.6991          | -1.9913      | -28.9442       | -1.2349       | -1.2436         |
| 0.2382        | 1.4930 | 960  | 0.2145          | -0.0570        | -3.0965          | 0.9157             | 3.0395          | -2.1286      | -32.4857       | -1.2230       | -1.2335         |
| 0.1884        | 1.5863 | 1020 | 0.2086          | -0.0365        | -3.0158          | 0.9177             | 2.9793          | -1.9241      | -31.6789       | -1.2285       | -1.2385         |
| 0.2342        | 1.6796 | 1080 | 0.2047          | -0.0424        | -3.0708          | 0.9147             | 3.0284          | -1.9832      | -32.2288       | -1.2207       | -1.2312         |
| 0.2003        | 1.7729 | 1140 | 0.1988          | -0.0416        | -3.1710          | 0.9206             | 3.1294          | -1.9752      | -33.2306       | -1.2183       | -1.2294         |
| 0.134         | 1.8663 | 1200 | 0.1975          | -0.0410        | -3.1898          | 0.9206             | 3.1489          | -1.9684      | -33.4189       | -1.2191       | -1.2302         |
| 0.1411        | 1.9596 | 1260 | 0.1944          | -0.0376        | -3.2242          | 0.9266             | 3.1866          | -1.9343      | -33.7627       | -1.2250       | -1.2363         |
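As a quick consistency check on the logged metrics, the reward margin at every step is just the chosen reward minus the rejected reward. Verifying this for the final evaluation row (step 1260):

```python
# Final evaluation row of the training-results log (step 1260):
rewards_chosen = -0.0376
rewards_rejected = -3.2242
rewards_margins = 3.1866

# Margin = chosen reward - rejected reward (up to rounding in the log):
assert abs((rewards_chosen - rewards_rejected) - rewards_margins) < 1e-3
```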

Framework versions

  • PEFT 0.12.0
  • Transformers 4.46.0
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.20.1