dpo_p

This model is a fine-tuned version of mistralai/Mistral-Nemo-Instruct-2407 on the heat_transfer_dpo_p dataset. It achieves the following results on the evaluation set:

Loss: 0.1692
Rewards/chosen: 0.0877
Rewards/rejected: -4.1618
Rewards/accuracies: 0.9435
Rewards/margins: 4.2496
Logps/chosen: -3.6031
Logps/rejected: -46.4845
Logits/chosen: -1.1815
Logits/rejected: -1.2052
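
The reward margin is the gap between the chosen and rejected rewards, so the reported values can be cross-checked directly. The sketch below does that check in Python and, assuming the standard sigmoid DPO loss (the card does not state the loss type), evaluates the loss at the mean margin. The variable and function names are illustrative and are not taken from the training code.

```python
import math

# Evaluation metrics copied from the list above.
rewards_chosen = 0.0877
rewards_rejected = -4.1618
reported_margin = 4.2496
reported_loss = 0.1692

# Margin = chosen reward - rejected reward (both averaged over the eval set).
margin = rewards_chosen - rewards_rejected
print(f"recomputed margin: {margin:.4f} (reported: {reported_margin})")


def sigmoid_dpo_loss(reward_margin: float) -> float:
    """Per-example loss -log(sigmoid(margin)); assumes the sigmoid DPO loss."""
    return math.log1p(math.exp(-reward_margin))


# Loss evaluated at the mean margin. The reported loss is a mean over
# examples, so it need not match this value.
loss_at_mean_margin = sigmoid_dpo_loss(margin)
print(f"loss at mean margin: {loss_at_mean_margin:.4f} (reported mean loss: {reported_loss})")
```

The recomputed margin agrees with the reported one to within rounding. The reported mean loss is larger than the loss at the mean margin, which is expected for a convex loss averaged over examples, some of which (about 5.65%, given the 0.9435 accuracy) are still ranked incorrectly.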

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 7
eval_batch_size: 7
seed: 42
distributed_type: multi-GPU
num_devices: 2
total_train_batch_size: 14
total_eval_batch_size: 14
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 2
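
To make the schedule settings concrete, the sketch below reproduces the shape of a cosine schedule with linear warmup in plain Python. The total step count is an estimate inferred from the results table (step 60 at epoch 0.0933 implies about 643 steps per epoch), not a value stated in the card, and the function name `lr_at` is hypothetical. The formula is the usual linear-warmup, half-cosine-decay shape; the trainer's own implementation may differ in rounding details.

```python
import math

BASE_LR = 5e-06        # learning_rate from the card
WARMUP_RATIO = 0.1     # lr_scheduler_warmup_ratio from the card
NUM_EPOCHS = 2         # num_epochs from the card

# Assumption: steps per epoch estimated from the results table (60 / 0.0933).
STEPS_PER_EPOCH = round(60 / 0.0933)
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:5d}: lr = {lr_at(step):.3e}")
```

Under these assumptions the learning rate rises linearly to 5e-06 over roughly the first tenth of training and then decays along a half cosine to zero at the final step.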

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.6888	0.0933	60	0.7026	0.0852	0.0954	0.4722	-0.0102	-3.6290	-3.9127	-1.3205	-1.3206
0.6874	0.1866	120	0.6799	-0.0264	-0.0577	0.5853	0.0313	-4.7445	-5.4437	-1.3197	-1.3201
0.6277	0.2799	180	0.6050	-0.0526	-0.3283	0.6865	0.2757	-5.0064	-8.1496	-1.3104	-1.3121
0.6972	0.3733	240	0.6916	0.2062	0.0775	0.5645	0.1287	-2.4188	-4.0918	-1.3059	-1.3064
0.5403	0.4666	300	0.5434	-0.0861	-0.7153	0.7351	0.6292	-5.3416	-12.0196	-1.3176	-1.3214
0.4851	0.5599	360	0.4736	0.0745	-0.6669	0.7738	0.7414	-3.7352	-11.5354	-1.3169	-1.3211
0.5212	0.6532	420	0.4008	0.1432	-0.9171	0.8403	1.0603	-3.0484	-14.0373	-1.3134	-1.3191
0.2776	0.7465	480	0.3285	0.1142	-1.6779	0.8512	1.7921	-3.3384	-21.6450	-1.2922	-1.3021
0.351	0.8398	540	0.2724	0.1235	-2.0395	0.8770	2.1629	-3.2460	-25.2612	-1.2861	-1.2980
0.3464	0.9331	600	0.2994	0.0036	-2.1200	0.8700	2.1236	-4.4449	-26.0666	-1.2775	-1.2895
0.1758	1.0264	660	0.2081	0.1320	-2.7773	0.9137	2.9092	-3.1609	-32.6392	-1.2568	-1.2733
0.1554	1.1198	720	0.1848	0.0998	-3.1629	0.9246	3.2628	-3.4824	-36.4958	-1.2340	-1.2530
0.1542	1.2131	780	0.1818	0.0788	-3.7795	0.9345	3.8583	-3.6926	-42.6612	-1.2215	-1.2440
0.1354	1.3064	840	0.2401	0.0439	-3.8429	0.9147	3.8868	-4.0414	-43.2950	-1.2040	-1.2276
0.2017	1.3997	900	0.2583	0.0451	-3.7989	0.9147	3.8440	-4.0291	-42.8554	-1.2056	-1.2287
0.1909	1.4930	960	0.1759	0.0940	-3.8068	0.9395	3.9008	-3.5403	-42.9342	-1.2013	-1.2244
0.1503	1.5863	1020	0.1781	0.0949	-4.0544	0.9385	4.1493	-3.5316	-45.4105	-1.1901	-1.2136
0.199	1.6796	1080	0.1939	0.0256	-4.1360	0.9335	4.1616	-4.2245	-46.2266	-1.1883	-1.2111
0.2059	1.7729	1140	0.1670	0.0688	-4.1823	0.9405	4.2511	-3.7922	-46.6892	-1.1819	-1.2056
0.1566	1.8663	1200	0.1590	0.0963	-4.1650	0.9464	4.2613	-3.5175	-46.5159	-1.1893	-1.2134
0.1869	1.9596	1260	0.1640	0.0816	-4.1815	0.9454	4.2631	-3.6648	-46.6814	-1.1877	-1.2113
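
As a small illustration of reading the table, the sketch below picks the evaluation step with the lowest validation loss. The (step, validation loss) pairs are copied from the table above; the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the results table.
validation_loss_by_step = [
    (60, 0.7026), (120, 0.6799), (180, 0.6050), (240, 0.6916),
    (300, 0.5434), (360, 0.4736), (420, 0.4008), (480, 0.3285),
    (540, 0.2724), (600, 0.2994), (660, 0.2081), (720, 0.1848),
    (780, 0.1818), (840, 0.2401), (900, 0.2583), (960, 0.1759),
    (1020, 0.1781), (1080, 0.1939), (1140, 0.1670), (1200, 0.1590),
    (1260, 0.1640),
]

# Select the pair with the smallest validation loss.
best_step, best_loss = min(validation_loss_by_step, key=lambda pair: pair[1])
print(f"lowest validation loss: {best_loss:.4f} at step {best_step}")
```

The lowest logged validation loss is slightly below the final evaluation loss of 0.1692 reported at the top of the card, which was presumably measured after the last training step.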

Framework versions

PEFT 0.12.0
Transformers 4.46.0
PyTorch 2.4.0+cu121
Datasets 2.21.0
Tokenizers 0.20.1
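
Because PEFT appears in the framework versions, the published weights are probably an adapter that is applied on top of the base model. The card does not confirm this, so the sketch below is only one plausible way to load the model under that assumption. The function name `load_model` is hypothetical, and the bfloat16 dtype is an illustrative choice rather than a documented setting.

```python
BASE_MODEL_ID = "mistralai/Mistral-Nemo-Instruct-2407"  # base model named in the card
ADAPTER_ID = "Howard881010/heat_transfer_dpo_p"         # assumed to hold a PEFT adapter


def load_model(base_model_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model and attach the adapter; returns (model, tokenizer).

    Heavy dependencies are imported lazily so that defining this function
    does not require them to be installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.bfloat16
    )
    # Assumption: the repository contains adapter weights loadable by PEFT.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return model, tokenizer
```

Calling `load_model()` downloads the base model and the adapter, so it needs network access and enough memory for the base model's weights.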
