# Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V1
This model is a DPO fine-tuned version of meta-llama/Llama-2-7b-hf; the training dataset is not documented in this card. It achieves the following results on the evaluation set (the reward metrics are explained briefly after the list):
- Loss: 0.8779
- Rewards/chosen: -2.5074
- Rewards/rejected: -2.4835
- Rewards/accuracies: 0.5833
- Rewards/margins: -0.0240
- Logps/rejected: -187.5269
- Logps/chosen: -172.9323
- Logits/rejected: -0.3789
- Logits/chosen: -0.3801
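Assuming the standard DPO formulation (an assumption; the card does not state the DPO beta or the training script), the reward figures above are implicit rewards computed from the log-probability ratio between the fine-tuned policy and the frozen reference model:

```math
r_\theta(x, y) = \beta \bigl(\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\bigr)
```

Under that reading, Rewards/margins is Rewards/chosen minus Rewards/rejected (here -2.5074 - (-2.4835) ≈ -0.024), and Rewards/accuracies is the fraction of evaluation pairs in which the chosen response receives the higher implicit reward.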
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a hedged configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
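The sketch below maps these hyperparameters onto a TRL `DPOConfig`. This is an assumption for illustration only: the card does not list a TRL version or the actual training script, and the model, reference model, and preference dataset are not specified here.

```python
from trl import DPOConfig  # assumes TRL's DPO implementation was used

training_args = DPOConfig(
    output_dir="Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V1",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,   # effective train batch size of 4
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    seed=42,
    optim="adamw_torch",             # Adam with betas=(0.9, 0.999) and eps=1e-8 (the defaults)
)
# training_args would then be passed to trl.DPOTrainer together with the base
# model, a reference model, and a chosen/rejected preference dataset.
```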
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6952 | 0.3016 | 87 | 0.6794 | -0.0405 | -0.0725 | 0.5833 | 0.0320 | -163.4170 | -148.2627 | 0.3515 | 0.3604 |
| 0.6655 | 0.6031 | 174 | 0.6384 | 0.0391 | -0.0895 | 0.5 | 0.1287 | -163.5874 | -147.4663 | 0.3348 | 0.3431 |
| 0.6246 | 0.9047 | 261 | 0.6568 | 0.1297 | 0.0077 | 0.5833 | 0.1220 | -162.6151 | -146.5603 | 0.2825 | 0.2904 |
| 0.3939 | 1.2062 | 348 | 0.6986 | -0.2304 | -0.4082 | 0.5833 | 0.1778 | -166.7741 | -150.1618 | 0.1283 | 0.1335 |
| 0.3329 | 1.5078 | 435 | 0.7227 | -0.5473 | -0.6512 | 0.5833 | 0.1039 | -169.2040 | -153.3306 | -0.0449 | -0.0420 |
| 0.6015 | 1.8094 | 522 | 0.7035 | -1.0222 | -1.2334 | 0.5 | 0.2112 | -175.0264 | -158.0799 | -0.0987 | -0.0963 |
| 0.0646 | 2.1109 | 609 | 0.7550 | -1.6915 | -1.8415 | 0.5 | 0.1500 | -181.1071 | -164.7728 | -0.2277 | -0.2271 |
| 0.1952 | 2.4125 | 696 | 0.8210 | -2.1941 | -2.2483 | 0.5833 | 0.0542 | -185.1751 | -169.7991 | -0.3347 | -0.3356 |
| 0.0774 | 2.7140 | 783 | 0.8779 | -2.5074 | -2.4835 | 0.5833 | -0.0240 | -187.5269 | -172.9323 | -0.3789 | -0.3801 |
### Framework versions
- PEFT 0.13.2
- Transformers 4.45.2
- Pytorch 2.4.0+cu121
- Datasets 3.0.1
- Tokenizers 0.20.1
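The PEFT entry above suggests this repository contains a PEFT adapter (e.g. LoRA) rather than full model weights. A minimal loading sketch under that assumption (dtype and device placement omitted; adjust to your hardware):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "LBK95/Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V1"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the DPO-trained adapter
model.eval()
```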