# Llama-2-7b-hf-DPO-LookAhead3_FullEval_TTree1.4_TLoop0.7_TEval0.2_Filter0.2_V4.0
This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 0.4718
- Rewards/chosen: -2.4268
- Rewards/rejected: -3.3611
- Rewards/accuracies: 0.75
- Rewards/margins: 0.9343
- Logps/rejected: -120.4226
- Logps/chosen: -107.9291
- Logits/rejected: -1.6517
- Logits/chosen: -1.6556
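For context, the "rewards" above are DPO's implicit rewards: for each completion, beta times the difference between the policy and reference log-probabilities, with the margin being rewards/chosen minus rewards/rejected. A minimal sketch of the standard DPO loss follows; the `beta` value and log-prob inputs are illustrative placeholders, not values from this run:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss computed from per-sequence log-probabilities.

    The implicit reward for each completion is
    beta * (policy_logp - ref_logp), matching the Rewards/* metrics
    reported in cards like this one.
    """
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)): small when chosen outscores rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, reward_chosen, reward_rejected, margin

# Illustrative numbers only (not taken from this model's evaluation)
loss, rc, rr, margin = dpo_loss(-10.0, -12.0, -9.0, -9.5, beta=0.1)
```

A positive margin corresponds to an accuracy "hit" in the Rewards/accuracies metric, which counts how often the chosen completion's implicit reward exceeds the rejected one's.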
## Model description

More information needed
## Intended uses & limitations

More information needed
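Since this repository contains a PEFT adapter on top of Llama-2-7b-hf (see the framework versions below), one typical way to load it is sketched here. This is an assumption about intended usage, not documented by the card; the repo id is taken from the card, and access to the gated base model is required:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

repo = "LBK95/Llama-2-7b-hf-DPO-LookAhead3_FullEval_TTree1.4_TLoop0.7_TEval0.2_Filter0.2_V4.0"

# Loads the adapter together with its meta-llama/Llama-2-7b-hf base in one call
model = AutoPeftModelForCausalLM.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```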
## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
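These settings map onto a trl `DPOConfig` roughly as follows. This is a sketch, assuming trl's `DPOTrainer` was used (the card does not say so); the DPO `beta` is not listed above and is omitted here:

```python
from trl import DPOConfig

config = DPOConfig(
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,  # effective train batch size: 2 * 2 = 4
    lr_scheduler_type="cosine",
    warmup_steps=10,
    num_train_epochs=3,
    seed=42,
)
```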
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.7006 | 0.3051 | 54 | 0.6866 | 0.0193 | 0.0054 | 0.625 | 0.0139 | -86.7576 | -83.4686 | -0.8794 | -0.8889 |
| 0.69 | 0.6102 | 108 | 0.6159 | -0.0023 | -0.1736 | 0.875 | 0.1712 | -88.5472 | -83.6846 | -0.8944 | -0.9046 |
| 0.5649 | 0.9153 | 162 | 0.5807 | -0.1149 | -0.3833 | 0.875 | 0.2684 | -90.6444 | -84.8100 | -0.9769 | -0.9857 |
| 0.3921 | 1.2203 | 216 | 0.5138 | -0.6026 | -1.0626 | 0.875 | 0.4600 | -97.4372 | -89.6870 | -1.0866 | -1.0941 |
| 0.2459 | 1.5254 | 270 | 0.4782 | -0.8139 | -1.3669 | 0.875 | 0.5530 | -100.4805 | -91.7997 | -1.1226 | -1.1302 |
| 0.3946 | 1.8305 | 324 | 0.5178 | -1.1731 | -1.6961 | 0.75 | 0.5230 | -103.7727 | -95.3921 | -1.3492 | -1.3554 |
| 0.1509 | 2.1356 | 378 | 0.4919 | -1.6892 | -2.4213 | 0.75 | 0.7321 | -111.0249 | -100.5536 | -1.5040 | -1.5090 |
| 0.3279 | 2.4407 | 432 | 0.4825 | -2.1908 | -3.0498 | 0.75 | 0.8590 | -117.3094 | -105.5691 | -1.6421 | -1.6462 |
| 0.1453 | 2.7458 | 486 | 0.4718 | -2.4268 | -3.3611 | 0.75 | 0.9343 | -120.4226 | -107.9291 | -1.6517 | -1.6556 |
### Framework versions
- PEFT 0.13.0
- Transformers 4.45.1
- Pytorch 2.4.0+cu121
- Datasets 3.0.1
- Tokenizers 0.20.0