puneeshkhanna committed
Commit debc1ce · verified · 1 Parent(s): eef6d24

Update eval results with multi turn flag

Files changed (1): README.md +27 -27
README.md CHANGED
@@ -90,7 +90,7 @@ print(response)
  ## Benchmarks
  We report in the following table our internal pipeline benchmarks.
  - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
- - We report **raw scores** obtained by applying chat template **without fewshot_as_multiturn** (unlike Llama3.1).
+ - We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
  - We use same batch-size across all models.
 
  <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
@@ -116,17 +116,17 @@ We report in the following table our internal pipeline benchmarks.
  <tr>
  <td rowspan="3">General</td>
  <td>MMLU (5-shot)</td>
- <td>29.3</td>
- <td>56.2</td>
- <td><b>56.4</b></td>
- <td>55.7</td>
+ <td>61.2</td>
+ <td><b>65.4</b></td>
+ <td>57.3</td>
+ <td>56.9</td>
  </tr>
  <tr>
  <td>MMLU-PRO (5-shot)</td>
- <td>11.9</td>
- <td>17.2</td>
- <td>23.3</td>
- <td><b>29.7</b></td>
+ <td>27.7</td>
+ <td><b>32.6</b></td>
+ <td>26.0</td>
+ <td>29.7</td>
  </tr>
  <tr>
  <td>IFEval</td>
@@ -138,21 +138,21 @@ We report in the following table our internal pipeline benchmarks.
  <tr>
  <td rowspan="3">Math</td>
  <td>GSM8K (5-shot)</td>
- <td>68.5</td>
- <td>58.5</td>
- <td>46.9</td>
- <td><b>71.9</b></td>
+ <td><b>76.8</b></td>
+ <td>56.7</td>
+ <td>29.8</td>
+ <td>74.8</td>
  </tr>
  <tr>
  <td>GSM8K (8-shot, COT)</td>
- <td><b>74.5</b></td>
- <td>64.0</td>
- <td>46.5</td>
- <td>71.6</td>
+ <td><b>78.8</b></td>
+ <td>60.8</td>
+ <td>35.0</td>
+ <td>78.0</td>
  </tr>
  <tr>
  <td>MATH Lvl-5 (4-shot)</td>
- <td>2.4</td>
+ <td>14.6</td>
  <td>0.0</td>
  <td>0.0</td>
  <td><b>19.9</b></td>
@@ -160,10 +160,10 @@ We report in the following table our internal pipeline benchmarks.
  <tr>
  <td rowspan="5">Reasoning</td>
  <td>Arc Challenge (25-shot)</td>
- <td>38.9</td>
- <td>50.0</td>
- <td>51.2</td>
- <td><b>58.5</b></td>
+ <td>50.9</td>
+ <td>55.0</td>
+ <td><b>56.2</b></td>
+ <td>55.5</td>
  </tr>
  <tr>
  <td>GPQA (0-shot)</td>
@@ -181,16 +181,16 @@ We report in the following table our internal pipeline benchmarks.
  </tr>
  <tr>
  <td>MUSR (0-shot)</td>
- <td>34.9</td>
+ <td>35.0</td>
  <td><b>40.2</b></td>
- <td>38.9</td>
+ <td>38.7</td>
  <td>39.0</td>
  </tr>
  <tr>
  <td>BBH (3-shot)</td>
- <td>33.1</td>
- <td>44.1</td>
- <td>38.1</td>
+ <td>41.8</td>
+ <td>44.5</td>
+ <td>39.5</td>
  <td><b>45.4</b></td>
  </tr>
  <tr>
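
For context, the fewshot_as_multiturn option referenced in the updated bullet is a standard lm-evaluation-harness flag that delivers few-shot examples as alternating user/assistant chat turns rather than as one concatenated prompt. A minimal sketch of the corresponding invocation is below; the model path, task selection, and batch size are placeholders, not the exact settings behind the reported numbers:

```bash
# Sketch of an lm-evaluation-harness run with the chat template applied and
# few-shot examples sent as multi-turn conversation (the methodology change
# this commit documents). Model path, task, and batch size are placeholders.
lm_eval --model hf \
  --model_args pretrained=your-org/your-model \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 16 \
  --apply_chat_template \
  --fewshot_as_multiturn
```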