kgreenewald
committed on
Update README.md
README.md
CHANGED
@@ -161,11 +161,12 @@ The following datasets were used for calibration and/or finetuning.
 ## Evaluation
 
 The model was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) datasets (not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) for each task, for the base model (Granite-3.0-8b-instruct) and Granite-Uncertainty-3.0-8b.
-The average ECE across tasks is 0.
+The average ECE across tasks for our method is 0.064 (out of 1) and is consistently low across tasks (maximum task ECE 0.10), compared to the base model average ECE of 0.20 and maximum task ECE of 0.60. Note that our ECE of 0.064 is smaller than the gap between the quantized certainty outputs (10% quantization steps). Additionally, the zero-shot performance on the MMLU tasks does not degrade, averaging at 89%.
 <!-- This section describes the evaluation protocols and provides the results. -->
 
 
-
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6602ffd971410cf02bf42c06/2MwP7DRZlNBtWSKWFvXOI.png)
 
 
 ## Model Card Authors
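For reference, a minimal sketch of how the reported ECE numbers could be computed, assuming the standard binned definition with 10 equal-width bins (matching the 10% quantization steps of the certainty outputs). The arrays in the usage example are hypothetical, not the reported MMLU results.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted average gap between stated confidence and observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a confidence bin (the last bin includes 1.0).
    bin_ids = np.digitize(confidences, bins[1:-1], right=False)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece

# Toy usage with made-up certainty scores and correctness flags:
conf = np.array([0.9, 0.8, 0.6, 0.9, 0.7])
hit = np.array([1, 1, 0, 1, 1])
print(round(expected_calibration_error(conf, hit), 3))
```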