---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# Granite Uncertainty 3.0 8b

## Model Summary

**Granite Uncertainty 3.0 8b** is a LoRA adapter for [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct),
adding the capability to provide calibrated certainty scores when answering questions on demand, while retaining the full abilities of the [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) model.

- **Developer:** IBM Research
- **Model type:** LoRA adapter for [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Paper:** The **Granite Uncertainty 3.0 8b** model is finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).

## Usage

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Intended use

**Granite Uncertainty 3.0 8b** is lightly tuned so that its behavior closely mimics that of [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct),
with the added ability to generate certainty scores for answers to questions when prompted.

**Certainty score definition** The model will respond with a certainty percentage, quantized to 10 possible values (i.e., 5%, 15%, 25%, ..., 95%).
This percentage is *calibrated* in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct. See the evaluation experiment below for out-of-distribution verification of this behavior.
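
As a minimal illustration of this quantization, the single digit the model emits in the `certainty` role maps to the midpoint of its 10% bin. This hypothetical helper simply mirrors the `5 + uq_score * 10` conversion used in the Quickstart below:

```python
def certainty_percent(digit: int) -> int:
    """Map the model's quantized certainty digit (0-9) to a percentage:
    0 -> 5%, 1 -> 15%, ..., 9 -> 95%."""
    return 5 + digit * 10
```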

**Important note** Certainty is inherently an intrinsic property of a model and its abilities. **Granite Uncertainty 3.0 8b** is not intended to predict the certainty of responses generated by any other model.

Answering a question and obtaining a certainty score proceeds as follows.

1. Prompt the model with a system and/or user prompt.
2. Use the model to generate a response as normal (via the `assistant` role).
3. Prompt the model to generate a certainty score by generating in the `certainty` role (by appending `<|start_of_role|>certainty<|end_of_role|>` and generating; see the rendered-prompt sketch below).
4. The model will respond with a certainty percentage, quantized in steps of 10% (i.e., 5%, 15%, 25%, ..., 95%).

When not given the certainty generation prompt `<|start_of_role|>certainty<|end_of_role|>`, the model's behavior should mimic that of the base model [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct).
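
For concreteness, here is a sketch of what the fully rendered prompt at step 3 might look like, assuming the Granite chat template's role markers; in practice the string is produced by `tokenizer.apply_chat_template`, as in the Quickstart below, and the exact rendering may differ:

```
<|start_of_role|>system<|end_of_role|>You are a cautious assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What is IBM?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>IBM is a technology company...<|end_of_text|>
<|start_of_role|>certainty<|end_of_role|>
```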

### Quickstart Example

The following code shows how to use the Granite Uncertainty model to answer a question and obtain an intrinsic calibrated certainty score. Note that a generic system prompt is included; it is not necessary and can be modified as needed.

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

token = os.getenv("HF_MISTRAL_TOKEN")
BASE_NAME = "ibm-granite/granite-3.0-8b-instruct"
LORA_NAME = "ibm-granite/granite-uncertainty-3.0-8b-lora"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the base model and wrap it with the uncertainty LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True, token=token)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_UQ = PeftModel.from_pretrained(model_base, LORA_NAME)

system_prompt = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."  # NOTE: this is generic, it can be changed
question = "What is IBM?"
print("Question: " + question)
question_chat = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": question
    },
]

# Generate the answer as usual (assistant role)
input_text = tokenizer.apply_chat_template(question_chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=80)
output_text = tokenizer.decode(output[0])
answer = output_text.split("assistant<|end_of_role|>")[1]
print("Answer: " + answer)

# Generate the certainty score by appending the certainty role and generating a single token
uq_generation_prompt = "<|start_of_role|>certainty<|end_of_role|>"
uq_chat = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": question
    },
    {
        "role": "assistant",
        "content": answer
    },
]

uq_text = tokenizer.apply_chat_template(uq_chat, tokenize=False) + uq_generation_prompt
inputs = tokenizer(uq_text, return_tensors="pt")
output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1)
output_text = tokenizer.decode(output[0])
uq_score = int(output_text[-1])  # the generated certainty token is a digit 0-9
print("Certainty: " + str(5 + uq_score * 10) + "%")
```
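
One natural way to consume the score downstream, shown here as a hypothetical usage pattern rather than part of the model itself, is to gate acceptance of the answer on a certainty threshold:

```python
# Hypothetical downstream gating on the certainty score computed above.
certainty = 5 + uq_score * 10  # percent
if certainty < 65:  # the threshold is application-specific
    print("Low certainty; consider retrieval augmentation or human review.")
else:
    print("Answer accepted at " + str(certainty) + "% certainty.")
```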

## Training Details

The **Granite Uncertainty 3.0 8b** model is a LoRA adapter finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
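
For intuition, the cited Thermometer method trains an auxiliary model to predict a temperature used to rescale an LLM's logits. A generic temperature-scaling step, given as an illustrative sketch and not the training code for this adapter, looks like:

```python
import torch

def temperature_scale(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Generic temperature scaling: divide logits by a (predicted) temperature
    before the softmax, softening (T > 1) or sharpening (T < 1) confidences."""
    return torch.softmax(logits / temperature, dim=-1)
```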

### Training Data

The following datasets were used for calibration and/or finetuning.

* [BigBench](https://huggingface.co/datasets/tasksource/bigbench)
* [MRQA](https://huggingface.co/datasets/mrqa-workshop/mrqa)
* [newsqa](https://huggingface.co/datasets/lucadiliello/newsqa)
* [trivia_qa](https://huggingface.co/datasets/mandarjoshi/trivia_qa)
* [search_qa](https://huggingface.co/datasets/lucadiliello/searchqa)
* [openbookqa](https://huggingface.co/datasets/allenai/openbookqa)
* [web_questions](https://huggingface.co/datasets/Stanford/web_questions)
* [smiles-qa](https://huggingface.co/datasets/alxfgh/ChEMBL_Drug_Instruction_Tuning)
* [orca-math](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k)
* [ARC-Easy](https://huggingface.co/datasets/allenai/ai2_arc)
* [commonsense_qa](https://huggingface.co/datasets/tau/commonsense_qa)
* [social_i_qa](https://huggingface.co/datasets/allenai/social_i_qa)
* [super_glue](https://huggingface.co/datasets/aps/super_glue)
* [figqa](https://huggingface.co/datasets/nightingal3/fig-qa)
* [riddle_sense](https://huggingface.co/datasets/INK-USC/riddle_sense)
* [ag_news](https://huggingface.co/datasets/fancyzhx/ag_news)
* [medmcqa](https://huggingface.co/datasets/openlifescienceai/medmcqa)
* [dream](https://huggingface.co/datasets/dataset-org/dream)
* [codah](https://huggingface.co/datasets/jaredfern/codah)
* [piqa](https://huggingface.co/datasets/ybisk/piqa)

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

The model was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) datasets (not used in training), measuring the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) of the certainty scores on each task for both the base model (Granite-3.0-8b-instruct) and Granite-Uncertainty-3.0-8b.
The average ECE across tasks is 0.06 (out of 1). Note that this is smaller than the gap between the quantized certainty outputs (10% quantization steps).
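
For reference, here is a minimal sketch of how ECE can be computed over a set of scored answers; this is an illustrative implementation, not the exact evaluation script used above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the weighted average, over confidence bins, of the absolute gap
    between each bin's mean confidence and its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)  # in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1 if correct, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```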

## Model Card Authors

Kristjan Greenewald
|