RichardErkhov committed
Commit 7fbb097 · 1 Parent(s): 3349ba7
uploaded readme
README.md ADDED
@@ -0,0 +1,176 @@
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


FollowIR-7B - bnb 4bits
- Model creator: https://huggingface.co/jhu-clsp/
- Original model: https://huggingface.co/jhu-clsp/FollowIR-7B/


Original model description:
---
license: apache-2.0
language:
- en
tags:
- retrieval
- instructions
- reranking
- mteb
datasets:
- jhu-clsp/FollowIR-train
model-index:
- name: FollowIR-7B
  results:
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/core17-instructions
      name: MTEB Core17InstructionRetrieval
      config: default
      split: test
      revision: e39ff896cf3efbbdeeb950e6bd7c79f266995b07
    metrics:
    - type: p-MRR
      value: 16.47851858684521
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/news21-instructions
      name: MTEB News21InstructionRetrieval
      config: default
      split: test
      revision: e0144086b45fe31ac125e9ac1a83b6a409bb6ca6
    metrics:
    - type: p-MRR
      value: 6.2615989256510005
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/robust04-instructions
      name: MTEB Robust04InstructionRetrieval
      config: default
      split: test
      revision: a5a1c4fe2bc528ac12e83f8cdf82178da85d2f1d
    metrics:
    - type: p-MRR
      value: 13.717553757582253
---

# Model Summary

FollowIR-7B is an instruction-tuned language model for reranking in retrieval. It is [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) fine-tuned on retrieval data with instructions from the FollowIR dataset. These instructions are human-written and were taken from TREC tracks. The paper reports that FollowIR-7B outperforms the other retrieval models it evaluates at following instructions; see the paper for more details.

- **Repository:** [orionw/FollowIR](https://github.com/orionw/FollowIR)
- **Paper:** https://arxiv.org/abs/2403.15246
- **Instruction-Training Dataset:** [jhu-clsp/FollowIR-train](https://huggingface.co/datasets/jhu-clsp/FollowIR-train)


# Use

The example below computes the relevance score of a query-document pair:
```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
import torch

# model loading and setup
model_name = "jhu-clsp/FollowIR-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name
).cuda()
# left padding keeps the last position of every prompt aligned at index -1
tokenizer = AutoTokenizer.from_pretrained(
    model_name, padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token
# vocabulary ids of the two answer tokens the model chooses between
token_false_id = tokenizer.get_vocab()["false"]
token_true_id = tokenizer.get_vocab()["true"]
template = """<s> [INST] You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.

Query: {query}
Document: {text}
Relevant (only output one word, either "true" or "false"): [/INST] """


# Let's define some example queries with instructions in the query and the passage
query1 = "What movies were written by James Cameron? A relevant document would describe a movie that was written by James Cameron only and not with anyone else"
query2 = "What movies were directed by James Cameron? A relevant document would describe any movie that was directed by James Cameron"
passages = ["Avatar: The Way of Water is a 2022 American epic science fiction film co-produced and directed by James Cameron, who co-wrote the screenplay with Rick Jaffa and Amanda Silver from a story the trio wrote with Josh Friedman and Shane Salerno. Distributed by 20th Century Studios, it is the sequel to Avatar (2009) and the second installment in the Avatar film series."] * 2

prompts = [
    template.format(query=query, text=text) for (query, text) in zip([query1, query2], passages)
]
tokens = tokenizer(
    prompts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    pad_to_multiple_of=None,
)

# move to cuda if desired
for key in tokens:
    tokens[key] = tokens[key].cuda()

# calculate the scores by comparing true and false tokens
# (no_grad: inference only, so skip building the autograd graph)
with torch.no_grad():
    batch_scores = model(**tokens).logits[:, -1, :]
true_vector = batch_scores[:, token_true_id]
false_vector = batch_scores[:, token_false_id]
batch_scores = torch.stack([false_vector, true_vector], dim=1)
batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
scores = batch_scores[:, 1].exp().tolist()
print(scores) # [0.0020704232156276703, 0.9999990463256836] first document is not relevant, as expected
```
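
The final scoring step above is a two-way softmax over the logits of the `true` and `false` tokens. The sketch below reproduces only that arithmetic with the standard library and shows how the resulting scores could be used to rerank candidates. The function names `relevance_score` and `rerank`, the document ids, and the logit values are hypothetical and are not part of the FollowIR code.

```python
import math


def relevance_score(true_logit: float, false_logit: float) -> float:
    """Probability of "true" under a softmax restricted to the two answer tokens."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(true_logit, false_logit)
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


def rerank(doc_ids, logit_pairs):
    """Sort documents by descending relevance score.

    logit_pairs holds one (true_logit, false_logit) tuple per document.
    """
    scored = [
        (doc_id, relevance_score(t, f)) for doc_id, (t, f) in zip(doc_ids, logit_pairs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up logits for three candidate documents (illustrative values only).
ranking = rerank(["doc-a", "doc-b", "doc-c"], [(-2.0, 4.0), (5.0, -1.0), (0.5, 0.5)])
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

With only two classes, this softmax equals a sigmoid of the difference between the `true` and `false` logits, so the ranking depends only on that difference.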

# Training

We fine-tuned Mistral-7B-Instruct-v0.2 with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to create FollowIR-7B. The training data was first transformed to fit LLaMA-Factory's format: the input is the "query" + "instruction" inside the template, the output is the label, and the instruction field holds the beginning of the template. Training used the following script:
```bash
#!/bin/bash
accelerate launch src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --dataset followIR-train \
    --template mistral \
    --output_dir OUTPUT \
    --finetuning_type lora \
    --lora_target q_proj,v_proj,o_proj,k_proj \
    --overwrite_cache \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_steps 29 \
    --learning_rate 3e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --bf16
```
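
The data transformation described in this section could look roughly like the sketch below. It is an illustration only: the helper name `to_sft_record`, the example record, the way the query and instruction are joined, and the use of alpaca-style `instruction`/`input`/`output` keys are all assumptions, and the authors' actual preprocessing may differ. The template text is copied from the inference example in the Use section.

```python
import json

# Beginning of the prompt template, copied from the inference example.
TEMPLATE_PREFIX = (
    "You are an expert Google searcher, whose job is to determine if the "
    "following document is relevant to the query (true/false). "
    "Answer using only one word, one of those two choices."
)


def to_sft_record(query: str, instruction: str, document: str, relevant: bool) -> dict:
    """Build one supervised fine-tuning record (hypothetical helper)."""
    return {
        # Assumption: the fixed start of the template fills the "instruction" field.
        "instruction": TEMPLATE_PREFIX,
        # Assumption: the query and its narrative instruction are joined by a space.
        "input": (
            f"Query: {query} {instruction}\n"
            f"Document: {document}\n"
            'Relevant (only output one word, either "true" or "false"):'
        ),
        # The label is the single word the model is trained to emit.
        "output": "true" if relevant else "false",
    }


record = to_sft_record(
    query="What movies were directed by James Cameron?",
    instruction="A relevant document would describe any movie that was directed by James Cameron",
    document="Avatar: The Way of Water is a 2022 film directed by James Cameron.",
    relevant=True,
)
print(json.dumps(record, indent=2))
```

Keeping the label as the single word `true` or `false` matches the inference code, which reads the logits of exactly those two tokens at the last position.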

# Citation

```bibtex
@misc{weller2024followir,
      title={FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions},
      author={Orion Weller and Benjamin Chang and Sean MacAvaney and Kyle Lo and Arman Cohan and Benjamin Van Durme and Dawn Lawrie and Luca Soldaini},
      year={2024},
      eprint={2403.15246},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```