---
license: mit
language:
- en
---

# BGE-small-en-v1.5-rag-int8-static

An INT8 quantized version of [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), produced with [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel).

The model can be used with the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API as a standalone model, or as an embedder or ranker module in a [fastRAG](https://github.com/IntelLabs/fastRAG) RAG pipeline.

## Technical details

The model was quantized using post-training static quantization.

|  |  |
|---|:---:|
| Calibration set | [qasper](https://huggingface.co/datasets/allenai/qasper) (50 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel) |
| Backend | `IPEX` |
| Original model | [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) |

Instructions for reproducing the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).
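
For illustration only, a minimal sketch of what the post-training static quantization step could look like with Optimum-Intel's `INCQuantizer`; the preprocessing, dataset field, and save path below are assumptions, so follow the linked fastRAG script for the exact recipe:

``` python
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_name = "BAAI/bge-small-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantizer = INCQuantizer.from_pretrained(model)

# Calibration data: a small sample of qasper passages, tokenized as at inference time.
# The "abstract" field and the preprocessing choices are illustrative assumptions.
calibration_dataset = quantizer.get_calibration_dataset(
    "allenai/qasper",
    num_samples=50,
    dataset_split="train",
    preprocess_function=lambda ex: tokenizer(
        ex["abstract"], padding="max_length", truncation=True, max_length=512
    ),
)

# Static PTQ: activations are calibrated on the dataset above, then weights and
# activations are converted to INT8.
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=calibration_dataset,
    save_directory="bge-small-en-v1.5-rag-int8-static",
)
```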

## Evaluation - MTEB

Model performance on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) *retrieval* and *reranking* tasks.

|  | `INT8` | `FP32` | % diff |
|---|:---:|:---:|:---:|
| Reranking | 0.5826 | 0.5836 | -0.17% |
| Retrieval | 0.5138 | 0.5168 | -0.58% |
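
For reference, a minimal sketch of how such scores could be reproduced with the [`mteb`](https://github.com/embeddings-benchmark/mteb) library; the wrapper class, pooling, and task selection below are illustrative assumptions rather than the exact evaluation script used:

``` python
import torch
import torch.nn.functional as F
from mteb import MTEB
from optimum.intel import IPEXModel
from transformers import AutoTokenizer

# MTEB expects an object exposing encode(sentences, **kwargs) -> vectors.
class CLSEmbedder:
    def __init__(self, model_id):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = IPEXModel.from_pretrained(model_id)

    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch = self.tokenizer(sentences[i:i + batch_size], padding=True,
                                   truncation=True, return_tensors="pt")
            with torch.no_grad():
                out = self.model(**batch)
            # [CLS] vector, L2-normalized as commonly done for BGE models
            embeddings.append(F.normalize(out[0][:, 0], dim=-1))
        return torch.cat(embeddings).numpy()

evaluation = MTEB(task_types=["Retrieval", "Reranking"], task_langs=["en"])
evaluation.run(CLSEmbedder("Intel/bge-small-en-v1.5-rag-int8-static"),
               output_folder="mteb_results")
```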

## Usage

### Using with Optimum-Intel

See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for instructions, or run:

``` sh
pip install -U "optimum[neural-compressor,ipex]" intel-extension-for-transformers
```

Loading a model:

``` python
from optimum.intel import IPEXModel

model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
```

Running inference:

``` python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

sentences = ["This is an example sentence.", "Each sentence is converted to a vector."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
    # get the vector of the [CLS] token as the sentence embedding
    embeddings = outputs[0][:, 0]
```
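
For retrieval-style use, BGE embeddings are typically L2-normalized and compared with a dot product. A minimal follow-up sketch, continuing from the `embeddings` tensor above and treating the first sentence as the query:

``` python
import torch.nn.functional as F

# Normalize so that the dot product equals cosine similarity.
normalized = F.normalize(embeddings, p=2, dim=-1)
query_emb, doc_embs = normalized[0:1], normalized[1:]
scores = query_emb @ doc_embs.T  # higher score = more relevant document
```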

### Using with a fastRAG RAG pipeline

To get started, install [fastRAG](https://github.com/IntelLabs/fastRAG) following the instructions in the repository.

Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline.

``` python
from fastrag.rankers import QuantizedBiEncoderRanker

ranker = QuantizedBiEncoderRanker("Intel/bge-small-en-v1.5-rag-int8-static")
```

and plugging it into a pipeline:

``` python
from haystack import Pipeline

p = Pipeline()
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
```

See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb).