---
license: mit
language:
- en
---
# BGE-small-en-v1.5-rag-int8-static
An INT8 quantized version of [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), produced with [Intel® Neural Compressor](https://github.com/huggingface/optimum-intel) and compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel).
The model can be used with the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API as a standalone model, or as an embedder or ranker module in a [fastRAG](https://github.com/IntelLabs/fastRAG) RAG pipeline.
## Technical details
Quantized using post-training static quantization.
| | |
|---|:---:|
| Calibration set | [qasper](https://huggingface.co/datasets/allenai/qasper) (with 50 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel) |
| Backend | `IPEX` |
| Original model | [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) |
Instructions for reproducing the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).
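For orientation, below is a minimal sketch of what such a post-training static quantization recipe might look like with the Optimum-Intel `INCQuantizer` API. The dataset column name, sequence length, and output directory are illustrative assumptions; refer to the linked script for the exact recipe used for this model.
``` python
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-small-en-v1.5"
model = AutoModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess(examples):
    # tokenize the calibration text (the column name is an assumption)
    return tokenizer(examples["abstract"], padding="max_length", max_length=512, truncation=True)

quantizer = INCQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "allenai/qasper",
    preprocess_function=preprocess,
    num_samples=50,
)
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static", backend="ipex"),
    calibration_dataset=calibration_dataset,
    save_directory="bge-small-en-v1.5-rag-int8-static",
)
```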
## Evaluation - MTEB
Model performance on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) *retrieval* and *reranking* tasks.
| | `INT8` | `FP32` | % diff |
|---|:---:|:---:|:---:|
| Reranking | 0.5826 | 0.5836 | -0.166% |
| Retrieval | 0.5138 | 0.5168 | -0.58% |
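To score the quantized model on a single task yourself, it can be wrapped with the `encode()` interface the `mteb` package expects. A minimal sketch, where the adapter class and the chosen task are illustrative assumptions rather than part of this model card:
``` python
import numpy as np
import torch
import torch.nn.functional as F
from mteb import MTEB
from optimum.intel import IPEXModel
from transformers import AutoTokenizer

model_id = "Intel/bge-small-en-v1.5-rag-int8-static"
model = IPEXModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

class CLSEmbedder:
    """Hypothetical adapter exposing the encode() method MTEB expects."""
    def encode(self, sentences, batch_size=32, **kwargs):
        vectors = []
        for i in range(0, len(sentences), batch_size):
            batch = tokenizer(sentences[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            with torch.no_grad():
                out = model(**batch)
            emb = F.normalize(out[0][:, 0], p=2, dim=1)  # normalized [CLS] vectors
            vectors.append(emb.numpy())
        return np.concatenate(vectors)

evaluation = MTEB(tasks=["SciFact"])  # one of the MTEB retrieval tasks
evaluation.run(CLSEmbedder(), output_folder="mteb_results")
```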
## Usage
### Using with Optimum-Intel
See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for installation instructions, or run:
``` sh
pip install -U "optimum[neural-compressor,ipex]" intel-extension-for-transformers
```
Loading a model:
``` python
from optimum.intel import IPEXModel
model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
```
Running inference:
``` python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

sentences = ["This is an example sentence."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
    # take the vector of the [CLS] token as the sentence embedding
    embedded = outputs[0][:, 0]
```
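BGE embeddings are typically L2-normalized before computing similarity, so that cosine similarity reduces to a plain dot product. A minimal sketch continuing the snippet above (variable names come from that example, not from a fixed API):
``` python
import torch.nn.functional as F

# normalize so that cosine similarity becomes a dot product
embedded = F.normalize(embedded, p=2, dim=1)
scores = embedded @ embedded.T  # pairwise cosine similarities
```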
### Using with a fastRAG RAG pipeline
Get started with installing [fastRAG](https://github.com/IntelLabs/fastRAG) as instructed [here](https://github.com/IntelLabs/fastRAG).
Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline.
``` python
from fastrag.rankers import QuantizedBiEncoderRanker
ranker = QuantizedBiEncoderRanker("Intel/bge-small-en-v1.5-rag-int8-static")
```
and plugging it into a pipeline:
``` python
from haystack import Pipeline
p = Pipeline()
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
```
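The pipeline can then be queried end to end. A minimal sketch, assuming `retriever` above is an existing Haystack v1 retriever node (for example, a `BM25Retriever` over a document store) and that the query text is illustrative:
``` python
results = p.run(
    query="What is post-training static quantization?",
    params={"retriever": {"top_k": 20}, "ranker": {"top_k": 5}},
)
for doc in results["documents"]:
    print(doc.score, doc.content)
```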
See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb).