|
--- |
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# BGE-small-en-v1.5-rag-int8-static |
|
|
|
An INT8 quantized version of [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), quantized with [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel).
|
|
|
The model can be used with the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API as a standalone model, or as an embedder or ranker module in a [fastRAG](https://github.com/IntelLabs/fastRAG) RAG pipeline.
|
|
|
## Technical details |
|
|
|
The model was quantized using post-training static quantization with the following configuration:
|
|
|
| | |
|---|:---:|
| Calibration set | [qasper](https://huggingface.co/datasets/allenai/qasper) (50 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel) |
| Backend | `IPEX` |
| Original model | [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) |
|
|
|
Instructions on how to reproduce the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).
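
For illustration, the snippet below is a minimal sketch of what a post-training static quantization flow can look like with Optimum-Intel's `INCQuantizer`, using qasper as the calibration set. The text field, preprocessing, and output directory are assumptions for this sketch and may differ from the reference script linked above.

``` python
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_name = "BAAI/bge-small-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess(examples):
    # The "abstract" field is used here only for illustration; the actual
    # calibration text field may differ in the reference script.
    return tokenizer(examples["abstract"], padding="max_length", max_length=512, truncation=True)

quantizer = INCQuantizer.from_pretrained(model)

# Build a small calibration set (50 random samples from qasper)
calibration_dataset = quantizer.get_calibration_dataset(
    "allenai/qasper",
    num_samples=50,
    dataset_split="train",
    preprocess_function=preprocess,
)

# Apply post-training static quantization and save the INT8 model
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=calibration_dataset,
    save_directory="bge-small-en-v1.5-int8-static",
)
```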
|
|
|
## Evaluation - MTEB |
|
|
|
Model performance on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) *retrieval* and *reranking* tasks. |
|
|
|
| | `INT8` | `FP32` | % diff |
|---|:---:|:---:|:---:|
| Reranking | 0.5826 | 0.5836 | -0.166% |
| Retrieval | 0.5138 | 0.5168 | -0.58% |
|
|
|
## Usage |
|
|
|
### Using with Optimum-Intel
|
|
|
See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for instructions on how to install, or run:
|
|
|
``` sh |
|
pip install -U "optimum[neural-compressor,ipex]" intel-extension-for-transformers
|
``` |
|
|
|
Loading a model: |
|
|
|
``` python |
|
from optimum.intel import IPEXModel |
|
|
|
model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static") |
|
``` |
|
|
|
Running inference: |
|
|
|
``` python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

# example sentences to embed
sentences = ["This is an example sentence to embed."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
    # get the vector of the [CLS] token
    embedded = outputs[0][:, 0]
```
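
BGE embeddings are typically L2-normalized before computing similarities. As a follow-up to the snippet above, a minimal sketch of comparing the resulting embeddings (not part of the original example) could look like this:

``` python
import torch

# L2-normalize the [CLS] embeddings so that dot products equal cosine similarity
embeddings = torch.nn.functional.normalize(embedded, p=2, dim=1)

# pairwise cosine-similarity matrix between the embedded sentences
scores = embeddings @ embeddings.T
```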
|
|
|
### Using with a fastRAG RAG pipeline |
|
|
|
To get started, install [fastRAG](https://github.com/IntelLabs/fastRAG) as instructed [here](https://github.com/IntelLabs/fastRAG).
|
|
|
Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline.
|
|
|
``` python |
|
from fastrag.rankers import QuantizedBiEncoderRanker |
|
|
|
ranker = QuantizedBiEncoderRanker("Intel/bge-small-en-v1.5-rag-int8-static") |
|
``` |
|
|
|
and plugging it into a pipeline:
|
|
|
``` python
from haystack import Pipeline

p = Pipeline()
# `retriever` is any retriever node defined earlier in your application,
# e.g. a BM25 retriever over a document store
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
```
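
Assuming the retriever is connected to a populated document store, the pipeline can then be queried; the query text below is just a placeholder:

``` python
results = p.run(query="What is post-training static quantization?")
print(results["documents"][:3])  # top re-ranked documents
```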
|
|
|
See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb). |
|
|