|
--- |
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# BGE-small-en-v1.5-rag-int8-static |
|
|
|
An INT8 quantized version of [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), quantized with [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel).
|
|
|
The model can be used with the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API as a standalone model, or as an embedder or ranker module in a [fastRAG](https://github.com/IntelLabs/fastRAG) RAG pipeline.
|
|
|
## Technical details |
|
|
|
The model was quantized using post-training static quantization with the following configuration:
|
|
|
| | |
|---|:---:|
| Calibration set | [qasper](https://huggingface.co/datasets/allenai/qasper) (50 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel) |
| Backend | `IPEX` |
| Original model | [BAAI/BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) |
|
|
|
Instructions on how to reproduce the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).
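
For illustration, the snippet below is a minimal sketch of what a post-training static quantization flow can look like with Optimum-Intel's `INCQuantizer`, using qasper as the calibration set. The text field, preprocessing, and output directory are assumptions for this sketch and may differ from the reference script linked above.

``` python
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_name = "BAAI/bge-small-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess(examples):
    # The "abstract" field is used here only for illustration; the actual
    # calibration text field may differ in the reference script.
    return tokenizer(examples["abstract"], padding="max_length", max_length=512, truncation=True)

quantizer = INCQuantizer.from_pretrained(model)

# Build a small calibration set (50 random samples from qasper)
calibration_dataset = quantizer.get_calibration_dataset(
    "allenai/qasper",
    num_samples=50,
    dataset_split="train",
    preprocess_function=preprocess,
)

# Apply post-training static quantization and save the INT8 model
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=calibration_dataset,
    save_directory="bge-small-en-v1.5-int8-static",
)
```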
|
|
|
## Evaluation - MTEB |
|
|
|
Model performance on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) *retrieval* and *reranking* tasks. |
|
|
|
| | `INT8` | `FP32` | % diff |
|---|:---:|:---:|:---:|
| Reranking | 0.5826 | 0.5836 | -0.166% |
| Retrieval | 0.5138 | 0.5168 | -0.58% |
|
|
|
## Usage |
|
|
|
### Using with Optimum-Intel
|
|
|
See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for instructions on how to install, or run:
|
|
|
``` sh |
|
pip install -U "optimum[neural-compressor,ipex]" intel-extension-for-transformers
|
``` |
|
|
|
Loading a model: |
|
|
|
``` python |
|
from optimum.intel import IPEXModel |
|
|
|
model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static") |
|
``` |
|
|
|
Running inference: |
|
|
|
``` python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

# example sentences to embed
sentences = ["This is an example sentence to embed."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
    # get the vector of the [CLS] token
    embedded = outputs[0][:, 0]
```
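
BGE embeddings are typically L2-normalized before computing similarities. As a follow-up to the snippet above, a minimal sketch of comparing the resulting embeddings (not part of the original example) could look like this:

``` python
import torch

# L2-normalize the [CLS] embeddings so that dot products equal cosine similarity
embeddings = torch.nn.functional.normalize(embedded, p=2, dim=1)

# pairwise cosine-similarity matrix between the embedded sentences
scores = embeddings @ embeddings.T
```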
|
|
|
### Using with a fastRAG RAG pipeline |
|
|
|
To get started, install [fastRAG](https://github.com/IntelLabs/fastRAG) as instructed [here](https://github.com/IntelLabs/fastRAG).
|
|
|
Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline.
|
|
|
``` python |
|
from fastrag.rankers import QuantizedBiEncoderRanker |
|
|
|
ranker = QuantizedBiEncoderRanker("Intel/bge-small-en-v1.5-rag-int8-static") |
|
``` |
|
|
|
and plugging it into a pipeline:
|
|
|
``` python
from haystack import Pipeline

p = Pipeline()
# `retriever` is any retriever node defined earlier in your application,
# e.g. a BM25 retriever over a document store
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
```
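
Assuming the retriever is connected to a populated document store, the pipeline can then be queried; the query text below is just a placeholder:

``` python
results = p.run(query="What is post-training static quantization?")
print(results["documents"][:3])  # top re-ranked documents
```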
|
|
|
See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb). |
|
|