Update README.md

README.md
vdr-2b-multi-v1 is a multilingual embedding model designed for visual document retrieval.

- **Matryoshka Representation Learning**: You can reduce the vector size by 3x and still keep 98% of the embedding quality.
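
In practice this means you can keep just the leading dimensions of an embedding and re-normalize. A minimal sketch of that truncation (the 1536 and 512 sizes below are illustrative assumptions, not read from the model config):

```python
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep only the first `dim` components of an MRL embedding, then re-normalize."""
    return F.normalize(emb[..., :dim], dim=-1)

full = F.normalize(torch.randn(1536), dim=-1)   # stand-in for a real model output
small = truncate_embedding(full, 512)           # 3x smaller vector
```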
# Usage

The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. You can easily run inference and generate embeddings with 768 image patches and a batch size of 16, even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) with HuggingFace Transformers at different batch sizes, with a maximum of 768 image patches.

| Batch Size | GPU Memory (GB) |
|------------|-----------------|
| 4          | 6.9             |
| 8          | 8.8             |
| 16         | 11.5            |
| 32         | 19.7            |
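
If you want to sanity-check these numbers on your own hardware, the sketch below is one hedged way to do it; it assumes the checkpoint loads through the generic `AutoModel` class with `trust_remote_code`, which may differ from the loading code used in the snippets further down:

```python
import torch
from transformers import AutoModel

# Assumption: AutoModel resolves the model's custom class via trust_remote_code.
model = AutoModel.from_pretrained(
    "llamaindex/vdr-2b-multi-v1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GB allocated")
```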

Generating embeddings with vdr-2b-multi-v1 is easier than ever, thanks to the direct SentenceTransformers and LlamaIndex integrations. Get started with just a few lines of code:

<details open>
<summary>
via LlamaIndex
</summary>

```bash
pip install -U llama-index-embeddings-huggingface
```

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="mps",  # Apple Silicon; use "cuda" on NVIDIA GPUs or "cpu" as a fallback
    trust_remote_code=True,
)

embeddings = model.get_image_embedding("image.png")
```
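
Text queries go through the standard LlamaIndex embedding interface. A minimal follow-up, reusing the `model` object above (the query string is just an example):

```python
query_embedding = model.get_query_embedding("What is the projected revenue for 2025?")
```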

</details>

<details>
<summary>
via HuggingFace Transformers
</summary>

…

</details>

<details>
<summary>
via SentenceTransformers
</summary>

…

</details>
# Training
The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
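
For intuition, the DSE-style objective is an InfoNCE contrastive loss: each query is scored against every document page in the batch (plus the hard-mined negatives) and trained to rank its own page first. A minimal sketch of that loss, not the actual training code (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def dse_style_loss(query_emb: torch.Tensor,
                   doc_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss: row i of doc_emb is the positive page for query i;
    any extra rows appended to doc_emb act as hard-mined negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Similarity of every query against every candidate page.
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```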