vdr-2b-multi-v1 is a multilingual embedding model designed for visual document retrieval.

- **Matryoshka Representation Learning**: You can reduce the vector size by 3x and still keep 98% of the embedding quality (see the sketch below).

To learn more about the model, read the [announcement blogpost](https://huggingface.co/blog/marco/vdr-2b-multilingual).
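A minimal sketch of that Matryoshka truncation, assuming a default embedding dimension of 1536; the dimension and the renormalization step are assumptions, not taken from this card:

```python
import numpy as np

# Stand-in for a single L2-normalized embedding from the model
# (1536 dims is an assumption; check the model's actual output size).
full = np.random.randn(1536).astype(np.float32)
full /= np.linalg.norm(full)

# Matryoshka truncation: keep the leading dimensions, then re-normalize
# so cosine similarities stay well-scaled.
k = 512  # 3x smaller, per the claim above
truncated = full[:k] / np.linalg.norm(full[:k])
```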

# Usage
The model uses bf16 tensors and allocates ~4.4 GB of VRAM when loaded. You can easily run inference and generate embeddings with 768 image patches and a batch size of 16 even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) at different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.
| Batch Size | GPU Memory (GB) |
|------------|-----------------|
| 4          | 6.9             |
| 8          | 8.8             |
| 16         | 11.5            |
| 32         | 19.7            |
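If you want to check the footprint on your own hardware, a small helper along these lines works; this is a sketch, and `run_batch` is a placeholder for your own embedding call:

```python
import torch

def peak_vram_gb(run_batch) -> float:
    """Run one embedding batch and report peak GPU memory in GB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_batch()  # e.g. a closure that encodes 16 page images
    return torch.cuda.max_memory_allocated() / 1024**3
```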
Generating embeddings with vdr-2b-multi-v1 is easier than ever thanks to the direct SentenceTransformers and LlamaIndex integrations. Get started with just a few lines of code:
<details open>
<summary>
via LlamaIndex
</summary>

```bash
pip install -U llama-index-embeddings-huggingface
```

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="mps",  # Apple Silicon; use "cuda" (or "cpu") on other hardware
    trust_remote_code=True,
)

embeddings = model.get_image_embedding("image.png")
```

</details>
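Text queries go through LlamaIndex's standard embedding interface; a one-line sketch using the same `model` instance as above (the query string is illustrative):

```python
# get_query_embedding comes from LlamaIndex's base embedding interface.
query_embedding = model.get_query_embedding("total revenue in 2023")
```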
<details>
<summary>
via HuggingFace Transformers
</summary>
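Only the signature of this example's document-encoding helper is shown here; its body (Qwen2-VL preprocessing, DSE-style pooling, and truncation to `dimension`) is left as `...` rather than guessed:

```python
from PIL import Image

# Signature from the original example; `dimension` selects the Matryoshka size.
def encode_documents(documents: list[Image.Image], dimension: int):
    ...
```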
 
</details>
<details>
<summary>
via SentenceTransformers
</summary>
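A minimal sketch, assuming the standard `SentenceTransformer` loading path with remote code enabled; the constructor arguments are assumptions, while the final `encode` call matches the original example:

```python
from sentence_transformers import SentenceTransformer

# Assumed loading call, mirroring the LlamaIndex example above.
model = SentenceTransformer(
    "llamaindex/vdr-2b-multi-v1",
    trust_remote_code=True,
)

# Embed a document page image by path.
embeddings = model.encode("image.png")
```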

</details>

# Training

The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
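For intuition, the contrastive objective behind this kind of setup looks roughly like the following. This is a sketch, not the training code: the temperature value and the use of one hard negative per query are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, d_pos, d_hard, temperature=0.05):
    """InfoNCE over in-batch positives plus hard-mined negatives.

    q:      (B, dim) L2-normalized query embeddings
    d_pos:  (B, dim) L2-normalized positive page embeddings
    d_hard: (B, dim) L2-normalized hard-negative page embeddings
    """
    docs = torch.cat([d_pos, d_hard], dim=0)            # (2B, dim) candidate pool
    logits = q @ docs.t() / temperature                 # (B, 2B) similarity scores
    labels = torch.arange(q.size(0), device=q.device)   # i-th doc matches i-th query
    return F.cross_entropy(logits, labels)
```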