---
library_name: transformers
language:
- en
pipeline_tag: image-feature-extraction
license: apache-2.0
inference: false
---

# nomic-embed-vision-v1: Expanding the Latent Space

`nomic-embed-vision-v1` is a high-performing vision embedding model that shares the same embedding space as [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

All Nomic Embed Text models are now **multimodal**!

| Name                             | ImageNet 0-shot | DataComp (Avg. 38) | MTEB      |
| :-------------------------------:| :-------------- | :----------------- | :-------: |
| `nomic-embed-vision-v1.5`        | **71.0**        | **56.8**           | 62.28     |
| `nomic-embed-vision-v1`          | 70.7            | 56.7               | **62.39** |
| OpenAI CLIP ViT B/16             | 68.3            | 56.3               | 43.82     |
| Jina CLIP v1                     | 59.1            | 52.2               | 60.1      |


## Hosted Inference API

The easiest way to get started with Nomic Embed is through the Nomic Embedding API.

Generating embeddings with the `nomic` Python client is as easy as:

```python
from nomic import embed
import numpy as np

output = embed.image(
    images=[
        "image_path_1.jpeg",
        "image_path_2.png",
    ],
    model='nomic-embed-vision-v1',
)

print(output['usage'])
embeddings = np.array(output['embeddings'])
print(embeddings.shape)
```

For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-vision).

## Data Visualization
Click the Nomic Atlas map below to explore a 100,000-sample subset of CC3M that compares the vision and text embedding spaces!


[![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/aKJogjDQ4BBiYGRIIrFMa.webp)](https://atlas.nomic.ai/data/nomic-multimodal-series/cc3m-100k-image-bytes-v15/map)

## Training Details

We align our vision embedder to the text embedder using a technique similar to [LiT](https://arxiv.org/abs/2111.07991), except that we lock the text embedder instead of the image encoder!

For more details, see the Nomic Embed Vision Technical Report (soon to be released!) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-vision).

Training code is released in the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
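
The alignment objective is not spelled out in this card, so the following is only a rough sketch, assuming training uses a CLIP/LiT-style contrastive (InfoNCE) loss over paired image and text embeddings. The function name, the temperature of 0.07, and the toy vectors are illustrative assumptions, and only the image-to-text direction is shown. With the text embedder locked, the text embeddings act as fixed targets and only the vision embedder would be updated.

```python
import math


def info_nce_image_to_text(image_embs, text_embs, temperature=0.07):
    """Mean cross-entropy of matching image i to its paired text i.

    Both arguments are lists of L2-normalized vectors of equal length;
    row i of each list forms a positive pair. The temperature is an
    assumed value, not one taken from the model's training recipe.
    """
    losses = []
    for i, img in enumerate(image_embs):
        # Cosine similarity of this image to every text in the batch
        logits = [
            sum(a * b for a, b in zip(img, txt)) / temperature
            for txt in text_embs
        ]
        # Numerically stable log-sum-exp over the batch
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy, hand-written embeddings (not model outputs)
text_embs = [[1.0, 0.0], [0.0, 1.0]]       # fixed targets from the locked text tower
aligned_images = [[1.0, 0.0], [0.0, 1.0]]  # each image matches its own caption
swapped_images = [[0.0, 1.0], [1.0, 0.0]]  # each image matches the wrong caption

aligned_loss = info_nce_image_to_text(aligned_images, text_embs)
swapped_loss = info_nce_image_to_text(swapped_images, text_embs)
print(aligned_loss < swapped_loss)
```

The loss is small when each image embedding sits next to its own caption in the text space and large otherwise, which is what pulls the vision embedder into the existing text embedding space.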

## Usage

Remember that `nomic-embed-text` *requires* task prefixes. When using Nomic Embed in multimodal RAG scenarios (e.g. text-to-image retrieval),
use the `search_query: ` prefix for your text queries.
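
As a small illustration of the prefix requirement, the helper below prepends `search_query: ` to raw query strings before they are tokenized. The names `QUERY_PREFIX` and `add_query_prefix` are hypothetical and not part of any Nomic library.

```python
QUERY_PREFIX = "search_query: "


def add_query_prefix(queries):
    """Prepend the retrieval prefix to each query unless it is already present."""
    return [
        q if q.startswith(QUERY_PREFIX) else QUERY_PREFIX + q
        for q in queries
    ]


sentences = add_query_prefix([
    "What are cute animals to cuddle with?",
    "What do cats look like?",
])
print(sentences)
```

Checking for an existing prefix makes the helper safe to call twice on the same list.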


### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import requests

# Load the image processor and the vision encoder (the model ships custom code)
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1", trust_remote_code=True)
vision_model.eval()

# Fetch a sample image from COCO val2017
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    img_emb = vision_model(**inputs).last_hidden_state

# Use the first token of the last hidden state as the image embedding, L2-normalized
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
```

Additionally, you can perform multimodal retrieval!

```python
# Reuses torch, F, AutoTokenizer, AutoModel and img_embeddings from the previous example

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Text queries must carry the `search_query: ` prefix
sentences = ['search_query: What are cute animals to cuddle with?', 'search_query: What do cats look like?']

tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1')
text_model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)
text_model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = text_model(**encoded_input)

text_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
text_embeddings = F.normalize(text_embeddings, p=2, dim=1)

# Cosine similarities, shape (num_images, num_queries)
print(torch.matmul(img_embeddings, text_embeddings.T))
```
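
The matrix printed above has one row per image and one column per query. As a minimal sketch of turning it into retrieval results, the helper below ranks images for each query. The function name `rank_images_for_queries` is hypothetical and the scores are made-up placeholders, not model outputs. In practice you could pass in `torch.matmul(img_embeddings, text_embeddings.T).tolist()`.

```python
def rank_images_for_queries(similarity, top_k=1):
    """Return, for each query, the indices of the top_k most similar images.

    similarity[i][j] is the score between image i and query j.
    """
    num_queries = len(similarity[0]) if similarity else 0
    rankings = []
    for j in range(num_queries):
        # Pair each image's score for query j with the image index
        column = [(row[j], i) for i, row in enumerate(similarity)]
        column.sort(key=lambda pair: pair[0], reverse=True)
        rankings.append([i for _, i in column[:top_k]])
    return rankings


# Placeholder scores for 3 images and 2 queries
scores = [
    [0.10, 0.30],
    [0.20, 0.05],
    [0.40, 0.25],
]
ranking = rank_images_for_queries(scores, top_k=2)
print(ranking)
```

Because both embedding sets are L2-normalized, the scores are cosine similarities, so a higher value means a closer match.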


# Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)