zpn's picture
Update README.md
351ebb2 verified
---
library_name: transformers
language:
- en
pipeline_tag: image-feature-extraction
license: apache-2.0
inference: false
---
# nomic-embed-vision-v1.5: Expanding the Latent Space
`nomic-embed-vision-v1.5` is a high performing vision embedding model that shares the same embedding space as [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5).
All Nomic Embed Text models are now **multimodal**!
| Name | Imagenet 0-shot | Datacomp (Avg. 38) | MTEB |
| :-------------------------------:| :-------------- | :----------------- | :------: |
| `nomic-embed-vision-v1.5` | **71.0** | **56.8** | 62.28 |
| `nomic-embed-vision-v1` | 70.7 | 56.7 | **62.39** |
| OpenAI CLIP ViT B/16 | 68.3 | 56.3 | 43.82 |
| Jina CLIP v1 | 59.1 | 52.2 | 60.1 |
## Hosted Inference API
The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
Generating embeddings with the `nomic` Python client is as easy as
```python
from nomic import embed
import numpy as np
output = embed.image(
images=[
"image_path_1.jpeg",
"image_path_2.png",
],
model='nomic-embed-vision-v1.5',
)
print(output['usage'])
embeddings = np.array(output['embeddings'])
print(embeddings.shape)
```
For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-vision)
## Data Visualization
Click the Nomic Atlas map below to visualize a 100,000 sample CC3M comparing the Vision and Text Embedding Space!
[![image/webp](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F607997c83a565c15675055b3%2FaKJogjDQ4BBiYGRIIrFMa.webp%3C%2Fspan%3E)%5D(%3Cspan class="hljs-link">https://atlas.nomic.ai/data/nomic-multimodal-series/cc3m-100k-image-bytes-v15/map)
## Training Details
We align our vision embedder to the text embedding by employing a technique similar to [LiT](https://arxiv.org/abs/2111.07991) but instead lock the text embedder!
For more details, see the Nomic Embed Vision Technical Report (soon to be released!) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-vision)
Training code is released in the `contrastors` [repository](https://github.com/nomic-ai/contrastors)
## Usage
Remember `nomic-embed-text` *requires* prefixes and so, when using Nomic Embed in multimodal RAG scenarios (e.g. text to image retrieval),
you should use the `search_query: ` prefix.
### Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import requests
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, return_tensors="pt")
img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
```
Additionally, you can perform multimodal retrieval!
```python
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ['search_query: What are cute animals to cuddle with?', 'search_query: What do cats look like?']
tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1.5')
text_model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
text_model.eval()
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
model_output = text_model(**encoded_input)
text_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
text_embeddings = F.layer_norm(text_embeddings, normalized_shape=(text_embeddings.shape[1],))
text_embeddings = F.normalize(text_embeddings, p=2, dim=1)
print(torch.matmul(img_embeddings, text_embeddings.T))
```
# Join the Nomic Community
- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)