truncate_dim=N returns non-normalized embeddings

#60
by ivanstepanovftw - opened

Please help. I am using jina-embeddings-v3 with the pgvector extension for Postgres so that I can search for passages by cosine similarity. I noticed that model.encode(..., truncate_dim=768) returns non-normalized embeddings. Should I normalize the embeddings before adding them to the database so that I can later run a cosine distance search? Or, when using truncate_dim, should I avoid cosine distance and use inner product, L2, etc. instead?
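On the cosine-versus-inner-product part of the question: standard cosine similarity divides by both vector norms, so it should not depend on whether the stored vectors are unit length, while the inner product does. A minimal pure-Python sketch, assuming the database computes standard cosine distance; the vectors are made up for illustration and are not model output:

```python
import math

def dot(a, b):
    # Plain inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Inner product divided by the product of the two L2 norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical vectors, not real embeddings.
query = [0.6, 0.8, 0.0]
passage = [0.3, 0.4, 0.5]
scaled_passage = [2.5 * x for x in passage]  # same direction, different length

# Cosine similarity ignores the rescaling; the inner product does not.
same_cosine = math.isclose(
    cosine_similarity(query, passage),
    cosine_similarity(query, scaled_passage),
)
same_inner = math.isclose(dot(query, passage), dot(query, scaled_passage))
```

Under that assumption, non-normalized embeddings change inner-product and L2 rankings, but not cosine rankings.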

Jina AI org

Seems like a bug in our encode function, checking

  • Maybe it would be better to normalize only once, when the embeddings are calculated
  • We could also add another option for backwards compatibility (not sure how to handle it in the API)

Hey, I'm experiencing the same issue. There's of course an easy workaround. @ivanstepanovftw you could just:

embeddings = F.normalize(embeddings, p=2, dim=1)
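To see why the workaround is needed, here is a minimal pure-Python sketch of what truncation does to a unit vector. The helper l2_normalize is a stand-in written for this example; it follows the same formula as F.normalize with p=2 (divide by the norm, clamped below by a small eps), applied to a single row. The input values are made up:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def l2_normalize(v, eps=1e-12):
    # v / max(||v||_2, eps), the formula F.normalize uses for p=2.
    n = max(l2_norm(v), eps)
    return [x / n for x in v]

# Hypothetical 4-dimensional "embedding", normalized to unit length.
full = l2_normalize([3.0, 4.0, 12.0, 84.0])

# Keeping only the first 2 dimensions drops part of the length,
# so the truncated vector is shorter than 1.
truncated = full[:2]

# Normalizing again after truncation restores unit length.
renormalized = l2_normalize(truncated)
```

The order matters here: normalizing before truncation leaves a vector whose norm is below 1, so the normalization has to come after the slice.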

However, the direct model interface does not accept truncate_dim (tested with both torch 2.3.1+cu121 and flash-attn 2.6.3).
Following the example in the average pooling section of the model card:

[...]
adapter_mask = ...
truncate_dim = 32
model_output = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    adapter_mask=adapter_mask, # This one works as in the example
    truncate_dim=truncate_dim # This one raises Flash attention implementation does not support kwargs: truncate_dim
)
embeddings_1 = mean_pooling(model_output, attention_mask)  # Normalization happens inside mean_pooling
embeddings_1 = embeddings_1.detach().cpu().float().numpy()  # 1024-dimensional but normalized
embeddings_2 = model.encode(texts, task=task, truncate_dim=truncate_dim)  # 32-dim, not normalized

One could normalize embeddings_2, but I wonder how truncate_dim can be passed to a direct model call.

Would something like the following work?

embeddings = torch.nn.functional.layer_norm(
    embeddings,
    normalized_shape=(embeddings.shape[1],)
)
embeddings = embeddings[:, :truncate_dim]

It is not very clean, but I wonder whether this is how it is done under the hood in your system as well... yeah, you do
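One caveat about the layer_norm snippet: layer normalization without weight or bias standardizes each row to zero mean and unit variance, which is not the same as scaling it to unit L2 length. A minimal pure-Python sketch of that difference, assuming the usual definition (biased variance, eps=1e-5) and made-up input values; it says nothing about what the model does internally:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def layer_norm(v, eps=1e-5):
    # Zero mean, unit (biased) variance per row, no learned weight or bias.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

# Hypothetical 8-dimensional row, not real model output.
vec = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.5]

normed = layer_norm(vec)   # L2 norm is about sqrt(8), not 1
truncated = normed[:4]     # still not unit length after slicing

# An explicit L2 normalization is still needed for unit vectors.
norm = l2_norm(truncated)
unit = [x / norm for x in truncated]
```

So if unit-length vectors are the goal, a layer_norm followed by a slice would still need an L2 normalization as the final step.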

Thank you for releasing the model and giving community support.

Thanks for reporting the issue @ivanstepanovftw, it is fixed now. Anyone who wants to keep using the previous version can specify the code revision when loading the model: code_revision='da863dd04a4e5dce6814c6625adfba87b83838aa'. I added a disclaimer to the README as well. I'm closing this issue.

jupyterjazz changed discussion status to closed

I am not sure if this issue is related to the above discussion. I initialized the model normally using
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True) as well as model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", code_revision='da863dd04a4e5dce6814c6625adfba87b83838aa', trust_remote_code=True).
But in both cases I end up with embeddings of a higher dimension than expected. I expected an embedding dimension of 512 and set truncate_dim=512 in the encode function, but it still gives me the same embeddings, although the length of the embedding varies every time I try it. Please confirm whether there is any other change I have to make.

Jina AI org

model.encode('text', truncate_dim=32) seems to be working as expected in both cases (with and without code_revision)

I don't know what could be causing this behavior on your side. Can you share the full code snippet? Maybe transformers version as well

It got resolved!
