Accept tokens instead of strings, and a question regarding tokenizer behaviour
Hi all,
I usually prefer passing tokens directly to an embedding model (it makes more sense to me, as the max sequence length is expressed in tokens, not in string length). It looks to me like this is not possible with the .encode() method you provide.
Attempting to do so, I noticed that the tokenizer is calling .strip() and .lower(). Is the model oblivious to capital letters? Did you quantify the impact of doing so when capital letters are used, e.g. for 'NY' (New York) vs 'ny'? Could you see an impact on the ability to distinguish between proper names and other words?
Anyway, if anyone else is looking to feed tokens instead of strings, here is some sample code.
Cheers!
from transformers import AutoModel, AutoTokenizer
import torch
def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
print(type(tokenizer))
input = ["Hello, world!", "How are you today?"]
# Follow transformation applied in custom_st.py
input = [s.strip() for s in input]
input = [s.lower() for s in input]
batch_tokenized = tokenizer(input, return_tensors='pt', padding=True, truncation="longest_first",)
print(batch_tokenized)
print(type(batch_tokenized))
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
print(type(model))
embs = model(**batch_tokenized)[0]
embs = mean_pooling(embs, batch_tokenized['attention_mask'])
print(type(embs))
print(embs)
print("----")
embs2 = model.encode(input, normalize_embeddings=False)
print(type(embs2))
print(embs2)
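To make the pooling step above concrete, here is a minimal pure-Python sketch of the same masked mean, using plain lists instead of tensors. The function name `masked_mean_pool` and the toy numbers are illustrative assumptions, not part of the model's code.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1; positions with mask 0 (padding) are ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoids dividing by zero for an all-padding row.
    count = max(float(sum(attention_mask)), 1e-9)
    return [total / count for total in totals]


# Three positions; the last one is padding and must not affect the result.
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(embeddings, mask)
print(pooled)  # [2.0, 3.0]
```

The padded position is excluded from both the sum and the count, which is why the attention mask has to be passed along with the token embeddings.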
Hi @MH1P,
"Attempting to do so, I noticed that the tokenizer is calling .strip() and .lower(). Is the model oblivious to capital letters?"
No, our tokenizer is case sensitive. Where did you notice that behavior?
Here's an example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
txt = 'New York'
print(tokenizer(txt))
print(tokenizer(txt.lower()))
Output:
{'input_ids': [0, 2356, 5753, 2], 'attention_mask': [1, 1, 1, 1]}
{'input_ids': [0, 3525, 70662, 92, 2], 'attention_mask': [1, 1, 1, 1, 1]}
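The comparison above can be wrapped in a small reusable check. This is only a sketch: `is_case_sensitive` is a hypothetical helper, and the two toy tokenizers and their vocabulary are stand-ins so the snippet runs without downloading a model.

```python
from typing import Callable, List


def is_case_sensitive(tokenize: Callable[[str], List[int]], text: str) -> bool:
    """Return True if lowercasing `text` changes the token ids produced by `tokenize`."""
    return tokenize(text) != tokenize(text.lower())


# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
TOY_VOCAB = {"New": 10, "York": 11, "new": 20, "york": 21}


def toy_cased(text: str) -> List[int]:
    # Looks words up as written, so "New" and "new" get different ids.
    return [TOY_VOCAB.get(word, 0) for word in text.split()]


def toy_uncased(text: str) -> List[int]:
    # Lowercases first, so casing information is lost before the lookup.
    return [TOY_VOCAB.get(word, 0) for word in text.lower().split()]


print(is_case_sensitive(toy_cased, "New York"))    # True
print(is_case_sensitive(toy_uncased, "New York"))  # False
```

With the real tokenizer, the callable could be `lambda s: tokenizer(s)["input_ids"]`; given the two outputs shown above, the check would report the tokenizer as case sensitive for 'New York'.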
Also, while your code looks good, it doesn't make use of the LoRA adapters, which are a very important part of jina-embeddings-v3. I recommend taking a look at our implementation of the encode() function.
Hi, and thank you for your quick answer!
My apologies, it seems that I had an uncleared state in my notebook and then got confused in my exploration of the code with https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/custom_st.py (line :
# strip
to_tokenize = [[str(s).strip() for s in col] for col in to_tokenize]
# Lowercase
if self.do_lower_case:
    to_tokenize = [[s.lower() for s in col] for col in to_tokenize]
So what is the use of custom_st.py?
Here's updated comparison code, which indeed shows the same results.
from transformers import AutoModel, AutoTokenizer
import torch
def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
def tokenize_model(input):
    batch_tokenized = tokenizer(input, return_tensors='pt', padding=True, truncation="longest_first")
    embs = model(**batch_tokenized)[0]
    embs = mean_pooling(embs, batch_tokenized['attention_mask'])
    return embs.detach().numpy()
def direct_model(input):
    return model.encode(input, normalize_embeddings=False)
input = ["Hello, world!", "How are you today?"]
print(tokenize_model(input))
print(direct_model(input))
exit(0)
Going back to the root of my exploration: is there a reason not to allow a sequence of tokens (or a batch of sequences) as input to the encode method?
Thank you!
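For anyone feeding pre-tokenized input to the model, the main thing the tokenizer call otherwise handles is padding the batch and building the attention mask. Below is a minimal sketch of that step in plain Python. `pad_token_batch` is a hypothetical helper, and the pad id of 1 is an assumption for illustration; in practice it should be read from `tokenizer.pad_token_id`.

```python
from typing import Dict, List


def pad_token_batch(batch: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token id sequences to a common length and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding that pooling should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids taken from the tokenizer output quoted earlier in this thread.
batch = [[0, 2356, 5753, 2], [0, 3525, 70662, 92, 2]]
# Assumption: pad id 1; read the real value from tokenizer.pad_token_id.
padded = pad_token_batch(batch, pad_token_id=1)
print(padded["input_ids"])       # [[0, 2356, 5753, 2, 1], [0, 3525, 70662, 92, 2]]
print(padded["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

Wrapping both lists in `torch.tensor(...)` gives inputs in the same shape as `batch_tokenized` in `tokenize_model` above. Truncation to the model's maximum length is not handled here and would need to happen before padding.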