Geneformer doesn't use all tokens in provided token dicts

#472
by avelezarce - opened

max(tokenizer.gene_token_dict.values()), min(tokenizer.gene_token_dict.values()), len(set(tokenizer.gene_token_dict.values()))
(25425, 0, 25426)
^ this is from loading the token dictionaries provided in the Geneformer HF and GitHub codebases

However, as can be seen in the config, Geneformer only uses 20275. Even if the config is changed to use 25425, the pretrained model still expects token IDs < 20275 and produces an error.

Is this expected? Is there a model developed and trained to use all of the tokens? Are we expected to mask away token IDs of 20275 and above? Some guidance for folks not using the geneformer package would be very beneficial.

We are opting to mask away the unexpected tokens for now, but we're wondering whether this is expected, whether a different solution would be better, or whether there is a model that uses all the tokens. Thank you!
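
For reference, the masking workaround mentioned above amounts to dropping any token ID at or above the checkpoint's vocab size. A rough sketch, not an official recommendation; the repo ID and `tokenized_cells` are illustrative placeholders:

```python
# Rough sketch of the masking workaround: drop any token ID the checkpoint's
# embedding table cannot index. `tokenized_cells` is a placeholder for the
# tokenized output; the repo ID points at the main Geneformer checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ctheodoris/Geneformer")
vocab_size = config.vocab_size  # e.g. 20275 per the 95M config referenced in this thread

tokenized_cells = [[5, 17, 20300, 12], [3, 25000, 8]]  # placeholder token IDs

filtered_cells = [
    [tok for tok in cell if tok < vocab_size]  # keep only IDs the model can embed
    for cell in tokenized_cells
]
print(filtered_cells)  # out-of-range IDs (20300, 25000) are removed
```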

Thanks for your question. Please use the 95M dictionary with the 95M model and the 30M dictionary with the 30M model. This should resolve the issue.
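
A quick way to tell which dictionary is loaded is to check its length against the sizes quoted in this thread; the pickle path below is a placeholder for the file downloaded from the matching repo:

```python
# Identify which token dictionary you actually loaded, using the sizes
# quoted in this thread: 20275 -> 95M dictionary, 25426 -> 30M dictionary.
# The pickle path is a placeholder for the file downloaded from the model repo.
import pickle

with open("token_dictionary.pkl", "rb") as f:  # placeholder path
    gene_token_dict = pickle.load(f)

sizes = {20275: "95M dictionary", 25426: "30M dictionary"}
label = sizes.get(len(gene_token_dict), f"unexpected size {len(gene_token_dict)}")
print(label)
```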

ctheodoris changed discussion status to closed

The sizes I've included are the same in both the 95M and 30M.

You can also see the config size mismatch with both of your models: both the 95M and 30M have 20275.

The config for the 95m model is here, and has a vocab of 20275.
The config for the 30m model is here and has a vocab of 25426.

The dictionary for the 95m model is here, and has a vocab of 20275.
The dictionary for the 30m model is here, and has a vocab of 25426.

As long as you pair the correct dictionary with the correct model, it should resolve this issue.
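
One way to enforce that pairing programmatically is to compare the dictionary length against the checkpoint's configured vocab size before tokenizing; the repo ID and dictionary path below are illustrative:

```python
# Sanity check that the token dictionary and checkpoint are a matching pair.
# Repo ID and dictionary path are illustrative; substitute the 30M or 95M
# artifacts you actually downloaded.
import pickle
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ctheodoris/Geneformer")

with open("token_dictionary.pkl", "rb") as f:
    gene_token_dict = pickle.load(f)

if len(gene_token_dict) != config.vocab_size:
    raise ValueError(
        f"dictionary size {len(gene_token_dict)} != model vocab_size "
        f"{config.vocab_size}; the 30M and 95M artifacts may be mixed"
    )
```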

Thank you very much. I can confirm the issue was with validation on our end. I was indeed incorrect regarding the dictionary of the 95M model, which does have a vocab size of 20275. We do have the correct 95M pairing and it does work. Not sure why there was a user issue; maybe versioning of our package or the HF repo. Sorry!
