Geneformer doesn't use all tokens in provided token dicts

#472
by avelezarce - opened

max(tokenizer.gene_token_dict.values()), min(tokenizer.gene_token_dict.values()), len(set(tokenizer.gene_token_dict.values()))
(25425, 0, 25426)
^ this is from loading the token dictionaries provided in the Geneformer HF and GitHub codebases

However, as can be seen in the config, Geneformer only uses 20275. Even if the config is changed to use 25425, the pretrained model still expects token IDs < 20275 and produces an error.

Is this expected? Is there a model developed and trained to use all of the tokens? Are we expected to mask away token IDs of 20275 and above? Some guidance for folks not using the geneformer package would be very beneficial.

We are opting to mask away the unexpected tokens for now, but we're wondering whether this is expected, whether a different solution would be better, or whether there is a model that uses all the tokens. Thank you!
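
For reference, the masking workaround mentioned above amounts to dropping any token ID at or above the checkpoint's vocab size. A rough sketch, not an official recommendation; the repo ID and `tokenized_cells` are illustrative placeholders:

```python
# Rough sketch of the masking workaround: drop any token ID the checkpoint's
# embedding table cannot index. `tokenized_cells` is a placeholder for the
# tokenized output; the repo ID points at the main Geneformer checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ctheodoris/Geneformer")
vocab_size = config.vocab_size  # e.g. 20275 per the 95M config referenced in this thread

tokenized_cells = [[5, 17, 20300, 12], [3, 25000, 8]]  # placeholder token IDs

filtered_cells = [
    [tok for tok in cell if tok < vocab_size]  # keep only IDs the model can embed
    for cell in tokenized_cells
]
print(filtered_cells)  # out-of-range IDs (20300, 25000) are removed
```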

Thanks for your question. Please use the 95M dictionary with the 95M model and the 30M dictionary with the 30M model. This should resolve the issue.
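
A quick way to tell which dictionary is loaded is to check its length against the sizes quoted in this thread; the pickle path below is a placeholder for the file downloaded from the matching repo:

```python
# Identify which token dictionary you actually loaded, using the sizes
# quoted in this thread: 20275 -> 95M dictionary, 25426 -> 30M dictionary.
# The pickle path is a placeholder for the file downloaded from the model repo.
import pickle

with open("token_dictionary.pkl", "rb") as f:  # placeholder path
    gene_token_dict = pickle.load(f)

sizes = {20275: "95M dictionary", 25426: "30M dictionary"}
label = sizes.get(len(gene_token_dict), f"unexpected size {len(gene_token_dict)}")
print(label)
```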

ctheodoris changed discussion status to closed

The sizes I've included are the same in both the 95M and 30M.

You can also see the config size mismatch with both of your models: both the 95M and 30M have 20275.

The config for the 95m model is here, and has a vocab of 20275.
The config for the 30m model is here and has a vocab of 25426.

The dictionary for the 95m model is here, and has a vocab of 20275.
The dictionary for the 30m model is here, and has a vocab of 25426.

As long as you pair the correct dictionary with the correct model, it should resolve this issue.
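
One way to enforce that pairing programmatically is to compare the dictionary length against the checkpoint's configured vocab size before tokenizing; the repo ID and dictionary path below are illustrative:

```python
# Sanity check that the token dictionary and checkpoint are a matching pair.
# Repo ID and dictionary path are illustrative; substitute the 30M or 95M
# artifacts you actually downloaded.
import pickle
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ctheodoris/Geneformer")

with open("token_dictionary.pkl", "rb") as f:
    gene_token_dict = pickle.load(f)

if len(gene_token_dict) != config.vocab_size:
    raise ValueError(
        f"dictionary size {len(gene_token_dict)} != model vocab_size "
        f"{config.vocab_size}; the 30M and 95M artifacts may be mixed"
    )
```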

Thank you very much. I can confirm the issue was with validation on our end. I was indeed incorrect regarding the dictionary of the 95M model, which does have a vocab size of 20275. We do have the correct 95M pairing and it does work. Not sure why there was a user issue; maybe versioning of our package or the HF repo. Sorry!
