A BarcodeBERT model trained on all complete DNA sequences from the latest BOLD database release. Training sequences come from the 'nucraw' column and were preprocessed following the steps outlined in the BarcodeBERT paper.

The model was trained for a total of 17 epochs.

Example Usage

from transformers import PreTrainedTokenizerFast, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")

# The DNA sequence you want to embed.
# Split it into non-overlapping 4-mers: put a space after every 4 characters.
# The sequence may contain characters other than A, C, G, T (e.g. unknown bases or gaps).
# The sequence should be at most 660 characters long, not counting spaces.
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."

inputs = tokenizer(dna_sequence, return_tensors="pt")

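# Obtain a DNA embedding, which is a vector of length 768.
# The embedding is a representation of this DNA sequence in the model's latent space.
# Hidden states are only returned when output_hidden_states=True is passed;
# the last layer is then averaged over the token dimension.
embedding = model(**inputs, output_hidden_states=True).hidden_states[-1].mean(1).squeeze()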
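
The input format described in the comments above (a space after every 4 characters, at most 660 characters excluding spaces) can be produced from a raw sequence with a small helper. The following is a minimal sketch, not the preprocessing used in the BarcodeBERT paper: the function name `format_barcode` is hypothetical, and uppercasing the input, truncating (rather than rejecting) long sequences, and keeping a shorter final chunk are all assumptions.

```python
def format_barcode(sequence: str, k: int = 4, max_length: int = 660) -> str:
    """Return `sequence` as space-separated, non-overlapping k-mers.

    Hypothetical helper: uppercasing and truncation are assumptions,
    not steps confirmed by the model card.
    """
    # Remove any existing whitespace and normalise case.
    # Non-ACGT characters such as '-' or 'N' are kept unchanged.
    cleaned = "".join(sequence.split()).upper()
    # Truncate to the maximum length stated in the usage comments.
    cleaned = cleaned[:max_length]
    # Cut into consecutive chunks of k characters; the last chunk may be shorter.
    chunks = [cleaned[i:i + k] for i in range(0, len(cleaned), k)]
    return " ".join(chunks)


print(format_barcode("aacaatgtattta-t-ttcg"))
# AACA ATGT ATTT A-T- TTCG
```

The returned string can be passed to the tokenizer in place of `dna_sequence`.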

Results

[Figure: results image]

Model size: 86.2M params (Safetensors, tensor type F32)