kuleshov-group/PlantCaduceus_l24

Jingjing Zhai committed on May 20, 2024

Commit d62653f

1 Parent(s): e6b01d1

Brief description of PlantCaduceus

Files changed (1)

README.md CHANGED

@@ -1,3 +1,33 @@
 ---
 license: apache-2.0
 ---

+## Model Overview
+PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. It uses the Caduceus architecture and a masked language modeling objective, and its training data covers genomic sequences from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models of varying sizes:
+- **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
+- **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
+- **PlantCaduceus_l28**: 28 layers, 768 hidden size, 112M parameters
+- **PlantCaduceus_l32**: 32 layers, 1024 hidden size, 225M parameters
+## How to use
+```python
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+# Load the model and tokenizer; trust_remote_code=True lets transformers run the model code shipped in the repository
+model_path = 'maize-genetics/PlantCaduceus_l24'
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
+model.eval()
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+# Tokenize a DNA sequence into input IDs
+sequence = "ATGCGTACGATCGTAG"
+encoding = tokenizer.encode_plus(
+    sequence,
+    return_tensors="pt",
+    return_attention_mask=False,
+    return_token_type_ids=False,
+)
+input_ids = encoding["input_ids"].to(device)
+
+# Run a forward pass without gradient tracking and request the per-layer hidden states
+with torch.inference_mode():
+    outputs = model(input_ids=input_ids, output_hidden_states=True)
+```
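
The snippet above requests hidden states but stops before using them. As a minimal sketch of one common next step, the helper below averages the final layer's per-token vectors into a single vector per sequence. It is not part of this commit: `mean_pool_last_hidden` is a hypothetical name, and the sketch assumes that `outputs.hidden_states` is a tuple of tensors shaped `(batch, seq_len, hidden)`, as is usual for `transformers` model outputs. Random tensors stand in for real model outputs so the block runs without downloading the model, and the sizes used are arbitrary.

```python
import torch


def mean_pool_last_hidden(hidden_states, attention_mask=None):
    """Average the final layer's token vectors into one vector per sequence.

    hidden_states: tuple of (batch, seq_len, hidden) tensors (assumed layout).
    attention_mask: optional (batch, seq_len) tensor of 0/1 marking real tokens.
    """
    last = hidden_states[-1]
    if attention_mask is None:
        # No padding information: plain mean over the sequence axis.
        return last.mean(dim=1)
    # Zero out padded positions, then divide by the number of real tokens.
    mask = attention_mask.unsqueeze(-1).to(last.dtype)
    return (last * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


# Stand-in for outputs.hidden_states; shapes are illustrative only.
dummy_hidden_states = tuple(torch.randn(1, 16, 8) for _ in range(3))
embedding = mean_pool_last_hidden(dummy_hidden_states)
print(embedding.shape)
```

With a loaded model, the call would presumably be `mean_pool_last_hidden(outputs.hidden_states)`. Whether mean pooling suits a given downstream task is a design choice to validate rather than a recommendation from the model authors.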