ZhiyuanChen
committed on
Update README.md
README.md
CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
|
|
10 |
pipeline_tag: fill-mask
|
11 |
mask_token: "<mask>"
|
12 |
widget:
|
13 |
- example_title: "microRNA-21"
|
14 |
text: "UAGC<mask>UAUCAGACUGAUGUUGA"
|
15 |
output:
|
@@ -68,7 +81,7 @@ RNA-MSM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style
|
|
68 |
- **Paper**: [Multiple sequence alignment-based RNA language model and its application to structural inference](https://doi.org/10.1093/nar/gkad1031)
|
69 |
- **Developed by**: Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, Fan Xu, Thomas Litfin, Ke Chen, Jaswinder Singh, Xiansong Huang, Guoli Song, Yonghong Tian, Jian Zhan, Jie Chen, Yaoqi Zhou
|
70 |
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [MSA](https://doi.org/10.1101/2021.02.12.430858)
|
71 |
-
- **Original Repository**: [
|
72 |
|
73 |
## Usage
|
74 |
|
@@ -85,29 +98,29 @@ You can use this model directly with a pipeline for masked language modeling:
|
|
85 |
```python
|
86 |
>>> import multimolecule # you must import multimolecule to register models
|
87 |
>>> from transformers import pipeline
|
88 |
-
>>> unmasker = pipeline(
|
89 |
-
>>> unmasker("
|
90 |
|
91 |
-
[{'score': 0.
|
92 |
'token': 9,
|
93 |
'token_str': 'U',
|
94 |
-
'sequence': 'U
|
95 |
-
{'score': 0.
|
96 |
'token': 15,
|
97 |
'token_str': 'K',
|
98 |
-
'sequence': '
|
99 |
-
{'score': 0.
|
100 |
-
'token': 8,
|
101 |
-
'token_str': 'G',
|
102 |
-
'sequence': 'U A G C G U A U C A G A C U G A U G U U G A'},
|
103 |
-
{'score': 0.06814167648553848,
|
104 |
'token': 18,
|
105 |
'token_str': 'D',
|
106 |
-
'sequence': '
|
107 |
-
{'score': 0.
|
108 |
-
'token':
|
109 |
-
'token_str': '
|
110 |
-
'sequence': 'U A
|
111 |
```
|
112 |
|
113 |
### Downstream Use
|
@@ -120,11 +133,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
|
|
120 |
from multimolecule import RnaTokenizer, RnaMsmModel
|
121 |
|
122 |
|
123 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
124 |
-
model = RnaMsmModel.from_pretrained(
|
125 |
|
126 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
127 |
-
input = tokenizer(text, return_tensors=
|
128 |
|
129 |
output = model(**input)
|
130 |
```
|
@@ -140,17 +153,17 @@ import torch
|
|
140 |
from multimolecule import RnaTokenizer, RnaMsmForSequencePrediction
|
141 |
|
142 |
|
143 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
144 |
-
model = RnaMsmForSequencePrediction.from_pretrained(
|
145 |
|
146 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
147 |
-
input = tokenizer(text, return_tensors=
|
148 |
label = torch.tensor([1])
|
149 |
|
150 |
output = model(**input, labels=label)
|
151 |
```
|
152 |
|
153 |
-
####
|
154 |
|
155 |
**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
|
156 |
|
@@ -158,14 +171,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
|
|
158 |
|
159 |
```python
|
160 |
import torch
|
161 |
-
from multimolecule import RnaTokenizer,
|
162 |
|
163 |
|
164 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
165 |
-
model = RnaMsmForNucleotidPrediction.from_pretrained(
|
166 |
|
167 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
168 |
-
input = tokenizer(text, return_tensors=
|
169 |
label = torch.randint(2, (len(text), ))
|
170 |
|
171 |
output = model(**input, labels=label)
|
@@ -182,11 +195,11 @@ import torch
|
|
182 |
from multimolecule import RnaTokenizer, RnaMsmForContactPrediction
|
183 |
|
184 |
|
185 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
186 |
-
model = RnaMsmForContactPrediction.from_pretrained(
|
187 |
|
188 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
189 |
-
input = tokenizer(text, return_tensors=
|
190 |
label = torch.randint(2, (len(text), len(text)))
|
191 |
|
192 |
output = model(**input, labels=label)
|
|
|
10 |
pipeline_tag: fill-mask
|
11 |
mask_token: "<mask>"
|
12 |
widget:
|
13 |
+
- example_title: "HIV-1"
|
14 |
+
text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
|
15 |
+
output:
|
16 |
+
- label: "U"
|
17 |
+
score: 0.25111356377601624
|
18 |
+
- label: "W"
|
19 |
+
score: 0.1200353354215622
|
20 |
+
- label: "K"
|
21 |
+
score: 0.10132723301649094
|
22 |
+
- label: "D"
|
23 |
+
score: 0.08383019268512726
|
24 |
+
- label: "A"
|
25 |
+
score: 0.05737845227122307
|
26 |
- example_title: "microRNA-21"
|
27 |
text: "UAGC<mask>UAUCAGACUGAUGUUGA"
|
28 |
output:
|
|
|
81 |
- **Paper**: [Multiple sequence alignment-based RNA language model and its application to structural inference](https://doi.org/10.1093/nar/gkad1031)
|
82 |
- **Developed by**: Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, Fan Xu, Thomas Litfin, Ke Chen, Jaswinder Singh, Xiansong Huang, Guoli Song, Yonghong Tian, Jian Zhan, Jie Chen, Yaoqi Zhou
|
83 |
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [MSA](https://doi.org/10.1101/2021.02.12.430858)
|
84 |
+
- **Original Repository**: [yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM)
|
85 |
|
86 |
## Usage
|
87 |
|
|
|
98 |
```python
|
99 |
>>> import multimolecule # you must import multimolecule to register models
|
100 |
>>> from transformers import pipeline
|
101 |
+
>>> unmasker = pipeline("fill-mask", model="multimolecule/rnamsm")
|
102 |
+
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
|
103 |
|
104 |
+
[{'score': 0.25111356377601624,
|
105 |
'token': 9,
|
106 |
'token_str': 'U',
|
107 |
+
'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
108 |
+
{'score': 0.1200353354215622,
|
109 |
+
'token': 14,
|
110 |
+
'token_str': 'W',
|
111 |
+
'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
112 |
+
{'score': 0.10132723301649094,
|
113 |
'token': 15,
|
114 |
'token_str': 'K',
|
115 |
+
'sequence': 'G G U C K C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
116 |
+
{'score': 0.08383019268512726,
|
117 |
'token': 18,
|
118 |
'token_str': 'D',
|
119 |
+
'sequence': 'G G U C D C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
120 |
+
{'score': 0.05737845227122307,
|
121 |
+
'token': 6,
|
122 |
+
'token_str': 'A',
|
123 |
+
'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]
|
124 |
```
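
The pipeline output above includes `token_str` values such as `W`, `K`, and `D` alongside plain bases. These look like standard IUPAC nucleotide ambiguity codes, though the source does not say so explicitly. The following is a minimal sketch, under that assumption, of how a prediction could be expanded into concrete bases and substituted back into the masked input. The helper names `expand_prediction` and `fill_mask` are hypothetical and not part of `multimolecule` or `transformers`.

```python
# Standard IUPAC nucleotide ambiguity codes, written for RNA (U instead of T).
# Assumption: the model's predicted symbols follow this convention.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU",
    "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}


def expand_prediction(token_str):
    """Return the sorted list of concrete bases a predicted symbol may stand for."""
    return sorted(IUPAC_RNA[token_str.upper()])


def fill_mask(text, token_str, mask_token="<mask>"):
    """Hypothetical helper: put the prediction in place of the first mask token.

    The result is upper-cased, matching the upper-case sequences in the
    pipeline output even when the input is lower-case.
    """
    return text.replace(mask_token, token_str, 1).upper()


if __name__ == "__main__":
    masked = "gguc<mask>cucugguuagaccagaucugagccu"
    for symbol in ["U", "W", "K", "D", "A"]:
        print(symbol, expand_prediction(symbol), fill_mask(masked, symbol))
```

Under this reading, the second-ranked prediction `W` would mean "A or U", which overlaps with both the top-ranked `U` and the fifth-ranked `A`.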
|
125 |
|
126 |
### Downstream Use
|
|
|
133 |
from multimolecule import RnaTokenizer, RnaMsmModel
|
134 |
|
135 |
|
136 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
|
137 |
+
model = RnaMsmModel.from_pretrained("multimolecule/rnamsm")
|
138 |
|
139 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
140 |
+
input = tokenizer(text, return_tensors="pt")
|
141 |
|
142 |
output = model(**input)
|
143 |
```
|
|
|
153 |
from multimolecule import RnaTokenizer, RnaMsmForSequencePrediction
|
154 |
|
155 |
|
156 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
|
157 |
+
model = RnaMsmForSequencePrediction.from_pretrained("multimolecule/rnamsm")
|
158 |
|
159 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
160 |
+
input = tokenizer(text, return_tensors="pt")
|
161 |
label = torch.tensor([1])
|
162 |
|
163 |
output = model(**input, labels=label)
|
164 |
```
|
165 |
|
166 |
+
#### Token Classification / Regression
|
167 |
|
168 |
**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
|
169 |
|
|
|
171 |
|
172 |
```python
|
173 |
import torch
|
174 |
+
from multimolecule import RnaTokenizer, RnaMsmForTokenPrediction
|
175 |
|
176 |
|
177 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
|
178 |
+
model = RnaMsmForTokenPrediction.from_pretrained("multimolecule/rnamsm")
|
179 |
|
180 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
181 |
+
input = tokenizer(text, return_tensors="pt")
|
182 |
label = torch.randint(2, (len(text), ))
|
183 |
|
184 |
output = model(**input, labels=label)
|
|
|
195 |
from multimolecule import RnaTokenizer, RnaMsmForContactPrediction
|
196 |
|
197 |
|
198 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
|
199 |
+
model = RnaMsmForContactPrediction.from_pretrained("multimolecule/rnamsm")
|
200 |
|
201 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
202 |
+
input = tokenizer(text, return_tensors="pt")
|
203 |
label = torch.randint(2, (len(text), len(text)))
|
204 |
|
205 |
output = model(**input, labels=label)
|