ZhiyuanChen committed on
Commit c3c66db · verified · 1 Parent(s): 969e0cc

Update README.md

Files changed (1)
  1. README.md +44 -31
README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
10
  pipeline_tag: fill-mask
11
  mask_token: "<mask>"
12
  widget:
13
  - example_title: "microRNA-21"
14
  text: "UAGC<mask>UAUCAGACUGAUGUUGA"
15
  output:
@@ -68,7 +81,7 @@ RNA-MSM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style
68
  - **Paper**: [Multiple sequence alignment-based RNA language model and its application to structural inference](https://doi.org/10.1093/nar/gkad1031)
69
  - **Developed by**: Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, Fan Xu, Thomas Litfin, Ke Chen, Jaswinder Singh, Xiansong Huang, Guoli Song, Yonghong Tian, Jian Zhan, Jie Chen, Yaoqi Zhou
70
  - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [MSA](https://doi.org/10.1101/2021.02.12.430858)
71
- - **Original Repository**: [https://github.com/yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM)
72
 
73
  ## Usage
74
 
@@ -85,29 +98,29 @@ You can use this model directly with a pipeline for masked language modeling:
85
  ```python
86
  >>> import multimolecule # you must import multimolecule to register models
87
  >>> from transformers import pipeline
88
- >>> unmasker = pipeline('fill-mask', model='multimolecule/rnamsm')
89
- >>> unmasker("uagc<mask>uaucagacugauguuga")
90
 
91
- [{'score': 0.2819758355617523,
92
  'token': 9,
93
  'token_str': 'U',
94
- 'sequence': 'U A G C U U A U C A G A C U G A U G U U G A'},
95
- {'score': 0.25282594561576843,
96
  'token': 15,
97
  'token_str': 'K',
98
- 'sequence': 'U A G C K U A U C A G A C U G A U G U U G A'},
99
- {'score': 0.22668947279453278,
100
- 'token': 8,
101
- 'token_str': 'G',
102
- 'sequence': 'U A G C G U A U C A G A C U G A U G U U G A'},
103
- {'score': 0.06814167648553848,
104
  'token': 18,
105
  'token_str': 'D',
106
- 'sequence': 'U A G C D U A U C A G A C U G A U G U U G A'},
107
- {'score': 0.03735977038741112,
108
- 'token': 14,
109
- 'token_str': 'W',
110
- 'sequence': 'U A G C W U A U C A G A C U G A U G U U G A'}]
111
  ```
112
 
113
  ### Downstream Use
@@ -120,11 +133,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
120
  from multimolecule import RnaTokenizer, RnaMsmModel
121
 
122
 
123
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnamsm')
124
- model = RnaMsmModel.from_pretrained('multimolecule/rnamsm')
125
 
126
  text = "UAGCUUAUCAGACUGAUGUUGA"
127
- input = tokenizer(text, return_tensors='pt')
128
 
129
  output = model(**input)
130
  ```
@@ -140,17 +153,17 @@ import torch
140
  from multimolecule import RnaTokenizer, RnaMsmForSequencePrediction
141
 
142
 
143
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnamsm')
144
- model = RnaMsmForSequencePrediction.from_pretrained('multimolecule/rnamsm')
145
 
146
  text = "UAGCUUAUCAGACUGAUGUUGA"
147
- input = tokenizer(text, return_tensors='pt')
148
  label = torch.tensor([1])
149
 
150
  output = model(**input, labels=label)
151
  ```
152
 
153
- #### Nucleotide Classification / Regression
154
 
155
  **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
156
 
@@ -158,14 +171,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
158
 
159
  ```python
160
  import torch
161
- from multimolecule import RnaTokenizer, RnaMsmForNucleotidePrediction
162
 
163
 
164
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnamsm')
165
- model = RnaMsmForNucleotidePrediction.from_pretrained('multimolecule/rnamsm')
166
 
167
  text = "UAGCUUAUCAGACUGAUGUUGA"
168
- input = tokenizer(text, return_tensors='pt')
169
  label = torch.randint(2, (len(text), ))
170
 
171
  output = model(**input, labels=label)
@@ -182,11 +195,11 @@ import torch
182
  from multimolecule import RnaTokenizer, RnaMsmForContactPrediction
183
 
184
 
185
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnamsm')
186
- model = RnaMsmForContactPrediction.from_pretrained('multimolecule/rnamsm')
187
 
188
  text = "UAGCUUAUCAGACUGAUGUUGA"
189
- input = tokenizer(text, return_tensors='pt')
190
  label = torch.randint(2, (len(text), len(text)))
191
 
192
  output = model(**input, labels=label)
 
10
  pipeline_tag: fill-mask
11
  mask_token: "<mask>"
12
  widget:
13
+ - example_title: "HIV-1"
14
+ text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
15
+ output:
16
+ - label: "U"
17
+ score: 0.25111356377601624
18
+ - label: "W"
19
+ score: 0.1200353354215622
20
+ - label: "K"
21
+ score: 0.10132723301649094
22
+ - label: "D"
23
+ score: 0.08383019268512726
24
+ - label: "A"
25
+ score: 0.05737845227122307
26
  - example_title: "microRNA-21"
27
  text: "UAGC<mask>UAUCAGACUGAUGUUGA"
28
  output:
 
81
  - **Paper**: [Multiple sequence alignment-based RNA language model and its application to structural inference](https://doi.org/10.1093/nar/gkad1031)
82
  - **Developed by**: Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, Fan Xu, Thomas Litfin, Ke Chen, Jaswinder Singh, Xiansong Huang, Guoli Song, Yonghong Tian, Jian Zhan, Jie Chen, Yaoqi Zhou
83
  - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [MSA](https://doi.org/10.1101/2021.02.12.430858)
84
+ - **Original Repository**: [yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM)
85
 
86
  ## Usage
87
 
 
98
  ```python
99
  >>> import multimolecule # you must import multimolecule to register models
100
  >>> from transformers import pipeline
101
+ >>> unmasker = pipeline("fill-mask", model="multimolecule/rnamsm")
102
+ >>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
103
 
104
+ [{'score': 0.25111356377601624,
105
  'token': 9,
106
  'token_str': 'U',
107
+ 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
108
+ {'score': 0.1200353354215622,
109
+ 'token': 14,
110
+ 'token_str': 'W',
111
+ 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'},
112
+ {'score': 0.10132723301649094,
113
  'token': 15,
114
  'token_str': 'K',
115
+ 'sequence': 'G G U C K C U C U G G U U A G A C C A G A U C U G A G C C U'},
116
+ {'score': 0.08383019268512726,
117
  'token': 18,
118
  'token_str': 'D',
119
+ 'sequence': 'G G U C D C U C U G G U U A G A C C A G A U C U G A G C C U'},
120
+ {'score': 0.05737845227122307,
121
+ 'token': 6,
122
+ 'token_str': 'A',
123
+ 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]
124
  ```
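
The updated fill-mask output ranks `W`, `K`, and `D` alongside `U` and `A`. These are IUPAC ambiguity codes (W = A or U, K = G or U, D = A, G, or U), not single nucleotides. The sketch below is one way to post-process such a result list in plain Python. It is an illustration only, not part of the README or the multimolecule API: `IUPAC_RNA`, `canonical_only`, and `expand` are hypothetical names, and the scores are copied from the output above.

```python
# Hypothetical helpers for reading the fill-mask output; not part of multimolecule.

# IUPAC ambiguity codes (RNA alphabet) for the symbols that appear in the output.
IUPAC_RNA = {
    "A": {"A"},
    "C": {"C"},
    "G": {"G"},
    "U": {"U"},
    "W": {"A", "U"},       # weak pairing bases
    "K": {"G", "U"},       # keto bases
    "D": {"A", "G", "U"},  # any base except C
}

# Token strings and scores copied from the pipeline output shown above.
predictions = [
    {"token_str": "U", "score": 0.25111356377601624},
    {"token_str": "W", "score": 0.1200353354215622},
    {"token_str": "K", "score": 0.10132723301649094},
    {"token_str": "D", "score": 0.08383019268512726},
    {"token_str": "A", "score": 0.05737845227122307},
]


def canonical_only(preds):
    """Keep only predictions that name exactly one nucleotide."""
    return [p for p in preds if len(IUPAC_RNA.get(p["token_str"], ())) == 1]


def expand(token):
    """Return the sorted nucleotides that an IUPAC code stands for."""
    return sorted(IUPAC_RNA[token])


print([p["token_str"] for p in canonical_only(predictions)])  # ['U', 'A']
print(expand("D"))  # ['A', 'G', 'U']
```

Filtering this way discards the probability mass assigned to ambiguity codes, so whether to drop or redistribute those scores depends on the downstream task.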
125
 
126
  ### Downstream Use
 
133
  from multimolecule import RnaTokenizer, RnaMsmModel
134
 
135
 
136
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
137
+ model = RnaMsmModel.from_pretrained("multimolecule/rnamsm")
138
 
139
  text = "UAGCUUAUCAGACUGAUGUUGA"
140
+ input = tokenizer(text, return_tensors="pt")
141
 
142
  output = model(**input)
143
  ```
 
153
  from multimolecule import RnaTokenizer, RnaMsmForSequencePrediction
154
 
155
 
156
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
157
+ model = RnaMsmForSequencePrediction.from_pretrained("multimolecule/rnamsm")
158
 
159
  text = "UAGCUUAUCAGACUGAUGUUGA"
160
+ input = tokenizer(text, return_tensors="pt")
161
  label = torch.tensor([1])
162
 
163
  output = model(**input, labels=label)
164
  ```
165
 
166
+ #### Token Classification / Regression
167
 
168
  **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
169
 
 
171
 
172
  ```python
173
  import torch
174
+ from multimolecule import RnaTokenizer, RnaMsmForTokenPrediction
175
 
176
 
177
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
178
+ model = RnaMsmForTokenPrediction.from_pretrained("multimolecule/rnamsm")
179
 
180
  text = "UAGCUUAUCAGACUGAUGUUGA"
181
+ input = tokenizer(text, return_tensors="pt")
182
  label = torch.randint(2, (len(text), ))
183
 
184
  output = model(**input, labels=label)
 
195
  from multimolecule import RnaTokenizer, RnaMsmForContactPrediction
196
 
197
 
198
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnamsm")
199
+ model = RnaMsmForContactPrediction.from_pretrained("multimolecule/rnamsm")
200
 
201
  text = "UAGCUUAUCAGACUGAUGUUGA"
202
+ input = tokenizer(text, return_tensors="pt")
203
  label = torch.randint(2, (len(text), len(text)))
204
 
205
  output = model(**input, labels=label)