BioMike committed on
Commit
edc4692
·
verified ·
1 Parent(s): b84d2d1

Update README.md

Files changed (1)
  1. README.md +7 -109
README.md CHANGED
@@ -43,8 +43,6 @@ Firstly, install the library:
43
  pip install chemical-converters
44
  ```
45
  ### SMILES to IUPAC
46
- You can choose pretrained model from table in the section "Models",
47
- but we recommend to use model "smiles2iupac-canonical-base".
48
  #### ! Preferred IUPAC style
49
  To choose the preferred IUPAC style, place style tokens before
50
  your SMILES sequence.
@@ -58,7 +56,7 @@ your SMILES sequence.
58
  #### To perform simple translation, follow the example:
59
  ```python
60
  from chemicalconverters import NamesConverter
61
- converter = NamesConverter(model_name="smiles2iupac-canonical-base")
62
  print(converter.smiles_to_iupac('CCO'))
63
  print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
64
  ```
@@ -69,7 +67,7 @@ print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
69
  #### Processing in batches:
70
  ```python
71
  from chemicalconverters import NamesConverter
72
- converter = NamesConverter(model_name="smiles2iupac-canonical-base")
73
  print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
74
  process_in_batch=True, batch_size=1000))
75
  ```
@@ -81,13 +79,13 @@ It's possible to validate the translations by reverse translation into IUPAC
81
  and calculating Tanimoto similarity of two molecules fingerprints.
82
  ````python
83
  from chemicalconverters import NamesConverter
84
- converter = NamesConverter(model_name="smiles2iupac-canonical-base")
85
  print(converter.smiles_to_iupac('CCO', validate=True))
86
  ````
87
  ````text
88
  ['ethanol'] 1.0
89
  ````
90
- The larger is Tanimoto similarity, the more is probability, that the prediction was correct.
91
 
92
  You can also process validation manually:
93
  ```python
@@ -98,85 +96,16 @@ print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CC
98
  ```text
99
  1.0
100
  ```
101
- !Note validation was not implemented in processing in batches.
102
-
103
- ### IUPAC to SMILES
104
- You can choose pretrained model from table in the section "Models",
105
- but we recommend to use model "iupac2smiles-canonical-base".
106
- #### To perform simple translation, follow the example:
107
- ```python
108
- from chemicalconverters import NamesConverter
109
- converter = NamesConverter(model_name="iupac2smiles-canonical-base")
110
- print(converter.smiles_to_iupac('ethanol'))
111
- print(converter.smiles_to_iupac(['ethanol', 'ethanol', 'ethanol']))
112
- ```
113
- ```text
114
- ['CCO']
115
- ['CCO', 'CCO', 'CCO']
116
- ```
117
- #### Processing in batches:
118
- ```python
119
- from chemicalconverters import NamesConverter
120
- converter = NamesConverter(model_name="smiles2iupac-canonical-base")
121
- print(converter.smiles_to_iupac(["buta-1,3-diene" for _ in range(10)], num_beams=1,
122
- process_in_batch=True, batch_size=1000))
123
- ```
124
- ```text
125
- ['<SYST>C=CC=C', '<SYST>C=CC=C'...]
126
- ```
127
- Our models also predict IUPAC styles from the table:
128
-
129
- | Style Token | Description |
130
- |-------------|----------------------------------------------------------------------------------------------------|
131
- | `<BASE>` | The most known name of the substance, sometimes is the mixture of traditional and systematic style |
132
- | `<SYST>` | The totally systematic style without trivial names |
133
- | `<TRAD>` | The style is based on trivial names of the parts of substances |
134
-
135
 
136
  ## Bias, Risks, and Limitations
137
 
138
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
139
-
140
- [More Information Needed]
141
-
142
- ### Recommendations
143
-
144
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
145
-
146
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
147
-
148
- ## How to Get Started with the Model
149
-
150
- Use the code below to get started with the model.
151
-
152
- [More Information Needed]
153
-
154
- ## Training Details
155
-
156
- ### Training Data
157
-
158
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
159
-
160
- [More Information Needed]
161
 
162
  ### Training Procedure
163
 
164
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
165
-
166
- #### Preprocessing [optional]
167
-
168
- [More Information Needed]
169
-
170
-
171
- #### Training Hyperparameters
172
-
173
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
174
-
175
- #### Speeds, Sizes, Times [optional]
176
-
177
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
178
 
179
- [More Information Needed]
180
 
181
  ## Evaluation
182
 
@@ -209,43 +138,12 @@ Use the code below to get started with the model.
209
  #### Summary
210
 
211
 
212
-
213
  ## Model Examination [optional]
214
 
215
  <!-- Relevant interpretability work for the model goes here -->
216
 
217
  [More Information Needed]
218
 
219
- ## Environmental Impact
220
-
221
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
222
-
223
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
224
-
225
- - **Hardware Type:** [More Information Needed]
226
- - **Hours used:** [More Information Needed]
227
- - **Cloud Provider:** [More Information Needed]
228
- - **Compute Region:** [More Information Needed]
229
- - **Carbon Emitted:** [More Information Needed]
230
-
231
- ## Technical Specifications [optional]
232
-
233
- ### Model Architecture and Objective
234
-
235
- [More Information Needed]
236
-
237
- ### Compute Infrastructure
238
-
239
- [More Information Needed]
240
-
241
- #### Hardware
242
-
243
- [More Information Needed]
244
-
245
- #### Software
246
-
247
- [More Information Needed]
248
-
249
  ## Citation [optional]
250
 
251
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
43
  pip install chemical-converters
44
  ```
45
  ### SMILES to IUPAC
 
 
46
  #### ! Preferred IUPAC style
47
  To choose the preferred IUPAC style, place style tokens before
48
  your SMILES sequence.
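
As a rough illustration of the sentence above, prefixing a style token is plain string concatenation. The helper name `with_style` is hypothetical and not part of `chemicalconverters`; the three tokens are the ones used in this README's examples.

```python
# Style tokens that appear in this README's examples.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str) -> str:
    """Return the SMILES string prefixed with a style token (hypothetical helper)."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


# Build one input per style for ethanol.
inputs = [with_style("CCO", token) for token in STYLE_TOKENS]
print(inputs)
```

The resulting list can be passed to `converter.smiles_to_iupac(...)` in the same way as the hand-written list in the example that follows.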
 
56
  #### To perform simple translation, follow the example:
57
  ```python
58
  from chemicalconverters import NamesConverter
59
+ converter = NamesConverter(model_name="smiles2iupac-canonical-small")
60
  print(converter.smiles_to_iupac('CCO'))
61
  print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
62
  ```
 
67
  #### Processing in batches:
68
  ```python
69
  from chemicalconverters import NamesConverter
70
+ converter = NamesConverter(model_name="smiles2iupac-canonical-small")
71
  print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
72
  process_in_batch=True, batch_size=1000))
73
  ```
 
79
  and calculating Tanimoto similarity of two molecules fingerprints.
80
  ````python
81
  from chemicalconverters import NamesConverter
82
+ converter = NamesConverter(model_name="smiles2iupac-canonical-small")
83
  print(converter.smiles_to_iupac('CCO', validate=True))
84
  ````
85
  ````text
86
  ['ethanol'] 1.0
87
  ````
88
+ The higher the Tanimoto similarity, the more likely it is that the prediction was correct.
89
 
90
  You can also process validation manually:
91
  ```python
 
96
  ```text
97
  1.0
98
  ```
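
For readers unfamiliar with the score, the sketch below shows how a Tanimoto similarity is computed from two fingerprints. It assumes each fingerprint is represented as a set of on-bit indices; the bit values are made up for illustration, and this is not the library's own implementation, which this diff does not show.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by the union of on-bits."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit indices, for illustration only.
input_bits = {3, 17, 42, 88}
round_trip_bits = {3, 17, 42, 88}   # reverse translation reproduced the molecule
different_bits = {3, 17, 90, 91}    # reverse translation gave a different molecule

print(tanimoto(input_bits, round_trip_bits))  # identical fingerprints
print(tanimoto(input_bits, different_bits))   # 2 shared bits out of 6 distinct bits
```

A score of 1.0 means the two fingerprints have the same on-bits; lower scores mean the round-tripped molecule differs from the input.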
 
99
 
100
  ## Bias, Risks, and Limitations
101
 
102
+ This model has limited accuracy on large molecules and currently doesn't support isomeric or isotopic SMILES.
 
103
 
104
  ### Training Procedure
105
 
106
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
107
 
108
+ The model was trained on 100M examples of SMILES-IUPAC pairs with lr=0.0003, batch_size=1024 for 2 epochs.
109
 
110
  ## Evaluation
111
 
 
138
  #### Summary
139
 
140
 
 
141
  ## Model Examination [optional]
142
 
143
  <!-- Relevant interpretability work for the model goes here -->
144
 
145
  [More Information Needed]
146
 
147
  ## Citation [optional]
148
 
149
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->