jannisborn committed: update
app.py
CHANGED
@@ -83,7 +83,17 @@ def regression_transformer(
     )
     model = RegressionTransformer(configuration=config, target=target)
     samples = list(model.sample(number_of_samples))
-
+    if algorithm_version == "polymer" and task == "Generate":
+        correct_samples = [(s, p) for s, p in samples if "." in s]
+        while len(correct_samples) < number_of_samples:
+            samples = list(model.sample(number_of_samples))
+            correct_samples.extend(
+                [
+                    (s, p)
+                    for s, p in samples
+                    if "." in s and (s, p) not in correct_samples
+                ]
+            )
     if task == "Predict":
         return draw_grid_predict(samples[0], target, domain=algorithm.split(":")[0])
     else:
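The added block keeps resampling until `number_of_samples` valid polymer candidates are collected; for the polymer model, a valid sample is a multi-fragment sequence (a monomer-catalyst pair), recognizable by the `.` fragment separator. A minimal, self-contained sketch of the same resample-until-enough pattern (`sample_fn` is a hypothetical stand-in for `model.sample`, not part of the app):

```python
from typing import Callable, List, Tuple

def sample_until_enough(
    sample_fn: Callable[[int], List[Tuple[str, float]]],
    n: int,
    is_valid: Callable[[str], bool] = lambda seq: "." in seq,
) -> List[Tuple[str, float]]:
    """Collect n unique (sequence, property) pairs that pass the validity check."""
    collected: List[Tuple[str, float]] = []
    while len(collected) < n:
        # Like the loop in app.py, this retries indefinitely if the model
        # never yields enough valid samples.
        for pair in sample_fn(n):
            if is_valid(pair[0]) and pair not in collected:
                collected.append(pair)
    return collected[:n]
```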
model_cards/regression_transformer_article.md
CHANGED
@@ -1,8 +1,8 @@
-# Model
+# Model documentation & parameters
 
 ## Parameters
 
-### Algorithm Version
+### Algorithm Version
 Which model checkpoint to use (trained on different datasets).
 
 ### Task
@@ -45,6 +45,60 @@ Optionally specifies a list of substructures that should definitely be present i
 *NOTE*: This keeps tokens even if they are included in `tokens_to_mask`.
 *NOTE*: Most models operate on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
 
+
+
+# Model card -- Regression Transformer
+
+**Model Details**: The [Regression Transformer](https://arxiv.org/abs/2202.01338) is a multitask Transformer that reformulates regression as a conditional sequence modeling task. This yields a dichotomous language model that seamlessly integrates property prediction with property-driven conditional generation.
+
+**Developers**: Jannis Born and Matteo Manica from IBM Research.
+
+**Distributors**: Original authors' code wrapped and distributed by the GT4SD Team (2023) from IBM Research.
+
+**Model date**: Preprint released in 2022, currently under review at *Nature Machine Intelligence*.
+
+**Model version**: Models trained and distributed by the original authors.
+- **Molecules: QED**: Model trained on 1.6M molecules (SELFIES) from ChEMBL and their QED scores.
+- **Molecules: Solubility**: QED model finetuned on the ESOL dataset from [Delaney et al. (2004), *J. Chem. Inf. Comput. Sci.*](https://pubs.acs.org/doi/10.1021/ci034243x) to predict water solubility. Model trained on augmented SELFIES.
+- **Molecules: USPTO**: Model trained on 2.8M [chemical reactions](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) from the US patent office. The model used SELFIES and a synthetic property (total molecular weight of all precursors).
+- **Molecules: Polymer**: Model finetuned on 600 ROPs (ring-opening polymerizations) with monomer-catalyst pairs. The model used three properties: conversion (`<conv>`), PDI (`<pdi>`) and molecular weight (`<molwt>`). Trained with augmented SELFIES and optimized only to generate catalysts, given a monomer and the property constraints. See the example for details.
+- **Molecules: Cosmo_acdl**: Model finetuned on 56k molecules with two properties (*pKa_ACDL* and *pKa_COSMO*). Model used augmented SELFIES.
+- **Molecules: Pfas**: Model finetuned on ~1k PFAS (perfluoroalkyl and polyfluoroalkyl substances) molecules with 9 properties, including some experimentally measured ones (biodegradability, LD50, etc.) and some synthetic ones (SCScore, molecular weight). Model trained on augmented SELFIES.
+- **Molecules: Logp_and_synthesizability**: Model trained on 2.9M molecules (SELFIES) from PubChem with **two** synthetic properties: the logP (partition coefficient) and the [SCScore by Coley et al. (2018), *J. Chem. Inf. Model.*](https://pubs.acs.org/doi/full/10.1021/acs.jcim.7b00622).
+- **Molecules: Crippen_logp**: Model trained on 2.9M molecules (SMILES) from PubChem, but *only* on the logP (partition coefficient).
+- **Proteins: Stability**: Model pretrained on 2.6M peptides from UniProt with the Boman index as property. Finetuned on the [**Stability**](https://www.science.org/doi/full/10.1126/science.aan0693) dataset from the [TAPE benchmark](https://proceedings.neurips.cc/paper/2019/hash/37f65c068b7723cd7809ee2d31d7861c-Abstract.html), which has ~65k samples.
+
+**Model type**: A Transformer-based language model trained on alphanumeric sequences to simultaneously perform sequence regression and conditional sequence generation.
+
+**Information about training algorithms, parameters, fairness constraints or other applied approaches, and features**:
+All models are trained with an alternating scheme that switches between optimizing the cross-entropy loss on the property tokens ("regression") and the self-consistency objective on the molecular tokens. See the [Regression Transformer](https://arxiv.org/abs/2202.01338) paper for details.
+
+**Paper or other resource for more information**:
+The [Regression Transformer](https://arxiv.org/abs/2202.01338) paper. See the [source code](https://github.com/IBM/regression-transformer) for details.
+
+**License**: MIT
+
+**Where to send questions or comments about the model**: Open an issue on the [GT4SD repository](https://github.com/GT4SD/gt4sd-core).
+
+**Intended Use. Use cases that were envisioned during development**: Chemical research, in particular drug discovery.
+
+**Primary intended uses/users**: Researchers and computational chemists using the model for model comparison or research exploration purposes.
+
+**Out-of-scope use cases**: Production-level inference; producing molecules with harmful properties.
+
+**Factors**: Not applicable.
+
+**Metrics**: High predictive power for the properties of the `model_version`.
+
+**Datasets**: Different ones, as described under **Model version**.
+
+**Ethical Considerations**: No specific considerations, as no private/personal data is involved. Please consult with the authors in case of questions.
+
+**Caveats and Recommendations**: Please consult with the original authors in case of questions.
+
+Model card prototype inspired by [Mitchell et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3287560.3287596)
+
+
 ## Citation
 
 ```bib
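To make the alternating objective described in the model card concrete, here is a schematic sketch of such a training loop. This is not the authors' implementation; `property_loss`, `molecular_loss`, `batches`, and `optimizer` are illustrative placeholders (a PyTorch-style API is assumed):

```python
def train_alternating(model, batches, property_loss, molecular_loss, optimizer):
    """Alternate between the two objectives on successive steps (sketch only)."""
    for step, batch in enumerate(batches):
        if step % 2 == 0:
            # Mask the property tokens and predict them from the molecular tokens:
            # cross-entropy on property tokens, i.e. the "regression" step.
            loss = property_loss(model, batch)
        else:
            # Mask molecular tokens and reconstruct them given the property:
            # the self-consistency objective on molecular tokens.
            loss = molecular_loss(model, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```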
model_cards/regression_transformer_description.md
CHANGED
@@ -1,7 +1,13 @@
 
+
+<img align="right" src="https://raw.githubusercontent.com/GT4SD/gt4sd-core/main/docs/_static/gt4sd_logo.png" alt="logo" width="120">
+
 ### Concurrent sequence regression and generation for molecular language modeling
 
-The
+The [Regression Transformer](https://arxiv.org/abs/2202.01338) is a multitask Transformer that reformulates regression as a conditional sequence modeling task.
 This yields a dichotomous language model that seamlessly integrates property prediction with property-driven conditional generation. For details see the [arXiv preprint](https://arxiv.org/abs/2202.01338), the [development code](https://github.com/IBM/regression-transformer) and the [GT4SD endpoint](https://github.com/GT4SD/gt4sd-core) for inference.
 
-Each `algorithm_version` refers to one trained model. Each model can be used for **two tasks**, either to *predict* one (or multiple) properties of a molecule or to *generate* a molecule (given a seed molecule and a property constraint).
+Each `algorithm_version` refers to one trained model. Each model can be used for **two tasks**, either to *predict* one (or multiple) properties of a molecule or to *generate* a molecule (given a seed molecule and a property constraint).
+
+For **examples** and **documentation** of the model parameters, please see below.
+Moreover, we provide a **model card** ([Mitchell et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3287560.3287596)) at the bottom of this page.
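For illustration, a minimal sketch of how one of these models might be invoked programmatically. Only the `RegressionTransformer(configuration=config, target=target)` and `model.sample(...)` calls are taken from the `app.py` diff above; the import path, the `RegressionTransformerProteins` configuration class, and its arguments are assumptions about the GT4SD API:

```python
# Sketch under assumptions: the import path and configuration class are not
# confirmed by this page; only RegressionTransformer(...) and model.sample(...)
# appear in the app.py diff above.
from gt4sd.algorithms.conditional_generation.regression_transformer import (
    RegressionTransformer,
    RegressionTransformerProteins,  # hypothetical configuration class
)

# "Predict" target from the examples CSV: masked <stab> property + fixed sequence.
target = "<stab>[MASK][MASK][MASK][MASK][MASK]|GSQEVNSGTQTYKNASPEEAERIARKAGATTWTEKGNKWEIRI"
config = RegressionTransformerProteins(algorithm_version="stability")

model = RegressionTransformer(configuration=config, target=target)
samples = list(model.sample(1))  # (sequence, property) tuples, as unpacked in app.py
```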
model_cards/regression_transformer_examples.csv
CHANGED
@@ -5,3 +5,5 @@ Proteins: Stability,Predict,<stab>[MASK][MASK][MASK][MASK][MASK]|GSQEVNSGTQTYKNA
 Proteins: Stability,Generate,GSQEVNSGTQTYKNASPEEAERIARKAGATTWTEKGNKWEIRI,10,Sample,1.2,30,True,0.3,<stab>:0.393,,SQEVNSGTQTYKN,WTEK
 Molecules: Qed,Generate,<qed>0.717|[MASK][MASK][MASK][MASK][MASK][C][Branch2_1][Ring1][Ring1][MASK][MASK][=C][C][Branch1_1][C][C][=N][C][MASK][MASK][=C][C][=C][Ring1][O][Ring1][Branch1_2][=C][Ring2][MASK][MASK],10,Sample,1.2,30,False,0.0,,,,
 Molecules: Solubility,Generate,ClC(Cl)C(Cl)Cl,5,Sample,1.3,40,True,0.4,<esol>:0.754,,,
+Molecules: Polymer,Predict,<conv>[MASK][MASK][MASK][MASK]|<pdi>[MASK][MASK][MASK][MASK][MASK]|<molwt>[MASK][MASK][MASK][MASK][MASK]|[C][Branch1_2][C][=O][O][C@@Hexpl][Branch1_1][C][C][C][Branch1_2][C][=O][O][C@Hexpl][Ring1][Branch2_2][C].[C][C][C][Branch2_1][Ring1][Ring1][N][C][Branch1_1][=C][N][C][=C][C][=C][Branch1_1][Ring1][O][C][C][=C][Ring1][Branch2_1][=S][C][C][C][Ring2][Ring1][C],1,Greedy,1,0,False,,,,,
+Molecules: Polymer,Generate,C1(=O)O[C@@H](C)C(=O)O[C@H]1C.C2CC(NC(NC1=CC=C(OC)C=C1)=S)CCC2,10,Sample,1.3,50,True,0.5,"<pdi>:3.490, <conv>:0.567, <molwt>:3.567",,,C1(=O)O[C@@H](C)C(=O)O[C@H]1C
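The quoted field in the new Generate row bundles several property constraints as `<token>:value` pairs. A small sketch of parsing that field; the helper name is ours, not from the repository:

```python
def parse_property_constraints(field: str) -> dict:
    """Parse a constraint field like '<pdi>:3.490, <conv>:0.567, <molwt>:3.567'."""
    constraints = {}
    for item in field.split(","):
        token, value = item.strip().split(":")
        constraints[token] = float(value)
    return constraints

assert parse_property_constraints("<pdi>:3.490, <conv>:0.567, <molwt>:3.567") == {
    "<pdi>": 3.490,
    "<conv>": 0.567,
    "<molwt>": 3.567,
}
```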