SMILES2IUPAC-canonical-base

SMILES2IUPAC-canonical-base was designed to accurately translate SMILES chemical notation into IUPAC names.

Model Details

Model Description

SMILES2IUPAC-canonical-base is based on the MT5 model, modified to use separate tokenizers for the encoder and the decoder.

  • Developed by: Knowledgator Engineering
  • Model type: Encoder-Decoder with attention mechanism
  • Language(s) (NLP): SMILES, IUPAC (English)
  • License: Apache License 2.0

Model Sources

Quickstart

First, install the library:

pip install chemical-converters

SMILES to IUPAC

Preferred IUPAC style

To choose the preferred IUPAC style, place style tokens before your SMILES sequence.

Style Token  Description
<BASE>       The most common name of the substance; sometimes a mixture of the traditional and systematic styles
<SYST>       A fully systematic style without trivial names
<TRAD>       A style based on the trivial names of the parts of the substance

To perform simple translation, follow the example:

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO'))
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
['ethanol']
['ethanol', 'ethanol', 'ethanol']
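
The style tokens can also be prepended programmatically. The following is a minimal sketch in plain Python; `with_style` and `all_styles` are hypothetical helpers written for this example and are not part of the chemical-converters library. They only build the input strings, which can then be passed to `smiles_to_iupac`.

```python
from typing import List

# Style tokens listed in the table above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "<BASE>") -> str:
    """Prepend a style token to a SMILES string (hypothetical helper)."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


def all_styles(smiles: str) -> List[str]:
    """Return the SMILES string prefixed with every known style token."""
    return [with_style(smiles, token) for token in STYLE_TOKENS]


print(all_styles("CCO"))  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

Rejecting unknown tokens early avoids silently sending a misspelled token to the model as part of the SMILES sequence.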

Processing in batches:

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1, 
                                process_in_batch=True, batch_size=1000))
['buta-1,3-diene', 'buta-1,3-diene'...]
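
For inputs too large to hold in a single call, the list can be sliced on the caller's side before it reaches the converter. The sketch below is a generic chunking helper, not a description of how the library batches internally; `chunked` is a hypothetical name introduced here.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


smiles = ["<BASE>C=CC=C"] * 10
batches = list(chunked(smiles, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Each slice could then be passed to `converter.smiles_to_iupac(...)` in turn, and the results concatenated.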

Validating SMILES to IUPAC translations

It's possible to validate a translation by translating the predicted IUPAC name back into SMILES and calculating the Tanimoto similarity between the fingerprints of the original and the reconstructed molecule.

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO', validate=True))
['ethanol'] 1.0

The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
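
For reference, the Tanimoto similarity of two fingerprints A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it in plain Python, assuming each fingerprint is represented as a set of "on" bit indices. Which fingerprint type the library uses is not stated here, so the bit sets in the example are made up for illustration.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / len(union)


print(tanimoto({1, 5, 9}, {1, 5, 9}))   # 1.0
print(tanimoto({1, 5, 9}, {1, 5, 12}))  # 0.5
```

A score of 1.0 means the two fingerprints are identical, which is what the `validate=True` example above reports for ethanol.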

You can also run the validation manually:

from chemicalconverters import NamesConverter

validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='ethanol', validation_model=validation_model))
1.0

Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
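
Because isomeric and isotopic SMILES are unsupported, it can help to flag such inputs before translation. The following is a rough heuristic sketch, not a SMILES parser: it assumes that stereochemistry shows up as `@`, `/` or `\`, and that an isotope shows up as digits immediately after an opening bracket (as in `[13CH4]`). The function name is hypothetical.

```python
import re

# '@' marks tetrahedral chirality; '/' and '\' mark double-bond geometry.
_STEREO = re.compile(r"[@/\\]")
# An isotope label is a number directly after '[' in a bracket atom.
_ISOTOPE = re.compile(r"\[\d+")


def looks_isomeric_or_isotopic(smiles: str) -> bool:
    """Heuristically flag SMILES that carry stereo or isotope information."""
    return bool(_STEREO.search(smiles) or _ISOTOPE.search(smiles))


for s in ["CCO", "C[C@H](N)C(=O)O", "F/C=C/F", "[13CH4]"]:
    print(s, looks_isomeric_or_isotopic(s))
```

A cheminformatics toolkit would give a more reliable answer; this check is only meant as a cheap pre-filter.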

Training Procedure

The model was trained on 100M examples of SMILES-IUPAC pairs with lr=0.00001, batch_size=512 for 2 epochs.
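
As a rough sense of scale, the figures above imply the following number of optimizer steps. This is back-of-the-envelope arithmetic that assumes exactly 100,000,000 examples, that 512 is the effective global batch size (no gradient accumulation), and that the final partial batch of each epoch is kept; none of these details are confirmed by the model card.

```python
import math

num_examples = 100_000_000  # "100M examples", assumed exact
batch_size = 512            # assumed to be the effective global batch size
epochs = 2

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 195313 390626
```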

Evaluation

Model                                Accuracy  BLEU-4 score  Size (MB)
SMILES2IUPAC-canonical-small         75%       0.93          23
SMILES2IUPAC-canonical-base          86.9%     0.964         180
STOUT V2.0*                          66.65%    0.92          128
STOUT V2.0 (according to our tests)  -         0.89          128
*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
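
The model card does not define how accuracy is measured. One common reading is the fraction of predictions that exactly match the reference name; the sketch below computes that quantity under this assumption. The function name and the sample data are illustrative only and are not taken from the evaluation set.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference (assumed metric)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


print(exact_match_accuracy(["ethanol", "buta-1,3-diene"], ["ethanol", "butane"]))  # 0.5
```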

Citation

Coming soon.

Model Card Authors

Mykhailo Shtopko

Model Card Contact

[email protected]
