
Utilities for working with SMILES based encodings of molecules for deep learning (PyTorch oriented)


hogru/pysmilesutils

 
 


PySMILESutils

PySMILES utilities is a package of tools for encoding and decoding SMILES for deep learning applications in PyTorch. The package contains a flexible tokenizer that uses regular expressions to analyze a given SMILES dataset and build a vocabulary of tokens, which can then be used to encode the molecules, via their SMILES, into PyTorch tensors. The augment class can be used for data augmentation via SMILES enumeration or atom order randomization.
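The tokenize, build-vocabulary and encode steps described above can be sketched in plain Python. This is an illustration of the idea only, not the package's API: the regular expression, the special tokens and the function names `tokenize`, `build_vocabulary`, `encode` and `pad_batch` are all assumptions made for this example.

```python
import re
from typing import Dict, List

# A simple SMILES pattern (assumption): bracket atoms, two-letter halogens,
# two-digit ring-closure labels (%NN), then any other single character.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Hypothetical special tokens; the package may use different ones.
PAD, START, END, UNK = "<pad>", "<s>", "</s>", "<unk>"


def tokenize(smiles: str) -> List[str]:
    """Split one SMILES string into tokens with the regular expression."""
    return SMILES_PATTERN.findall(smiles)


def build_vocabulary(dataset: List[str]) -> Dict[str, int]:
    """Map every token seen in the dataset to an integer index."""
    vocabulary = {token: index for index, token in enumerate([PAD, START, END, UNK])}
    for smiles in dataset:
        for token in tokenize(smiles):
            # New tokens get the next free index; known tokens are left alone.
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(smiles: str, vocabulary: Dict[str, int]) -> List[int]:
    """Convert a SMILES string to token indices wrapped in start/end markers."""
    unknown = vocabulary[UNK]
    ids = [vocabulary.get(token, unknown) for token in tokenize(smiles)]
    return [vocabulary[START]] + ids + [vocabulary[END]]


def pad_batch(batch: List[List[int]], pad_index: int) -> List[List[int]]:
    """Right-pad all sequences to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_index] * (longest - len(ids)) for ids in batch]


dataset = ["CCO", "c1ccccc1Br", "C[C@H](N)C(=O)O"]
vocabulary = build_vocabulary(dataset)
padded = pad_batch([encode(s, vocabulary) for s in dataset], vocabulary[PAD])
```

The rectangular `padded` list can be turned into a tensor with `torch.tensor(padded)`, which is the form a PyTorch model would consume.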

Moreover, the package contains a variety of dataset, sampler and dataloader classes for PyTorch, covering several tasks that come up when training on SMILES data. The BucketBatchSampler divides the dataset into buckets and randomly creates mini-batches from within each bucket. Each mini-batch then contains SMILES of approximately similar length, so sequence padding is kept to a minimum, which speeds up training.
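The bucketing idea can be illustrated with a small, framework-free sketch. It is not the BucketBatchSampler implementation: the function name `bucket_batches` and its parameters are hypothetical, and the choice to sort by length and cut the sorted indices into equally sized buckets is an assumption made for this example.

```python
import random
from typing import List, Sequence


def bucket_batches(
    lengths: Sequence[int],
    batch_size: int,
    num_buckets: int,
    seed: int = 0,
) -> List[List[int]]:
    """Return mini-batches of dataset indices with similar sequence lengths."""
    if batch_size < 1 or num_buckets < 1:
        raise ValueError("batch_size and num_buckets must be at least 1")
    if not lengths:
        return []
    rng = random.Random(seed)
    # Sort dataset indices by sequence length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Cut the sorted indices into contiguous buckets of near-equal size.
    bucket_size = -(-len(order) // num_buckets)  # ceiling division
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    batches: List[List[int]] = []
    for bucket in buckets:
        # Shuffle inside the bucket so batches differ between seeds.
        rng.shuffle(bucket)
        batches.extend(bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size))
    # Shuffle the batch order so training does not see lengths in sorted order.
    rng.shuffle(batches)
    return batches


lengths = [3, 50, 4, 48, 5, 52, 6, 49]
batches = bucket_batches(lengths, batch_size=2, num_buckets=2)
```

With these lengths, short and long sequences never share a batch, so each batch needs little padding. A list of index batches like this can be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument.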

For datasets that are too large to fit in memory, chunk-based loading can be applied, and for data that needs pre-augmentation (e.g. slow Levenshtein augmentation), the epochs can be pre-created on disk.
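Chunk-based loading can be sketched with a generator that reads a fixed number of lines at a time. This is a minimal illustration, not the package's dataset classes: the function name `iter_chunks` and the one-SMILES-per-line file layout are assumptions.

```python
from itertools import islice
from typing import Iterator, List


def iter_chunks(path: str, chunk_size: int) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size SMILES, reading the file lazily."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    with open(path, "r", encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        non_empty = (line for line in stripped if line)  # skip blank lines
        while True:
            # Only chunk_size lines are held in memory at any time.
            chunk = list(islice(non_empty, chunk_size))
            if not chunk:
                return
            yield chunk
```

Each chunk can then be tokenized, shuffled and batched on its own before the next chunk is read, so memory use depends on `chunk_size` rather than on the size of the file.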

Prerequisites

Before you begin, ensure you have met the following requirements:

  • Linux, Windows and macOS are supported, as long as the dependencies are available on the platform.

  • You have installed Anaconda or Miniconda with Python 3.6 - 3.8

The tool has been developed on a Linux platform.

Installation

Dependencies

Dependencies are listed in the environment.yml file and can be installed in a conda environment, either when creating it

conda env create -f environment.yml

or by updating an already activated environment

conda env update --file environment.yml

Installation with pip

git clone https://github.com/MolecularAI/pysmilesutils.git

cd pysmilesutils

pip install .

pip can also install directly from GitHub

python -m pip install git+https://github.com/MolecularAI/pysmilesutils.git

Alternatively, the package can be installed in developer mode, which leaves the source directory editable; changes take effect immediately, without reinstalling.

pip install -e .

Testing

After installation, the package can be tested with pytest.

cd tests

pytest

It is also recommended to run through the scripts in the examples directory.

Documentation

Sphinx documentation can be built with, e.g., the make.sh script in the "docs" directory

./docs/make.sh

Moreover, the examples directory contains some #%% delimited notebooks that show how to use the various classes. These scripts can be paired with Jupyter notebooks using the jupytext extension, and are also compatible with VS Code. #%% delimited scripts are much more Git friendly than Jupyter notebooks.

The training example shows in full how to train a transformer model using different approaches for handling the conversion of the SMILES in the mini-batches.
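For readers unfamiliar with the format, a cell-delimited script looks roughly like the sketch below. The content is hypothetical and not taken from the examples directory; it only shows how `# %%` markers split an ordinary Python file into cells, with `# %% [markdown]` marking a text cell.

```python
# %% [markdown]
# # SMILES length overview
# A hypothetical example script; each `# %%` line starts a new cell.

# %%
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

# %%
# Character count per SMILES string (not a token count).
lengths = {s: len(s) for s in smiles}
print(lengths)
```

Because the file is plain Python, it runs unchanged with the regular interpreter and produces readable diffs, while jupytext or VS Code can present the same file as a notebook.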

Contributing

We welcome contributions, in the form of issues or pull requests.

If you have a question or want to report a bug, please submit an issue.

To contribute code to the project, follow these steps:

  1. Fork this repository.
  2. Create a branch: git checkout -b <branch_name>.
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the remote branch: git push
  5. Create the pull request.

Please use the black package for formatting, and follow the PEP 8 style guide.

Contributors

License

The software is licensed under the Apache 2.0 license (see LICENSE file), and is free and provided as-is.

References

Framework:

  • Bjerrum, E., Rastemo, T., Irwin, R., Kannas, C. & Genheden, S. PySMILESUtils – Enabling deep learning with the SMILES chemical language. ChemRxiv (2021). doi:10.33774/chemrxiv-2021-kzhbs

Augmentation:
