Added function to get the number of stereoisomers #217

Merged (8 commits) on Nov 20, 2023

@zhu0619 zhu0619 commented Nov 17, 2023

Changelogs

  • Added datamol.isomers._enumerate.count_stereoisomers
  • Added unit tests for counting stereoisomers, covering both the undefined-only case and all possible stereoisomers.

In the step Chem.FindPotentialStereoBonds(mol, cleanIt=clean_it), the existing stereo information on bonds is cleared if cleanIt=True.
Therefore, cleanIt should be disabled when enumerating or counting only undefined stereochemistry for molecules that already have defined stereo information on their bonds.

See example below:
image

Reproduce the error

import datamol as dm
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

undefined_only = True  # <- enumerate only unassigned stereo centers and bonds
rationalise = True
clean_it = True  # per this PR, this clears the defined stereo marking on the double bond

stereo_opts = StereoEnumerationOptions(
    tryEmbedding=rationalise,
    onlyUnassigned=undefined_only,
    unique=True,
)

# Raw string so the backslash in the SMILES is not read as an escape sequence.
mol = dm.to_mol(r"Br/C=C\Br")
Chem.AssignStereochemistry(mol, force=False, flagPossibleStereoCenters=True, cleanIt=clean_it)  # type: ignore
Chem.FindPotentialStereoBonds(mol, cleanIt=clean_it)  # type: ignore

# The input bond is fully defined, yet more than one isomer is drawn.
dm.to_image(list(EnumerateStereoisomers(mol, options=stereo_opts)))

@zhu0619 zhu0619 requested a review from hadim as a code owner November 17, 2023 23:56

codecov bot commented Nov 17, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (9e94d02) 91.96% compared to head (e812492) 91.93%.

| Files | Patch % | Lines |
|---|---|---|
| datamol/isomers/_enumerate.py | 90.90% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #217      +/-   ##
==========================================
- Coverage   91.96%   91.93%   -0.03%     
==========================================
  Files          46       46              
  Lines        3832     3843      +11     
==========================================
+ Hits         3524     3533       +9     
- Misses        308      310       +2     
Flag Coverage Δ
unittests 91.93% <91.66%> (-0.03%) ⬇️


@hadim hadim left a comment


Thanks Lu.

It looks good to me after fixing the docstring.

Question: my understanding is that GetStereoisomerCount will actually do exactly the same as enumerate_stereoisomers with n_variants=<MAX> and simply call len() on the output. Am I correct here? Maybe check what the rdkit code is doing under the hood. Not really a big deal for me here, but I just wanted to flag it in case you think count() should instead reuse enumerate().

rationalise: If we should try to build and rationalise the molecule to ensure it
can exist.
clean_it: A flag for assigning stereochemistry. If True, it will remove previous stereochemistry
markings on the bonds.

I think the CI is failing because of the malformed docstring. Locally, you can run mkdocs serve to reproduce the error, which should help with fixing the docstring.


zhu0619 commented Nov 20, 2023

> Thanks Lu.
>
> It looks good to me after fixing the docstring.
>
> Question: my understanding is that GetStereoisomerCount will actually do exactly the same as enumerate_stereoisomers with n_variants=<MAX> and simply call len() on the output. Am I correct here? Maybe check what the rdkit code is doing under the hood. Not really a big deal for me here, but I just wanted to flag it in case you think count() should instead reuse enumerate().

[GetStereoisomerCount](https://github.com/rdkit/rdkit/blob/2a68050ed07a3b27cabf33d535f0c46117135209/rdkit/Chem/EnumerateStereoisomers.py#L136C24-L136C24) computes an estimate based on the stereo bonds, so in some cases the count from GetStereoisomerCount is larger than the number of isomers actually enumerated.
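
To illustrate why an upper-bound estimate can exceed the enumerated count, here is a toy sketch in plain Python. It is not RDKit's implementation: the function names and the symmetry rule (a label assignment and its reverse describe the same molecule) are assumptions made only for this example, chosen to mimic a meso compound with two equivalent stereocenters.

```python
from itertools import product


def upper_bound_count(n_sites: int) -> int:
    """Estimate: treat every stereo site as an independent two-way choice."""
    return 2 ** n_sites


def enumerate_unique(n_sites: int, symmetric: bool) -> set:
    """Enumerate R/S label assignments and merge duplicates.

    Toy assumption: in a symmetric molecule, reading the sites from the
    other end gives the same molecule, so an assignment and its reverse
    are counted as one isomer.
    """
    seen = set()
    for labels in product("RS", repeat=n_sites):
        # Pick one representative for each (assignment, reversed assignment) pair.
        key = min(labels, labels[::-1]) if symmetric else labels
        seen.add(key)
    return seen


estimate = upper_bound_count(2)
exact = len(enumerate_unique(2, symmetric=True))
print(f"estimate={estimate}, enumerated={exact}")
```

In this toy model the estimate only counts sites, while the enumeration has to build and deduplicate every assignment, which is also why the estimate is much cheaper on large datasets.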

Initially, I was using the output of enumerate_stereoisomers, but the computation takes too long, especially for large datasets, even with parallelization.

I will also add an option to count the isomers using enumerate_stereoisomers if the user needs more accurate counts.


hadim commented Nov 20, 2023

OK, so it seems like GetStereoisomerCount is doing a slightly different thing and also seems faster. All good then, thank you Lu!

@zhu0619 zhu0619 merged commit c23d273 into main Nov 20, 2023
15 checks passed
@hadim hadim deleted the feat/isomers branch November 24, 2023 15:11