increased the maximum number of matches from an rdkit smarts query #3470
…. Sphzone operating on an empty selection now returns an empty atom group.
…ection tests to fix issue_#2915.
…testing to confirm.
…to be more consistent.
Hello @orioncohen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2022-06-01 12:56:56 UTC
Codecov Report
@@             Coverage Diff             @@
##           develop    #3470      +/-   ##
===========================================
+ Coverage    94.33%   94.35%   +0.02%
===========================================
  Files          191      191
  Lines        24917    24975      +58
  Branches      3357     3365       +8
===========================================
+ Hits         23505    23565      +60
+ Misses        1364     1362       -2
  Partials        48       48
10k is also somewhat arbitrary. The OpenFF toolkit uses the largest unsigned int, could that work too? ( |
I'm not super into the idea of hardcoding stuff like this in. @lilyminium did OFF do any related benchmarks on the cost of this, or was it just done for the sake of completeness without performance in mind? If this is something that will regularly be used, maybe we should just expose the argument to users?
@IAlibay I doubt it, and tbh it's difficult to hit with small molecules anyway. We should probably just expose the argument. IIRC OpenFF has merged this too now. Edit: worth noting that 1000 is RDKit's own default limit, and I like setting it to the highest possible number as a default instead.
Thank you for the feedback. I set the default value to the largest unsigned int, added an optional
As @lilyminium pointed out, it's going to be hardcoded to 1000 by default. I tend to think a higher hardcoded value is preferable. Please let me know if this needs anything else!
It's probably worth holding off and seeing what the person who put the limit in thinks (@cbouy). The comparisons to OFF are a little misleading: a design difference between OFF and MDA is that OFF templates molecules and only needs to perform matches on the unique molecule types (e.g. there's one water stereotype that water molecules point to), whereas MDA just brute forces this. This means that selections like this one scale differently, and OFF will likely be better at these things.
The test failures seem to be chemfiles related. The feedstock was updated just an hour ago, so this PR probably just ran at a bad time. Let's see if it re-runs a bit later in the day.
I didn't enforce any limit, I just mentioned in the docs that RDKit's default is 1000 unique matches. Exposing As for a good default value, why not use |
@orbeckst correct, this PR now serves to add additional kwargs. Namely, it creates the I'll resolve the conflicts next week and push the changes. |
@orioncohen any chance to make some time in your schedule to finish up this PR? Would be good to put a bow on it.
# Conflicts: # package/CHANGELOG # package/MDAnalysis/core/selection.py
Thanks for the bump @orbeckst. I merged in the develop branch and this should be good to go.
Couple of things. Overall lgtm; however, I would be ok with swapping the default to max(1000, n_atoms) as you suggested.
package/MDAnalysis/core/selection.py
Outdated
@@ -678,7 +679,9 @@ def _apply(self, group):
         if not pattern:
             raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
         mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
-        matches = mol.GetSubstructMatches(pattern, useChirality=True)
+        self.smarts_kwargs.setdefault("useChirality", True)
+        self.smarts_kwargs.setdefault("maxMatches", 1000)
So re-reading the various comments, I actually quite like your suggestion (#3470 (comment)) to use max(1000, n_atoms) as the default. I would be happy with that being implemented here and skipping the warning / deprecation, with the excuse that it's technically a "bug" in large systems.
I might actually suggest raising it to max(1000, 10 * n_atoms). When using smarts selection I've found that n_atoms is often too low to capture all matches. While 10 * n_atoms is a bit arbitrary, I believe it would work better for many users. We could also set it to some extremely high number and give users responsibility for not using a selection like CC that would be very slow.
Do we have a sense for how frequently this happens at n_atoms vs 10 * n_atoms, and how large a cost each order of magnitude adds? Is there a way this could easily be benchmarked?
If we think we'll still get a lot of problem cases unless we go to a number that is prohibitively expensive, we might want to inspect the returned number of matches and raise a warning if maxMatches == len(matches).
I ran a quick benchmark, and the match number seems to have a negligible impact on the overall time of calling select_atoms. The system I explored has ~8000 atoms:
u.select_atoms(f'smarts C', smarts_kwargs={'maxMatches': 10})
and
u.select_atoms(f'smarts C', smarts_kwargs={'maxMatches': 100000})
The former returned 10 atoms in 718 ms and the latter returned 2035 atoms, which is the correct count, in 728 ms. It seems that the Python code is much slower than the substructure matching written in C++. Except for very large systems, I doubt the substructure matching will be an issue.
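A comparison like this could be reproduced with a small timing helper. The sketch below is only illustrative: `time_call` is a hypothetical helper (not part of MDAnalysis), and the stand-in workload exists only so the snippet runs without a trajectory. With a real `Universe` named `u`, the callable would be the `u.select_atoms(...)` call quoted above, wrapped in a `lambda`.

```python
import time


def time_call(fn, repeat=5):
    """Call fn() `repeat` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload (assumption: replace with a select_atoms call on a real Universe).
best, result = time_call(lambda: sum(range(10_000)))
print(f"best of 5: {best * 1e3:.3f} ms")
```

Taking the best of several repeats reduces noise from one-off costs, which matters when the two timings differ by only about 10 ms.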
I'm not sure how frequently n_atoms is insufficient. I wasn't able to reproduce the issue on the system I have on hand, but I have encountered it in the past.
Ok yeah, that makes sense: it'll just be a loop that terminates once there are no further matches, so you won't get any difference in performance unless maxMatches < all possible matches.
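As a toy model of that reasoning (not RDKit's actual implementation), a capped search can be pictured as taking at most `max_matches` items from a stream of matches. The names `first_n_matches` and `fake_matches` are hypothetical and exist only for this sketch.

```python
from itertools import islice


def first_n_matches(match_iter, max_matches):
    """Collect at most max_matches items; stops early if the iterator is exhausted."""
    return list(islice(match_iter, max_matches))


def fake_matches(n_total):
    """Stand-in for a substructure search: yields one atom-index tuple per match."""
    for i in range(n_total):
        yield (i,)


# The cap only changes the outcome when it is smaller than the number of matches.
all_found = first_n_matches(fake_matches(2035), 100_000)
truncated = first_n_matches(fake_matches(2035), 10)
print(len(all_found), len(truncated))  # 2035 10
```

With a cap of 100000 the loop still ends after 2035 matches, so raising the cap adds no work once it exceeds the true match count.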
I think we should do the following:
1. We set maxMatches to some arbitrary high value if we think that's beneficial in most cases (10 * n_atoms if you want for now).
2. We put up a warning in the docs detailing a) the issue, explaining what happens when you run out of allowable matches, b) how to fix it, c) the fact that this is currently under refinement and the maximum number of matches may grow / reduce in future versions.
3. Issue a warning if the number of matches == maxMatches that tells users the maximum number of matches was returned and that they may need to increase maxMatches if they want more. (@jbarnoud @lilyminium I'm terrible at adding extra warnings - you both have stronger views on these things, are you for/against this?)
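The warning in the last point could look roughly like the following. This is a minimal sketch under stated assumptions: `warn_if_truncated` and the message wording are hypothetical, not the code merged in this PR.

```python
import warnings


def warn_if_truncated(matches, max_matches):
    """Warn when len(matches) equals the cap, which suggests the result was cut off."""
    if len(matches) == max_matches:
        warnings.warn(
            f"The selection returned the maximum number of matches "
            f"({max_matches}); some matches may have been omitted. "
            f"Consider passing a larger maxMatches via smarts_kwargs.",
            UserWarning,
        )
    return matches


# Hitting the cap exactly triggers the warning; staying below it does not.
warn_if_truncated([(0,), (1,)], max_matches=2)
warn_if_truncated([(0,)], max_matches=2)
```

The check can produce a false positive when the true number of matches happens to equal the cap, which is one reason to phrase the message as "may have been omitted".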
That all sounds good to me. A warning seems wise because a) matches == maxMatches almost certainly is not the intended behavior, and b) otherwise it would be a silent error that would be tricky to track down.
I just pushed an implementation of 1-3.
…en/mdanalysis into increase_smarts_matches
Merge branch 'develop' of https://github.com/MDAnalysis/mdanalysis into develop
# Conflicts: # package/CHANGELOG
I implemented @IAlibay's suggested changes and fixed the conflicts with CHANGELOG. I'd love to get this merged if there are no more changes!
Thanks @orioncohen, I'll try to review it in a few |
I've not read all the docs fully yet, but this will need addressing first.
package/MDAnalysis/core/selection.py
Outdated
@@ -678,7 +679,17 @@ def _apply(self, group):
         if not pattern:
             raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
         mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
-        matches = mol.GetSubstructMatches(pattern, useChirality=True)
+        self.smarts_kwargs.setdefault("useChirality", True)
+        self.smarts_kwargs.setdefault("maxMatches", len(group) * 10)
I thought we had agreed on max(1000, n_atoms*10) so as not to change the default in cases where you have few atoms?
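To make the difference concrete, here is a minimal sketch of the agreed default. `fill_smarts_defaults` is a hypothetical helper that mirrors the `setdefault` pattern in the diff above; it is not the function in `selection.py`.

```python
def fill_smarts_defaults(smarts_kwargs, n_atoms):
    """Return a copy of smarts_kwargs with defaults filled in, keeping user values."""
    kwargs = dict(smarts_kwargs or {})
    kwargs.setdefault("useChirality", True)
    # Floor of 1000 keeps RDKit's own default for small groups.
    kwargs.setdefault("maxMatches", max(1000, 10 * n_atoms))
    return kwargs


print(fill_smarts_defaults(None, 50)["maxMatches"])                    # 1000
print(fill_smarts_defaults(None, 8000)["maxMatches"])                  # 80000
print(fill_smarts_defaults({"maxMatches": 10}, 8000)["maxMatches"])    # 10
```

With `len(group) * 10` alone, a 50-atom group would get a cap of 500, which is lower than the previous behaviour; the `max(1000, ...)` floor avoids that, and an explicit user value always takes precedence.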
LGTM barring Irfan's comment about the default number of matches
Ok that should be fine now.
@richardjgowers do you want to have a quick glance at the changes I made? Then I think it's good to go.
Thanks @orioncohen! Sorry for taking over at the end, I just wanted to get this through for the 2.2.0 release.
Thanks for wrapping it up @IAlibay, glad this is merged! I've been moving the past couple days so I was AFK.
Related to Issue #3469.
Changes made in this Pull Request:
GetSubstructMatch and convert_to.
PR Checklist