Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

increased the maximum number of matches from an rdkit smarts query #3470

Merged
merged 50 commits into from
Jun 1, 2022

Conversation

orionarcher
Copy link
Contributor

@orionarcher orionarcher commented Nov 29, 2021

Related to Issue #3469.

Changes made in this Pull Request:

  • increased the maximum number of matches from a SMARTS selection query from 1000 to n_atoms.
  • separated kwargs for GetSubstructMatch and convert_to.
  • updated documentation to clarify kwarg usage.

PR Checklist

  • Tests?
  • Update main docs
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Nov 29, 2021

Hello @orioncohen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 3022:80: E501 line too long (82 > 79 characters)
Line 3025:80: E501 line too long (87 > 79 characters)
Line 3035:80: E501 line too long (81 > 79 characters)

Line 598:80: E501 line too long (97 > 79 characters)

Comment last updated at 2022-06-01 12:56:56 UTC

@codecov
Copy link

codecov bot commented Nov 29, 2021

Codecov Report

Merging #3470 (c739404) into develop (eea743f) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #3470      +/-   ##
===========================================
+ Coverage    94.33%   94.35%   +0.02%     
===========================================
  Files          191      191              
  Lines        24917    24975      +58     
  Branches      3357     3365       +8     
===========================================
+ Hits         23505    23565      +60     
+ Misses        1364     1362       -2     
  Partials        48       48              
Impacted Files Coverage Δ
package/MDAnalysis/core/groups.py 98.58% <ø> (ø)
package/MDAnalysis/core/selection.py 98.81% <100.00%> (+<0.01%) ⬆️
package/MDAnalysis/coordinates/DCD.py 100.00% <0.00%> (ø)
package/MDAnalysis/coordinates/XDR.py 100.00% <0.00%> (ø)
package/MDAnalysis/coordinates/DLPoly.py 98.78% <0.00%> (+<0.01%) ⬆️
package/MDAnalysis/coordinates/H5MD.py 97.61% <0.00%> (+0.01%) ⬆️
package/MDAnalysis/coordinates/TRJ.py 97.90% <0.00%> (+0.01%) ⬆️
package/MDAnalysis/coordinates/memory.py 98.72% <0.00%> (+0.01%) ⬆️
package/MDAnalysis/coordinates/PDB.py 94.84% <0.00%> (+0.02%) ⬆️
package/MDAnalysis/coordinates/chemfiles.py 97.25% <0.00%> (+0.03%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eea743f...c739404. Read the comment docs.

@lilyminium
Copy link
Member

10k is also somewhat arbitrary. The OpenFF toolkit uses the largest unsigned int, could that work too? (np.iinfo(np.uintc).max)

@IAlibay
Copy link
Member

IAlibay commented Nov 29, 2021

I'm not super into the idea of hardcoding stuff like this in. @lilyminium did OFF do any related benchmarks on the cost of this or was it just done for the sake of completeness without performance in mind?

If this is something that will regularly be used maybe we should just expose the argument to users?

@lilyminium
Copy link
Member

lilyminium commented Nov 29, 2021

@IAlibay I doubt it, and tbh it's difficult to hit with small molecules anyway. We should probably just expose the argument. IIRC OpenFF has merged this too now, max_matches=None is what gets you the largest unsigned int.

Edit: worth noting that 1000 is RDKit's own default limit, and I like setting it to the highest possible number as a default instead.

@orionarcher
Copy link
Contributor Author

orionarcher commented Nov 30, 2021

Thank you for the feedback. I set the default value to the largest unsigned int, added an optional max_matches argument to the rdkit_kwargs, wrote a test to confirm behavior, and added some documentation and an example in the docstring of atoms.select_atoms.

I'm not super into the idea of hardcoding stuff like this in.

As @lilyminium pointed out, it's going to be hardcoded to 1000 by default. I tend to think a higher hardcoding is preferable.

Please let me know if this needs anything else!

@richardjgowers
Copy link
Member

It's probably worth holding off and seeing what the person who put the limit in thinks (@cbouy )

The comparisons to OFF are a little misleading, a design difference between OFF and MDA is that OFF templates molecules and will only need to perform matches on the unique molecule types (e.g. there's one water stereotype that water molecules point to), whereas MDA just brute forces this. This will mean that things like this selection scale differently, and OFF will likely be better at these things.

@IAlibay
Copy link
Member

IAlibay commented Nov 30, 2021

It's probably worth holding off and seeing what the person who put the limit in thinks (@cbouy )

Not that I don't want @cbouy's opinion on this (given that it's his original code), but in his defense isn't GetSubstructMatches an RDKit method? (i.e. it's just a built-in default)

@IAlibay
Copy link
Member

IAlibay commented Nov 30, 2021

Seems like test failures are chemfiles related - seems like the feedstock got updated just an hour ago so this PR probably just ran at a bad time. Let's see if it re-runs a bit later in the day.

@cbouy
Copy link
Member

cbouy commented Nov 30, 2021

I didn't enforce any limit, I just mentioned in the docs that RDKit's default is 1000 unique matches.

Exposing max_matches is a good idea, although I would keep the RDKit syntax for it: maxMatches.
I don't particularly like camel-case in python but since users will have to type rdkit_kwargs=... I think it's better to stay consistent with RDKit, as we've done with other RDKit-based functions.

As for a good default value, why not use maxMatches=mol.GetNumAtoms() instead? This way you consistently get the maximum number of actual matches and not an arbitrary number.

@orionarcher
Copy link
Contributor Author

@orbeckst correct, this PR now serves to add additional kwargs. Namely, it creates the smarts_kwargs argument in select_atoms. It allows users to tailor the behavior of the smarts keyword by passing arguments to RDKit's GetSubstructMatch function. Previously, there was no way to override the default values. This has been causing me many issues as I've tried to select specific atoms and regularly hit the default 1000 match limit. I think this will be broadly useful to users who want to use smarts on large systems.

I'll resolve the conflicts next week and push the changes.

@orbeckst
Copy link
Member

@orioncohen any chance to make some time in your schedule to finish up this PR? Would be good to put a bow on it.

# Conflicts:
#	package/CHANGELOG
#	package/MDAnalysis/core/selection.py
@orionarcher
Copy link
Contributor Author

Thanks for the bump @orbeckst. I merged in the develop branch and this should be good to go.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things, overall lgtm, however I would be ok with swapping to the default to max(1000, n_atoms) as you suggested

package/MDAnalysis/core/selection.py Show resolved Hide resolved
@@ -678,7 +679,9 @@ def _apply(self, group):
if not pattern:
raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
matches = mol.GetSubstructMatches(pattern, useChirality=True)
self.smarts_kwargs.setdefault("useChirality", True)
self.smarts_kwargs.setdefault("maxMatches", 1000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So re-reading the various comments I actually quite like your suggestion (#3470 (comment)) to use max(1000, n_atoms) as the default, I would happy with that being implemented here and skipping the warning / deprecation with the excuse that it's technically a "bug" in large systems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might actually suggest raising it to max(1000, 10 * n_atoms). When using smarts selection I've found that n_atoms is often too low to capture all matches. While 10 * n_atoms is a bit arbitrary, I believe it would work better for many users. We could also set it to some extremely high number and give users responsibility for not using a selection like CC that would be very slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a sense for how frequently this happens at n_atoms vs 10 * n_atoms, and how large a cost each order of magnitude adds? Is there a way this could easily be benchmarked?

If we think we'll still get a lot of problem cases unless we have to go to a number that is prohibitively expensive, we might want to inspect the return number of matches and raise a warning if maxMatch == len(matches)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I performed a quick benchmarking and the match number seems to have a negligible impact on the overall time calling select_atoms. The system I explored has ~8000 atoms

u.select_atoms(f'smarts C', smarts_kwargs={'maxMatches': 10})

and

u.select_atoms(f'smarts C', smarts_kwargs={'maxMatches': 100000})

the former returned 10 atoms in 718 ms and the latter returned 2035 atoms in 728 ms, which is correct. It seems that the python code is much slower than the substructure matching written in C. Except for very large systems, I doubt the substructure matching will be an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how frequently n_atoms is insufficient. I wasn't able to reproduce the issue on the system I have on hand, but I have encountered it in the past.

Copy link
Member

@IAlibay IAlibay May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yeah that makes sense it'll just be a loop that terminates once there are no further matches so you won't get any difference in performance unless maxMatches < all possible matches.

I think we should do the following:

  1. We set maxMatches to some arbitrary high value if we think that's beneficial in most cases (10 * n_atoms if you want for now).
  2. We put up a warning in the docs detailing a) the issue, explaining what happens when you run out of allowable matches, b) how to fix it, c) the fact that this is currently under refinement and the maximum number of matches may grow / reduce in future versions.
  3. Issue a warning if the number of matches == maxMatches that tells users that a maximum number of matches was returned and they may need to increase the number of matches if they want. (@jbarnoud @lilyminium I'm terrible at adding extra warnings - you both have stronger views on these things, are you for/against this?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all sounds good to me. A warning seems wise because a) matches == maxMatches almost certainly is not the intended behavior, and b) otherwise it would be a silent error that would be tricky to track down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed an implementation of 1-3.

package/MDAnalysis/core/groups.py Outdated Show resolved Hide resolved
@orionarcher
Copy link
Contributor Author

I implemented @IAlibay's suggested changes and fixed the conflicts with CHANGELOG. I'd love to get this merged if there are no more changes!

@IAlibay
Copy link
Member

IAlibay commented May 27, 2022

Thanks @orioncohen, I'll try to review it in a few

@IAlibay IAlibay self-requested a review May 27, 2022 15:33
Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not read all the docs fully yet, but this will need addressing first.

package/MDAnalysis/core/groups.py Outdated Show resolved Hide resolved
@@ -678,7 +679,17 @@ def _apply(self, group):
if not pattern:
raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
matches = mol.GetSubstructMatches(pattern, useChirality=True)
self.smarts_kwargs.setdefault("useChirality", True)
self.smarts_kwargs.setdefault("maxMatches", len(group) * 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had agreed on max(1000, n_atoms*10) so as not to change the default in cases where you have few atoms?

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM barring Irfan's comment about the default number of matches

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that should be fine now.

@richardjgowers do you want to have a quick glance at the changes I made? Then I think it's good to go.

@IAlibay IAlibay merged commit 0b5b8ab into MDAnalysis:develop Jun 1, 2022
@IAlibay
Copy link
Member

IAlibay commented Jun 1, 2022

Thanks @orioncohen ! Sorry for taking over at the end, I just wanted to get this through for the 2.2.0 release.

@orionarcher
Copy link
Contributor Author

Thanks for wrapping it up @IAlibay, glad this is merged! I've been moving the past couple days so I was AFK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants