increased the maximum number of matches from an rdkit smarts query #3470

orionarcher · 2021-11-29T20:35:08Z

Related to Issue #3469.

Changes made in this Pull Request:

- Increased the maximum number of matches from a SMARTS selection query from 1000 to n_atoms.
- Separated the kwargs for GetSubstructMatches and convert_to.
- Updated the documentation to clarify kwarg usage.

pep8speaks · 2021-11-29T20:35:12Z

Hello @orioncohen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/core/groups.py:

Line 3022:80: E501 line too long (82 > 79 characters)
Line 3025:80: E501 line too long (87 > 79 characters)
Line 3035:80: E501 line too long (81 > 79 characters)

In the file testsuite/MDAnalysisTests/core/test_atomselections.py:

Line 598:80: E501 line too long (97 > 79 characters)

Comment last updated at 2022-06-01 12:56:56 UTC

codecov · 2021-11-29T20:59:51Z

Codecov Report

Merging #3470 (c739404) into develop (eea743f) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #3470      +/-   ##
===========================================
+ Coverage    94.33%   94.35%   +0.02%     
===========================================
  Files          191      191              
  Lines        24917    24975      +58     
  Branches      3357     3365       +8     
===========================================
+ Hits         23505    23565      +60     
+ Misses        1364     1362       -2     
  Partials        48       48

Impacted Files	Coverage Δ
package/MDAnalysis/core/groups.py	`98.58% <ø> (ø)`
package/MDAnalysis/core/selection.py	`98.81% <100.00%> (+<0.01%)`	⬆️
package/MDAnalysis/coordinates/DCD.py	`100.00% <0.00%> (ø)`
package/MDAnalysis/coordinates/XDR.py	`100.00% <0.00%> (ø)`
package/MDAnalysis/coordinates/DLPoly.py	`98.78% <0.00%> (+<0.01%)`	⬆️
package/MDAnalysis/coordinates/H5MD.py	`97.61% <0.00%> (+0.01%)`	⬆️
package/MDAnalysis/coordinates/TRJ.py	`97.90% <0.00%> (+0.01%)`	⬆️
package/MDAnalysis/coordinates/memory.py	`98.72% <0.00%> (+0.01%)`	⬆️
package/MDAnalysis/coordinates/PDB.py	`94.84% <0.00%> (+0.02%)`	⬆️
package/MDAnalysis/coordinates/chemfiles.py	`97.25% <0.00%> (+0.03%)`	⬆️
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eea743f...c739404.

lilyminium · 2021-11-29T21:24:19Z

10k is also somewhat arbitrary. The OpenFF toolkit uses the largest unsigned int, could that work too? (np.iinfo(np.uintc).max)
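For reference, the value meant here (the largest C `unsigned int`) can be inspected without NumPy. This is a minimal standard-library sketch; the variable names are ours, and the width of `unsigned int` is whatever the local platform uses.

```python
import ctypes

# Width of the platform's C `unsigned int` in bits (commonly 32).
bits = 8 * ctypes.sizeof(ctypes.c_uint)

# Largest value that fits in it; np.iinfo(np.uintc).max reports the same quantity.
max_unsigned_int = 2**bits - 1

print(bits, max_unsigned_int)
```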

IAlibay · 2021-11-29T22:01:24Z

I'm not super into the idea of hardcoding stuff like this in. @lilyminium did OFF do any related benchmarks on the cost of this or was it just done for the sake of completeness without performance in mind?

If this is something that will regularly be used maybe we should just expose the argument to users?

lilyminium · 2021-11-29T22:18:01Z

@IAlibay I doubt it, and tbh it's difficult to hit with small molecules anyway. We should probably just expose the argument. IIRC OpenFF has merged this too now, max_matches=None is what gets you the largest unsigned int.

Edit: worth noting that 1000 is RDKit's own default limit, and I like setting it to the highest possible number as a default instead.

orionarcher · 2021-11-30T00:40:24Z

Thank you for the feedback. I set the default value to the largest unsigned int, added an optional max_matches argument to the rdkit_kwargs, wrote a test to confirm behavior, and added some documentation and an example in the docstring of atoms.select_atoms.

> I'm not super into the idea of hardcoding stuff like this in.

As @lilyminium pointed out, it is otherwise hardcoded to 1000 by RDKit's default. I tend to think a higher hardcoded value is preferable.

Please let me know if this needs anything else!

richardjgowers · 2021-11-30T11:54:15Z

It's probably worth holding off and seeing what the person who put the limit in thinks (@cbouy )

The comparisons to OFF are a little misleading. A design difference between OFF and MDA is that OFF templates molecules and only needs to perform matches on the unique molecule types (e.g. there's one water stereotype that water molecules point to), whereas MDA just brute-forces this. This means that selections like this one scale differently, and OFF will likely be better at these things.

IAlibay · 2021-11-30T11:56:43Z

> It's probably worth holding off and seeing what the person who put the limit in thinks (@cbouy )

Not that I don't want @cbouy's opinion on this (given that it's his original code), but in his defense isn't GetSubstructMatches an RDKit method? (i.e. it's just a built-in default)

IAlibay · 2021-11-30T12:00:07Z

Seems like test failures are chemfiles related - seems like the feedstock got updated just an hour ago so this PR probably just ran at a bad time. Let's see if it re-runs a bit later in the day.

cbouy · 2021-11-30T12:24:11Z

I didn't enforce any limit, I just mentioned in the docs that RDKit's default is 1000 unique matches.

Exposing max_matches is a good idea, although I would keep the RDKit syntax for it: maxMatches.
I don't particularly like camel-case in python but since users will have to type rdkit_kwargs=... I think it's better to stay consistent with RDKit, as we've done with other RDKit-based functions.

As for a good default value, why not use maxMatches=mol.GetNumAtoms() instead? This way you consistently get the maximum number of actual matches and not an arbitrary number.
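The suggestion above can be sketched as a small wrapper. This is an illustration, not the code in the PR: `get_matches` is a hypothetical helper, and `mol` is assumed to be any object exposing RDKit's `GetNumAtoms` and `GetSubstructMatches` methods (for example an `rdkit.Chem.Mol`).

```python
def get_matches(mol, pattern, **rdkit_kwargs):
    """Call GetSubstructMatches with a match limit tied to the molecule size.

    `get_matches` is a hypothetical helper, not part of MDAnalysis or RDKit.
    """
    # Defaults proposed in the discussion; RDKit's own default is maxMatches=1000.
    kwargs = {"useChirality": True, "maxMatches": mol.GetNumAtoms()}
    # User-supplied keywords keep RDKit's camel-case names and take precedence.
    kwargs.update(rdkit_kwargs)
    return mol.GetSubstructMatches(pattern, **kwargs)
```

With RDKit installed, `get_matches(Chem.MolFromSmiles("CCO"), Chem.MolFromSmarts("[#6]"))` would pass `maxMatches=3`, the number of atoms in the parsed molecule, while `get_matches(mol, pattern, maxMatches=50)` would override that default.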

orionarcher · 2022-04-07T19:31:51Z

@orbeckst correct, this PR now serves to add additional kwargs. Namely, it creates the smarts_kwargs argument in select_atoms, which lets users tailor the behavior of the smarts keyword by passing arguments to RDKit's GetSubstructMatches function. Previously, there was no way to override the default values. This has caused me many issues: when trying to select specific atoms I regularly hit the default limit of 1000 matches. I think this will be broadly useful to users who want to use smarts on large systems.

I'll resolve the conflicts next week and push the changes.

orbeckst · 2022-05-12T10:24:44Z

@orioncohen any chance to make some time in your schedule to finish up this PR? Would be good to put a bow on it.

orionarcher · 2022-05-16T00:43:05Z

Thanks for the bump @orbeckst. I merged in the develop branch and this should be good to go.

IAlibay

A couple of things. Overall LGTM; however, I would be OK with swapping the default to max(1000, n_atoms), as you suggested.

IAlibay · 2022-05-16T13:07:56Z

package/MDAnalysis/core/selection.py

@@ -678,7 +679,9 @@ def _apply(self, group):
        if not pattern:
            raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
        mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
-        matches = mol.GetSubstructMatches(pattern, useChirality=True)
+        self.smarts_kwargs.setdefault("useChirality", True)
+        self.smarts_kwargs.setdefault("maxMatches", 1000)


So re-reading the various comments, I actually quite like your suggestion (#3470 (comment)) to use max(1000, n_atoms) as the default. I would be happy with that being implemented here, skipping the warning / deprecation with the excuse that it's technically a "bug" in large systems.

orionarcher · 2022-05-18

I might actually suggest raising it to max(1000, 10 * n_atoms). When using SMARTS selection I've found that n_atoms is often too low to capture all matches. While 10 * n_atoms is a bit arbitrary, I believe it would work better for many users. We could also set it to some extremely high number and leave users responsible for avoiding a selection like CC, which would be very slow.

IAlibay · 2022-05-18

Do we have a sense for how frequently this happens at n_atoms vs 10 * n_atoms, and how large a cost each order of magnitude adds? Is there a way this could easily be benchmarked?

If we think we'll still get a lot of problem cases unless we go to a number that is prohibitively expensive, we might want to inspect the returned number of matches and raise a warning if maxMatches == len(matches).

orionarcher · 2022-05-19

I ran a quick benchmark, and the match limit seems to have a negligible impact on the overall time of a select_atoms call. The system I explored has ~8000 atoms. I compared

u.select_atoms('smarts C', smarts_kwargs={'maxMatches': 10})

and

u.select_atoms('smarts C', smarts_kwargs={'maxMatches': 100000})

The former returned 10 atoms in 718 ms and the latter returned 2035 atoms (the correct count) in 728 ms. It seems that the Python code is much slower than RDKit's compiled substructure matching. Except for very large systems, I doubt the substructure matching will be an issue.

orionarcher · 2022-05-19

I'm not sure how frequently n_atoms is insufficient. I wasn't able to reproduce the issue on the system I have on hand, but I have encountered it in the past.
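A comparison like the one above can be repeated with a small timing helper. This is a sketch under assumptions: `time_call` is a hypothetical name, and the workload shown is a stand-in so the block runs without MDAnalysis or a trajectory.

```python
import time


def time_call(func, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace the lambda with the selection being measured.
elapsed = time_call(lambda: sorted(range(10_000)))
print(f"best of 5: {elapsed * 1000:.3f} ms")
```

With a loaded Universe `u`, the two cases could be timed as `time_call(lambda: u.select_atoms("smarts C", smarts_kwargs={"maxMatches": 10}))` and the same call with `100000`. Taking the best of several repeats reduces noise from other processes.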

IAlibay · 2022-05-19

Ok yeah, that makes sense: it'll just be a loop that terminates once there are no further matches, so you won't get any difference in performance unless maxMatches < all possible matches.

I think we should do the following:

1. Set maxMatches to some arbitrary high value if we think that's beneficial in most cases (10 * n_atoms if you want for now).
2. Put up a warning in the docs detailing a) the issue, explaining what happens when you run out of allowable matches, b) how to fix it, and c) the fact that this is currently under refinement and the maximum number of matches may grow / reduce in future versions.
3. Issue a warning if the number of matches == maxMatches, telling users that the maximum number of matches was returned and that they may need to increase the limit. (@jbarnoud @lilyminium I'm terrible at adding extra warnings - you both have stronger views on these things, are you for/against this?)

orionarcher · 2022-05-19

That all sounds good to me. A warning seems wise because a) matches == maxMatches is almost certainly not the intended behavior, and b) otherwise it would be a silent error that would be tricky to track down.

orionarcher · 2022-05-19

I just pushed an implementation of 1-3.
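The warning proposed in this thread could look roughly like the following. It is a sketch, not the merged implementation: `warn_if_truncated` is a hypothetical helper and the message wording is ours.

```python
import warnings


def warn_if_truncated(matches, max_matches):
    """Warn when the number of matches equals the configured limit.

    Hypothetical helper; returns True if a warning was issued.
    """
    if len(matches) == max_matches:
        warnings.warn(
            f"The SMARTS query returned the maximum number of matches "
            f"({max_matches}); some matches may be missing. Consider "
            f"passing a larger 'maxMatches' through smarts_kwargs.",
            UserWarning,
        )
        return True
    return False
```

Hitting the limit exactly does not prove that matches were dropped, which is why this warns rather than raises.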

orionarcher · 2022-05-27T15:32:41Z

I implemented @IAlibay's suggested changes and fixed the conflicts with CHANGELOG. I'd love to get this merged if there are no more changes!

IAlibay · 2022-05-27T15:33:28Z

Thanks @orioncohen, I'll try to review it in a few

IAlibay

I've not read all the docs fully yet, but this will need addressing first.

IAlibay · 2022-05-31T05:38:51Z

package/MDAnalysis/core/selection.py

@@ -678,7 +679,17 @@ def _apply(self, group):
        if not pattern:
            raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
        mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
-        matches = mol.GetSubstructMatches(pattern, useChirality=True)
+        self.smarts_kwargs.setdefault("useChirality", True)
+        self.smarts_kwargs.setdefault("maxMatches", len(group) * 10)


I thought we had agreed on max(1000, n_atoms*10) so as not to change the default in cases where you have few atoms?
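The difference between the two defaults only shows up for small groups. A minimal sketch, assuming the rule discussed in this review; `default_max_matches` is a hypothetical name, not a function in MDAnalysis.

```python
def default_max_matches(n_atoms):
    """Match limit discussed in review: 10 * n_atoms, but never below 1000."""
    return max(1000, 10 * n_atoms)


# Compare the plain 10 * n_atoms rule with the floored version.
for n_atoms in (50, 100, 8000):
    print(n_atoms, 10 * n_atoms, default_max_matches(n_atoms))
```

For a 50-atom group, `len(group) * 10` gives 500, which is lower than RDKit's existing default of 1000, whereas the floored rule keeps 1000. For the ~8000-atom system mentioned earlier, both give 80000.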

richardjgowers

LGTM barring Irfan's comment about the default number of matches

IAlibay

Ok that should be fine now.

@richardjgowers do you want to have a quick glance at the changes I made? Then I think it's good to go.

IAlibay · 2022-06-01T14:39:48Z

Thanks @orioncohen ! Sorry for taking over at the end, I just wanted to get this through for the 2.2.0 release.

orionarcher · 2022-06-02T16:55:11Z

Thanks for wrapping it up @IAlibay, glad this is merged! I've been moving the past couple days so I was AFK.

orionarcher added 15 commits April 2, 2021 16:49

Fixed issue MDAnalysis#2915 and added a pytest to demonstrate the fix…

7a57af1

…. Sphzone operating on an empty selection now returns an empty atom group.

Merge branch 'issue_#2915' into develop. Updates to selection and sel…

7a2a155

…ection tests to fix issue_#2915.

removed a unncesessary TODO statement

89a38d3

added issue MDAnalysis#2915 fix to CHANGELOG

a9194a3

added self to AUTHORS and CHANGELOG, moved lines in selection

dcb23b5

fixed issue MDAnalysis#2915 for cylayer, cyzone, and sphlayer. added …

889b28d

…testing to confirm.

created decorator to test for empty selections. moved around testing …

96b63e3

…to be more consistent.

added another blank line to be consistent with PEP-8

d5fde7d

seperated testing for empty atom selection for sph* and cy* selection…

8a81f54

… tokens

added stacked decorators instead of repeated functionality in one dec…

a4c7e41

…orator

removed the decorator implementation and added code directly into the…

b6dfd9a

… functions

updated CHANGELOG with more detail

0cc38ad

Merge branch 'develop' of https://github.com/MDAnalysis/mdanalysis in…

d03d329

…to develop

increased the maximum number of matches from an rdkit smarts query

d0198e1

update changelog

3ac808c

github-actions bot added the Component-Core label Nov 29, 2021

orionarcher mentioned this pull request Nov 29, 2021

Increase SMARTS query maximum matches #3469

Closed

orionarcher added 2 commits November 29, 2021 16:36

set default max_matches to largest unsigned int and add a configurabl…

ad874f0

…e parameters

add better documentation to the smarts selection query

b2a9446

added a test for max_matches behavior

da317ed

orionarcher added 2 commits May 15, 2022 17:23

Merge branch 'develop' of https://github.com/MDAnalysis/mdanalysis in…

6317edb

…to develop

Merge branch 'develop' into increase_smarts_matches

8fff72a

# Conflicts: # package/CHANGELOG # package/MDAnalysis/core/selection.py

Merge branch 'develop' into increase_smarts_matches

fc129f6

IAlibay requested changes May 16, 2022

orionarcher added 8 commits May 18, 2022 15:55

added docstring parameter for smarts_kwargs

e57695a

updated documentation description for smarts query and duplicated in …

9eabb4c

…sphinx docs

Merge branch 'increase_smarts_matches' of https://github.com/orioncoh…

ace0c57

…en/mdanalysis into increase_smarts_matches

change default maxMatches to 10 * n_atoms and add test

ae5c5e6

update documentation for smarts kwargs

1e6eba0

add smarts kwargs warning to docs

e0e28a3

merge develop

e07f40b

Merge branch 'develop' of https://github.com/MDAnalysis/mdanalysis into develop

Merge branch 'develop' into increase_smarts_matches

92828c8

# Conflicts: # package/CHANGELOG

IAlibay self-requested a review May 27, 2022 15:33

IAlibay requested changes May 31, 2022

Merge branch 'develop' into increase_smarts_matches

9574661

richardjgowers approved these changes Jun 1, 2022

IAlibay and others added 4 commits June 1, 2022 13:01

various doc improvements and maxMatches fix

e38e219

update selections.rst

e7d6a68

Merge branch 'develop' into increase_smarts_matches

30c3f35

fix tests

c739404

IAlibay approved these changes Jun 1, 2022

IAlibay merged commit 0b5b8ab into MDAnalysis:develop Jun 1, 2022

IAlibay added the enhancement label Sep 25, 2023
