Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

increased the maximum number of matches from an rdkit smarts query #3470

Merged
merged 50 commits into from
Jun 1, 2022
Merged
Show file tree
Hide file tree
Changes from 45 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
7a57af1
Fixed issue #2915 and added a pytest to demonstrate the fix. Sphzone …
orionarcher Apr 2, 2021
7a2a155
Merge branch 'issue_#2915' into develop. Updates to selection and sel…
orionarcher Apr 3, 2021
89a38d3
removed a unncesessary TODO statement
orionarcher Apr 3, 2021
a9194a3
added issue #2915 fix to CHANGELOG
orionarcher Apr 3, 2021
dcb23b5
added self to AUTHORS and CHANGELOG, moved lines in selection
orionarcher Apr 3, 2021
889b28d
fixed issue #2915 for cylayer, cyzone, and sphlayer. added testing to…
orionarcher Apr 3, 2021
96b63e3
created decorator to test for empty selections. moved around testing …
orionarcher Apr 3, 2021
d5fde7d
added another blank line to be consistent with PEP-8
orionarcher Apr 3, 2021
8a81f54
seperated testing for empty atom selection for sph* and cy* selection…
orionarcher Apr 4, 2021
a4c7e41
added stacked decorators instead of repeated functionality in one dec…
orionarcher Apr 4, 2021
b6dfd9a
removed the decorator implementation and added code directly into the…
orionarcher Apr 4, 2021
0cc38ad
updated CHANGELOG with more detail
orionarcher Apr 5, 2021
d03d329
Merge branch 'develop' of https://github.com/MDAnalysis/mdanalysis in…
orionarcher May 22, 2021
d0198e1
increased the maximum number of matches from an rdkit smarts query
orionarcher Nov 29, 2021
3ac808c
update changelog
orionarcher Nov 29, 2021
ad874f0
set default max_matches to largest unsigned int and add a configurabl…
orionarcher Nov 30, 2021
b2a9446
add better documentation to the smarts selection query
orionarcher Nov 30, 2021
da317ed
added a test for max_matches behavior
orionarcher Nov 30, 2021
bef38a9
change max_matches -> maxMatches and set default max to group.n_atoms
orionarcher Nov 30, 2021
547d8f6
update CHANGELOG to reflect recent changes
orionarcher Nov 30, 2021
937e580
clean up syntax and expose useChirality kwarg
orionarcher Nov 30, 2021
4e126d1
add better documentation for rdkit_kwargs
orionarcher Nov 30, 2021
086e050
use two kwargs variables to pass kwargs to smarts queries
orionarcher Dec 2, 2021
14d163a
update testing to use smarts_kwargs
orionarcher Dec 2, 2021
d89b131
small bug fix
orionarcher Dec 3, 2021
1d7d898
updated testing to increase code coverage
orionarcher Dec 3, 2021
207e858
cleaned up testing for smarts kwargs
orionarcher Dec 3, 2021
a210d43
updated and clarified documentation
orionarcher Dec 10, 2021
8a67c76
cleaned up code style and used setdefaults rather than if statement
orionarcher Dec 10, 2021
dc5a704
added versionchaged info at 2.0.1
orionarcher Dec 10, 2021
e984be0
returned to default to 1000
orionarcher Jan 14, 2022
cbcba0f
update changelog to reflect no change to defaults
orionarcher Jan 14, 2022
8c252d2
Merge branch 'develop' into increase_smarts_matches
IAlibay Jan 16, 2022
04b415b
Updating to new develop
orionarcher Jan 28, 2022
6317edb
Merge branch 'develop' of https://github.com/MDAnalysis/mdanalysis in…
orionarcher May 16, 2022
8fff72a
Merge branch 'develop' into increase_smarts_matches
orionarcher May 16, 2022
fc129f6
Merge branch 'develop' into increase_smarts_matches
IAlibay May 16, 2022
e57695a
added docstring parameter for smarts_kwargs
orionarcher May 18, 2022
9eabb4c
updated documentation description for smarts query and duplicated in …
orionarcher May 18, 2022
ace0c57
Merge branch 'increase_smarts_matches' of https://github.com/orioncoh…
orionarcher May 18, 2022
ae5c5e6
change default maxMatches to 10 * n_atoms and add test
orionarcher May 19, 2022
1e6eba0
update documentation for smarts kwargs
orionarcher May 19, 2022
e0e28a3
add smarts kwargs warning to docs
orionarcher May 19, 2022
e07f40b
merge develop
orionarcher May 27, 2022
92828c8
Merge branch 'develop' into increase_smarts_matches
orionarcher May 27, 2022
9574661
Merge branch 'develop' into increase_smarts_matches
IAlibay May 31, 2022
e38e219
various doc improvements and maxMatches fix
IAlibay Jun 1, 2022
e7d6a68
update selections.rst
IAlibay Jun 1, 2022
30c3f35
Merge branch 'develop' into increase_smarts_matches
IAlibay Jun 1, 2022
c739404
fix tests
IAlibay Jun 1, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 10 additions & 8 deletions package/CHANGELOG
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ The rules for this file:
HenokB, umak1106, tamandeeps, Mrqeoqqt, megosato, AnirG, rishu235,
mtiberti, manishsaini6421, Sukeerti1, robotjellyzone, markvrma, alescoulie,
mjtadema, PicoCentauri, Atharva7K, aditi2906, orbeckst, yuxuanzhuang,
rsexton2
rsexton2, orioncohen

* 2.2.0

Expand All @@ -29,7 +29,7 @@ Fixes
* Fix the competition when generating offsets from multiple processes by
using a interprocess file lock. (Issue #1988, PR #3375)
* Fix the error when trying to load a corrupted offset file (Issue #3230
PR #3375)
PR #3375)
* Fixed `OpenMMTopologyParser` not working when some or none of the elements
are present in the openmm topology (Issue #3317, PR #3511).
* Remove None from filenames in analysis.hole.HoleAnalysis (PR #3650)
Expand All @@ -39,8 +39,8 @@ Fixes
an external reference (Issue #3539, PR #3621)
* Fixed LinearDensity not working with grouping='residues' or 'segments'
(Issue #3571, PR #3572)
* LinearDensity now uses the definition of Avogadro's number from
MDAnalysis.units, which has more significant digits.
* LinearDensity now uses the definition of Avogadro's number from
MDAnalysis.units, which has more significant digits.
(Issue #3571, PR #3572)
* MOL2 can read files without providing optional columns:
subst_id, subst_name and charge. (Issue #3385, PR #3598)
Expand All @@ -58,8 +58,8 @@ Fixes
* Fixed BAT method Cartesian modifies input data. (Issue #3501)

Enhancements
* Added `frames` argument to AnalysisBase.run to allow analysis to run on
arbitrary list of frames (Issue #1985)
* Added `frames` argument to AnalysisBase.run to allow analysis to run on
arbitrary list of frames (Issue #1985)
* LinearDensity now works with updating AtomGroups (Issue #2508, PR #3617)
* Link PMDA in documentation (PR #3652)
* Added MSD example where multiple replicates are combined(Issue #3578
Expand All @@ -77,13 +77,15 @@ Enhancements
* CRD extended format can now be explicitly requested with the `extended`
keyword (Issue #3605)
* Added benchmarks in lib.distances for enhancement (Issue #3205, PR #3641)
* Add smarts_kwargs to allow greater flexiblity with smarts atom selection
(Issue #3469, PR #3470).

Changes
* `fasteners` package is now a core dependency (Issues #3230, #1988, PR #3375)
* The minimm Python, NumPy, and SciPy versions (following NEP29) are now 3.8,
* The minimm Python, NumPy, and SciPy versions (following NEP29) are now 3.8,
1.19, and 1.5.0; GSD now also has a minimum version of 1.9.3 and networkx
has minimum version 2.0 (Issue #3665)
* LinearDensity now saves the histogram bin edges for easier plotting as
* LinearDensity now saves the histogram bin edges for easier plotting as
"hist_bin_edges" for each dimension xyz in the results dictionary.
(Issue #2508, PR #3617)
* ITPParser no longer adds an angle for water molecules that have the SETTLE
Expand Down
30 changes: 27 additions & 3 deletions package/MDAnalysis/core/groups.py
Original file line number Diff line number Diff line change
Expand Up @@ -2859,7 +2859,7 @@ def ts(self):

def select_atoms(self, sel, *othersel, periodic=True, rtol=1e-05,
atol=1e-08, updating=False, sorted=True,
rdkit_kwargs=None, **selgroups):
rdkit_kwargs=None, smarts_kwargs=None, **selgroups):
"""Select atoms from within this Group using a selection string.

Returns an :class:`AtomGroup` sorted according to their index in the
Expand Down Expand Up @@ -2892,6 +2892,10 @@ def select_atoms(self, sel, *othersel, periodic=True, rtol=1e-05,
Arguments passed to the
:class:`~MDAnalysis.converters.RDKit.RDKitConverter` when using
selection based on SMARTS queries
smarts_kwargs : dict (optional)
Arguments passed internally to RDKit's `GetSubstructMatches
<https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.GetSubstructMatches>`_.

**selgroups : keyword arguments of str: AtomGroup (optional)
when using the "group" keyword in selections, groups are defined by
passing them as keyword arguments. See section on **preexisting
Expand Down Expand Up @@ -3010,7 +3014,23 @@ def select_atoms(self, sel, *othersel, periodic=True, rtol=1e-05,
smarts *SMARTS-query*
select atoms using Daylight's SMARTS queries, e.g. ``smarts
[#7;R]`` to find nitrogen atoms in rings. Requires RDKit.
All matches (max 1000) are combined as a unique match
All matches are combined as a single unique match. The `smarts`
selection accepts two sets of key word arguments from
`select_atoms()`: the ``rdkit_kwargs`` are passed internally to
`RDKitConverter.convert()` and the ``smarts_kwargs`` are passed to
RDKit's `GetSubstructMatches
<https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.GetSubstructMatches>`_.
By default, the `useChirality` kwarg in ``rdkit_kwargs`` is set to true
and maxMatches in ``smarts_kwargs`` is ``10 * len(AtomGroup)`` or
``10 * len(Universe.atoms)``, whichever is applicable.
Note that the number of matches can occasionally exceed the default
value of maxMatches, causing too few atoms to be returned. If this
occurs, a warning will be issued. The problem can be fixed by increasing
the value of maxMatches. This behavior may be updated in the future.

>>> universe.select_atoms("C", smarts_kwargs={"maxMatches": 100})
<AtomGroup with 100 atoms>

chiral *R | S*
select a particular stereocenter. e.g. ``name C and chirality
S`` to select only S-chiral carbon atoms. Only ``R`` and
Expand Down Expand Up @@ -3166,6 +3186,9 @@ def select_atoms(self, sel, *othersel, periodic=True, rtol=1e-05,
Added the *smarts* selection. Added `atol` and `rtol` keywords
to select float values. Added the ``sort`` keyword. Added
`rdkit_kwargs` to pass parameters to the RDKitConverter.
.. versionchanged:: 2.0.1
IAlibay marked this conversation as resolved.
Show resolved Hide resolved
Added `smarts_kwargs` to pass parameters to the RDKit
GetSubstructMatch for *smarts* selection.
"""

if not sel:
Expand All @@ -3185,7 +3208,8 @@ def select_atoms(self, sel, *othersel, periodic=True, rtol=1e-05,
periodic=periodic,
atol=atol, rtol=rtol,
sorted=sorted,
rdkit_kwargs=rdkit_kwargs)
rdkit_kwargs=rdkit_kwargs,
orionarcher marked this conversation as resolved.
Show resolved Hide resolved
smarts_kwargs=smarts_kwargs)
for s in sel_strs))
if updating:
atomgrp = UpdatingAtomGroup(self, selections, sel_strs)
Expand Down
20 changes: 17 additions & 3 deletions package/MDAnalysis/core/selection.py
Original file line number Diff line number Diff line change
Expand Up @@ -665,6 +665,7 @@ def __init__(self, parser, tokens):
pattern.append(val)
self.pattern = "".join(pattern)
self.rdkit_kwargs = parser.rdkit_kwargs
self.smarts_kwargs = parser.smarts_kwargs

def _apply(self, group):
try:
Expand All @@ -678,7 +679,17 @@ def _apply(self, group):
if not pattern:
raise ValueError(f"{self.pattern!r} is not a valid SMARTS query")
mol = group.convert_to("RDKIT", **self.rdkit_kwargs)
matches = mol.GetSubstructMatches(pattern, useChirality=True)
self.smarts_kwargs.setdefault("useChirality", True)
self.smarts_kwargs.setdefault("maxMatches", len(group) * 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had agreed on max(1000, n_atoms*10) so as not to change the default in cases where you have few atoms?

matches = mol.GetSubstructMatches(pattern, **self.smarts_kwargs)
if len(matches) == self.smarts_kwargs["maxMatches"]:
warnings.warn("Your smarts-based atom selection returned the max"
"number of matches. This indicates that not all"
"matching atoms were selected. When calling"
"atom_group.select_atoms(), the default value"
"of maxMatches is len(atom_group * 10). To fix, "
"add the following argument to select_atoms: \n"
"smarts_kwargs={maxMatches: <higher_value>}")
# flatten all matches and remove duplicated indices
indices = np.unique([idx for match in matches for idx in match])
# create boolean mask for atoms based on index
Expand Down Expand Up @@ -1408,7 +1419,7 @@ def expect(self, token):
"".format(self.tokens[0], token))

def parse(self, selectstr, selgroups, periodic=None, atol=1e-08,
rtol=1e-05, sorted=True, rdkit_kwargs=None):
rtol=1e-05, sorted=True, rdkit_kwargs=None, smarts_kwargs=None):
orionarcher marked this conversation as resolved.
Show resolved Hide resolved
"""Create a Selection object from a string.

Parameters
Expand All @@ -1431,7 +1442,9 @@ def parse(self, selectstr, selgroups, periodic=None, atol=1e-08,
rdkit_kwargs : dict, optional
Arguments passed to the RDKitConverter when using selection based
on SMARTS queries

smarts_kwargs : dict, optional
Arguments passed internally to RDKit's `GetSubstructMatches
<https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.GetSubstructMatches>`_.

Returns
-------
Expand All @@ -1453,6 +1466,7 @@ def parse(self, selectstr, selgroups, periodic=None, atol=1e-08,
self.rtol = rtol
self.sorted = sorted
self.rdkit_kwargs = rdkit_kwargs or {}
self.smarts_kwargs = smarts_kwargs or {}

self.selectstr = selectstr
self.selgroups = selgroups
Expand Down
18 changes: 15 additions & 3 deletions package/doc/sphinx/source/documentation_pages/selections.rst
Original file line number Diff line number Diff line change
Expand Up @@ -164,9 +164,21 @@ moltype *molecule-type*
the TPR format defines the molecule type.

smarts *SMARTS-query*
select atoms using Daylight's SMARTS queries, e.g. ``smarts [#7;R]`` to
find nitrogen atoms in rings. Requires RDKit. All matches (max 1000) are
combined as a unique match.
select atoms using Daylight's SMARTS queries, e.g. ``smarts
[#7;R]`` to find nitrogen atoms in rings. Requires RDKit.
All matches are combined as a single unique match. The `smarts`
selection accepts two sets of key word arguments from
`select_atoms()`: the ``rdkit_kwargs`` are passed internally to
`RDKitConverter.convert()` and the ``smarts_kwargs`` are passed to
RDKit's `GetSubstructMatches
<https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol.GetSubstructMatches>`_.
By default, the `useChirality` kwarg in ``rdkit_kwargs`` is set to true
and maxMatches in ``smarts_kwargs`` is ``10 * len(AtomGroup)`` or
``10 * len(Universe.atoms)``, whichever is applicable.
Note that the number of matches can occasionally exceed the default
value of maxMatches, causing too few atoms to be returned. If this
occurs, a warning will be issued. The problem can be fixed by increasing
the value of maxMatches. This behavior may be updated in the future.

chiral *R | S*
select a particular stereocenter. e.g. ``name C and chirality S``
Expand Down
19 changes: 18 additions & 1 deletion testsuite/MDAnalysisTests/core/test_atomselections.py
Original file line number Diff line number Diff line change
Expand Up @@ -575,11 +575,28 @@ def test_invalid_smarts_sel_raises_error(self, u2):
with pytest.raises(ValueError, match="not a valid SMARTS"):
u2.select_atoms("smarts foo")

def test_passing_args_to_converter(self):
def test_passing_rdkit_kwargs_to_converter(self):
u = mda.Universe.from_smiles("O=C=O")
sel = u.select_atoms("smarts [$(O=C)]", rdkit_kwargs=dict(force=True))
assert sel.n_atoms == 2

def test_passing_max_matches_to_converter(self, u2):
sel = u2.select_atoms("smarts C", smarts_kwargs=dict(maxMatches=2))
assert sel.n_atoms == 2
sel2 = u2.select_atoms("smarts c")
assert sel2.n_atoms == 4
with pytest.warns(UserWarning) as warnings:
u2.select_atoms("smarts C", smarts_kwargs=dict(maxMatches=2))
u2.select_atoms("smarts C", smarts_kwargs=dict(maxMatches=1000))
assert len(warnings) == 1

def test_passing_use_chirality_to_converter(self):
u = mda.Universe.from_smiles("CC[C@H](C)O")
sel3 = u.select_atoms("byres smarts CC[C@@H](C)O")
assert sel3.n_atoms == 0
sel4 = u.select_atoms("byres smarts CC[C@@H](C)O", smarts_kwargs={"useChirality": False})
assert sel4.n_atoms == 15


class TestSelectionsNucleicAcids(object):
@pytest.fixture(scope='class')
Expand Down