Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groupby method for GroupBase #1112

Merged
merged 4 commits into from
Jan 12, 2017
Merged

groupby method for GroupBase #1112

merged 4 commits into from
Jan 12, 2017

Conversation

dotsdl
Copy link
Member

@dotsdl dotsdl commented Dec 6, 2016

Addresses #1105

Changes made in this Pull Request:

  • added a general-purpose groupby method to GroupBase that allows for familiar groupby operations on single topology attributes a la pandas or datreant

e.g.

import MDAnalysis as mda

u = mda.fetch_mmtf('1nmu')

# exclude all HETATM entries
protein = u.select_atoms('protein')

# make a selection across chains
sel = protein.select_atoms('(segid A and resid 10-50) or (segid C and resid 51-100)')
for seg in sel.segments:
    print(len(seg.residues.resids), seg.residues.resids[:10])

and then we can do:

>>> sel.residues.groupby('segids')
{u'A': <ResidueGroup with 40 residues>, u'C': <ResidueGroup with 49 residues>}

>>> sel.residues.groupby('resnames')
{'ALA': <ResidueGroup with 8 residues>,
 'ARG': <ResidueGroup with 2 residues>,
 'ASN': <ResidueGroup with 2 residues>,
 'ASP': <ResidueGroup with 9 residues>,
 'GLN': <ResidueGroup with 3 residues>,
 'GLU': <ResidueGroup with 6 residues>,
 'GLY': <ResidueGroup with 10 residues>,
 'HIS': <ResidueGroup with 2 residues>,
 'ILE': <ResidueGroup with 5 residues>,
 'LEU': <ResidueGroup with 5 residues>,
 'LYS': <ResidueGroup with 9 residues>,
 'PHE': <ResidueGroup with 6 residues>,
 'PRO': <ResidueGroup with 5 residues>,
 'SER': <ResidueGroup with 1 residues>,
 'THR': <ResidueGroup with 5 residues>,
 'TRP': <ResidueGroup with 3 residues>,
 'TYR': <ResidueGroup with 4 residues>,
 'VAL': <ResidueGroup with 4 residues>}

>>> sel.groupby('masses')
{12.010999999999999: <AtomGroup with 462 atoms>,
 14.007: <AtomGroup with 116 atoms>,
 15.999000000000001: <AtomGroup with 134 atoms>}

before merging, it would be real nice if multiple attributes could be given so that groupings can be done on combinations of values.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@dotsdl
Copy link
Member Author

dotsdl commented Dec 6, 2016

Not sure why my branch has stupid merge commits...hmmm.

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Will need some tests for different data types (object, int, float). There's MDAnalysisTests.core.groupbase.make_Universe to mock a Universe with arbitrary attributes.

With floats, the comparison between them might be a little fragile (esp. for something like derived quantities), so you could use numpy.isclose rather than ==.

Once/If we have NaN values for missing values, we'll need to handle this here too as NaN != NaN, so maybe add a comment about that so we don't forget

@@ -816,6 +816,24 @@ def wrap(self, compound="atoms", center="com", box=None):
if not all(s == 0.0):
o.atoms.translate(s)

def groupby(self, topattr):
"""Obtain groupings of the components of this Group according the values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a one liner, so something like 'group according to a given attribute'

----------
topattr: str
Topology attribute to group components by.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an example to the docstring too (just what you put in the PR would do)

Unique values of the topology attribute as keys, Groups as values.

"""
return {i: self[self.__getattribute__(topattr) == i] for i in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we're fooling around with __getattr__ methods in our classes, it might be safer to use getattr(self, topattr) which is more like calling self.topattr (rather than directly going to the class method). I'm not sure if calling the class method directly could bypass something

@richardjgowers richardjgowers self-assigned this Dec 6, 2016
@kain88-de
Copy link
Member

How come that sel.residues.groupby('segid') works now? I thought from @richardjgowers earlier comments that residues here has no idea about the original selection, see #1105 (comment) .

Otherwise the groupby is a nice idea. They seems to allow a lot of sub selections on AtomGroups, selecting residues could then be done with just a groupby `ag.groupby('residues')'.

@kain88-de
Copy link
Member

For floating point fields lke masses we should allow a eps kwarg to define a epsilon when we regard two values as close. As default I'd suggest 1e-8

@kain88-de
Copy link
Member

Where should general docs for this go? I would say they belong to selections. There we can also add some tips about common misconceptions of the segment and residues selections on the python objects (not using a select_atoms)

@orbeckst
Copy link
Member

orbeckst commented Dec 7, 2016

For floating point fields lke masses we should allow a eps kwarg to define a epsilon when we regard two values as close. As default I'd suggest 1e-8

Shouldn't this depend on the precision of the data type? You could default to maybe 10 * machine_precision. numpy.finfo().eps might be helpful.

@kain88-de
Copy link
Member

We could do that. But then you also have to compare the types and choose the epsilon of the least accurate type. Then there is also the issue that we should actually use the accuracy of the file format the value originally comes from. For example the occupancy entries in the PDB are only accurate upto 2 decimal places. I thought having a single eps argument defaulting to 1e-8 (float 32bit) would be easier and be just as good 90% of the time.

@richardjgowers
Copy link
Member

  R1   R2
 / \  /  \
A1 A2 A3 A4
   \  /
    AG

@kain88-de if you have a system with 4 atoms, 2 residues of size 2, and an AtomGroup with atoms 2&3 (see picture), then you can go upwards to Residues and not leave the scope of the AtomGroup, but if you come downwards from Residues then you will leave the AtomGroup.

So ag.residues.groupby('segids') is looking upwards, whereas ag.segments.residues is looking downwards.

Or another way of thinking about it, the memory of the original selection is kept going upwards because there is only many-to-one mappings, whereas downwards there is one-to-many mappings.

@orbeckst
Copy link
Member

orbeckst commented Dec 8, 2016

#1112 (comment)

For example the occupancy entries in the PDB are only accurate upto 2 decimal places. I thought having a single eps argument defaulting to 1e-8 (float 32bit) would be easier and be just as good 90% of the time.

I concede the point – let the user determine the level of equivalence. For XTC coordinates, 1e-3...

Note that float32 epsilon is

In [4]: numpy.finfo(dtype=numpy.float32).eps
Out[4]: 1.1920929e-07

In [6]: numpy.finfo(dtype=numpy.float32).precision
Out[6]: 6

so I'd be a bit more generous with the default, say 1e-07 or even 1e-06.

@orbeckst
Copy link
Member

orbeckst commented Dec 8, 2016

Regarding docs #1112 (comment):

Where should general docs for this go? I would say they belong to selections. There we can also add some tips about common misconceptions of the segment and residues selections on the python objects (not using a select_atoms)

The 0.15.0 docs had a pretty sizable narrative on using and working with the different levels in the container hierarchy under Fundamental building blocks — MDAnalysis.core.AtomGroup which seems to have been removed from the 0.16.0 docs during the topology reboot. It might be worthwhile putting that back in a separate reST file in the sphinx doc near the beginning (not under the new Core objects: Containers — MDAnalysis.core.groups which can stay lean and API-focused). This would then also be a good place for some the docs discussed here.

@richardjgowers
Copy link
Member

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/groups.py#L1250

I noticed we have a AG.split method which is just a specialised groupby. Might be a good idea to make split use groupby to avoid some repetition.

@richardjgowers richardjgowers added this to the 0.16.0 milestone Dec 19, 2016
@kain88-de
Copy link
Member

@dotsdl & @richardjgowers what still needs to be done here?

@richardjgowers
Copy link
Member

Tests!

@dotsdl
Copy link
Member Author

dotsdl commented Dec 22, 2016

Sorry all, been traveling for the holiday. It's everything at once these days I'm afraid. :/

@kain88-de kain88-de removed this from the 0.16.0 milestone Jan 2, 2017
@richardjgowers richardjgowers added this to the 0.16.0 milestone Jan 10, 2017
richardjgowers added a commit that referenced this pull request Jan 10, 2017
@richardjgowers richardjgowers changed the title WIP: groupby method for GroupBase groupby method for GroupBase Jan 10, 2017
@richardjgowers
Copy link
Member

Ok I've finished this up if someone else wants to review it

@orbeckst
Copy link
Member

I think you need to rebase the branch against develop because it still contains "stupid merge commits" #1112 (comment)

@richardjgowers
Copy link
Member

I think if you just hit the magic squash and merge button it'll make it all into a single @dotsdl commit which is fine

@orbeckst
Copy link
Member

So this is the "minimal" groupby without eps#1112 (comment) ; maybe we consider it as "experimental" (we also do not have more extensive docs yet #1112 (comment) ) and then we improve as we go along?

@orbeckst
Copy link
Member

@richardjgowers or @kain88-de I leave it to you to squash (or do any last minute fixes if needed).

@kain88-de
Copy link
Member

@richardjgowers You can decide. I don't have time now to review this.

@richardjgowers richardjgowers merged commit 59af18f into develop Jan 12, 2017
@richardjgowers richardjgowers deleted the feature-groupby branch January 12, 2017 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants