groupby method for GroupBase #1112

dotsdl · 2016-12-06T00:50:00Z

Addresses #1105

Changes made in this Pull Request:

Added a general-purpose groupby method to GroupBase that allows familiar groupby operations on a single topology attribute, à la pandas or datreant.

e.g.

import MDAnalysis as mda

u = mda.fetch_mmtf('1nmu')

# exclude all HETATM entries
protein = u.select_atoms('protein')

# make a selection across chains
sel = protein.select_atoms('(segid A and resid 10-50) or (segid C and resid 51-100)')
for seg in sel.segments:
    print(len(seg.residues.resids), seg.residues.resids[:10])

and then we can do:

>>> sel.residues.groupby('segids')
{u'A': <ResidueGroup with 40 residues>, u'C': <ResidueGroup with 49 residues>}

>>> sel.residues.groupby('resnames')
{'ALA': <ResidueGroup with 8 residues>,
 'ARG': <ResidueGroup with 2 residues>,
 'ASN': <ResidueGroup with 2 residues>,
 'ASP': <ResidueGroup with 9 residues>,
 'GLN': <ResidueGroup with 3 residues>,
 'GLU': <ResidueGroup with 6 residues>,
 'GLY': <ResidueGroup with 10 residues>,
 'HIS': <ResidueGroup with 2 residues>,
 'ILE': <ResidueGroup with 5 residues>,
 'LEU': <ResidueGroup with 5 residues>,
 'LYS': <ResidueGroup with 9 residues>,
 'PHE': <ResidueGroup with 6 residues>,
 'PRO': <ResidueGroup with 5 residues>,
 'SER': <ResidueGroup with 1 residues>,
 'THR': <ResidueGroup with 5 residues>,
 'TRP': <ResidueGroup with 3 residues>,
 'TYR': <ResidueGroup with 4 residues>,
 'VAL': <ResidueGroup with 4 residues>}

>>> sel.groupby('masses')
{12.010999999999999: <AtomGroup with 462 atoms>,
 14.007: <AtomGroup with 116 atoms>,
 15.999000000000001: <AtomGroup with 134 atoms>}

Before merging, it would be really nice if multiple attributes could be given, so that groupings can be done on combinations of values.
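Grouping on combinations of values could be sketched roughly as below. This is a plain-Python illustration, not the MDAnalysis implementation: `groupby_multi`, the `residues` list and its dict keys are hypothetical stand-ins for a Group and its topology attributes. The idea is only that each combination of values becomes a tuple key.

```python
from collections import defaultdict


def groupby_multi(records, attrs):
    """Group records by the combination of values of several attributes.

    Hypothetical helper: `records` is a list of dicts standing in for the
    components of a Group; `attrs` names the attributes to combine.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in attrs)  # one key per value combination
        groups[key].append(rec)
    return dict(groups)


# Toy data standing in for residues with segid/resname attributes.
residues = [
    {"resid": 10, "segid": "A", "resname": "ALA"},
    {"resid": 11, "segid": "A", "resname": "GLY"},
    {"resid": 51, "segid": "C", "resname": "ALA"},
    {"resid": 52, "segid": "C", "resname": "ALA"},
]

by_seg_and_name = groupby_multi(residues, ["segid", "resname"])
```

Here `by_seg_and_name` has the keys `("A", "ALA")`, `("A", "GLY")` and `("C", "ALA")`, the last holding two residues.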

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

dotsdl · 2016-12-06T00:50:50Z

Not sure why my branch has stupid merge commits...hmmm.

richardjgowers

Looks good. Will need some tests for different data types (object, int, float). There's MDAnalysisTests.core.groupbase.make_Universe to mock a Universe with arbitrary attributes.

With floats, the comparison between them might be a little fragile (esp. for something like derived quantities), so you could use numpy.isclose rather than ==.

Once/if we have NaN for missing values, we'll need to handle that here too, since NaN != NaN; maybe add a comment about that so we don't forget.
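To illustrate the two concerns (fragile float equality and NaN != NaN), here is a minimal sketch in plain Python. It assumes `math.isclose` as a standard-library stand-in for `numpy.isclose`; `groupby_close` and its tolerances are hypothetical and are not part of this PR.

```python
import math


def groupby_close(values, rel_tol=1e-9, abs_tol=1e-8):
    """Group indices of `values` whose entries are close to each other.

    Hypothetical helper. NaN entries are collected separately, because
    NaN != NaN and they would otherwise never match any group.
    """
    groups = {}       # representative value -> list of indices
    nan_indices = []
    for i, v in enumerate(values):
        if math.isnan(v):
            nan_indices.append(i)
            continue
        for rep in groups:
            if math.isclose(v, rep, rel_tol=rel_tol, abs_tol=abs_tol):
                groups[rep].append(i)
                break
        else:
            # No existing representative is close: start a new group.
            groups[v] = [i]
    return groups, nan_indices


# 0.1 + 0.2 != 0.3 under ==, and the fourth mass differs by rounding noise.
masses = [12.011, 0.1 + 0.2, 0.3, 12.011 + 1e-12, float("nan")]
groups, nan_indices = groupby_close(masses)
```

With exact `==` the four finite values would land in four separate groups; with the tolerance they collapse into two, and the NaN entry is reported on its own. Closeness is not transitive, so matching against the first representative is a simplification.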

richardjgowers · 2016-12-06T10:36:44Z

package/MDAnalysis/core/groups.py

@@ -816,6 +816,24 @@ def wrap(self, compound="atoms", center="com", box=None):
            if not all(s == 0.0):
                o.atoms.translate(s)

+    def groupby(self, topattr):
+        """Obtain groupings of the components of this Group according the values


This should be a one liner, so something like 'group according to a given attribute'

richardjgowers · 2016-12-06T10:39:31Z

package/MDAnalysis/core/groups.py

+        ----------
+        topattr: str
+           Topology attribute to group components by.
+


Add an example to the docstring too (just what you put in the PR would do)

richardjgowers · 2016-12-06T10:41:08Z

package/MDAnalysis/core/groups.py

+            Unique values of the topology attribute as keys, Groups as values.
+
+        """
+        return {i: self[self.__getattribute__(topattr) == i] for i in 


Because we're fooling around with __getattr__ methods in our classes, it might be safer to use getattr(self, topattr) which is more like calling self.topattr (rather than directly going to the class method). I'm not sure if calling the class method directly could bypass something
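The difference can be shown with a toy class (hypothetical, not the MDAnalysis Group machinery): `getattr` falls back to `__getattr__` when normal lookup fails, whereas calling `__getattribute__` directly does not.

```python
class Group:
    """Toy class that resolves unknown attributes through __getattr__."""

    def __init__(self):
        self._attrs = {"masses": [12.011, 14.007]}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self._attrs[name]
        except KeyError:
            raise AttributeError(name)


g = Group()

# getattr(g, name) behaves like g.name: it falls back to __getattr__.
via_getattr = getattr(g, "masses")

# Calling __getattribute__ directly skips the __getattr__ fallback.
try:
    g.__getattribute__("masses")
    bypassed = False
except AttributeError:
    bypassed = True
```

In this toy case `via_getattr` is the masses list, while the direct `__getattribute__` call raises `AttributeError`, so `bypassed` is `True`.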

kain88-de · 2016-12-06T22:11:13Z

How come sel.residues.groupby('segid') works now? I thought from @richardjgowers' earlier comments that residues here has no idea about the original selection, see #1105 (comment).

Otherwise the groupby is a nice idea. It seems to allow a lot of sub-selections on AtomGroups; selecting residues could then be done with just a groupby, `ag.groupby('residues')`.

kain88-de · 2016-12-06T22:12:28Z

For floating point fields like masses we should allow an eps kwarg to define an epsilon within which we regard two values as close. As a default I'd suggest 1e-8.

kain88-de · 2016-12-06T22:15:43Z

Where should general docs for this go? I would say they belong to selections. There we can also add some tips about common misconceptions of the segment and residues selections on the python objects (not using a select_atoms)

orbeckst · 2016-12-07T05:06:51Z

For floating point fields like masses we should allow an eps kwarg to define an epsilon within which we regard two values as close. As a default I'd suggest 1e-8.

Shouldn't this depend on the precision of the data type? You could default to maybe 10 * machine_precision. numpy.finfo().eps might be helpful.

kain88-de · 2016-12-07T08:45:31Z

We could do that. But then you also have to compare the types and choose the epsilon of the least accurate type. Then there is also the issue that we should actually use the accuracy of the file format the value originally comes from. For example, the occupancy entries in the PDB are only accurate up to 2 decimal places. I thought having a single eps argument defaulting to 1e-8 (float 32bit) would be easier and be just as good 90% of the time.

richardjgowers · 2016-12-07T10:42:51Z

  R1   R2
 / \  /  \
A1 A2 A3 A4
   \  /
    AG

@kain88-de if you have a system with 4 atoms, 2 residues of size 2, and an AtomGroup with atoms 2&3 (see picture), then you can go upwards to Residues and not leave the scope of the AtomGroup, but if you come downwards from Residues then you will leave the AtomGroup.

So ag.residues.groupby('segids') is looking upwards, whereas ag.segments.residues is looking downwards.

Or, another way of thinking about it: the memory of the original selection is kept going upwards because there are only many-to-one mappings, whereas downwards there are one-to-many mappings.
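The diagram can be mimicked with a plain dict to make the asymmetry concrete. This is only a toy model of the four-atom system described above; `residues_of` and `atoms_of` are hypothetical helpers, not MDAnalysis calls.

```python
# Topology from the diagram: each atom belongs to exactly one residue.
atom_to_residue = {"A1": "R1", "A2": "R1", "A3": "R2", "A4": "R2"}


def residues_of(atoms):
    """Go upwards: many-to-one, every atom maps to a single residue."""
    return sorted({atom_to_residue[a] for a in atoms})


def atoms_of(residues):
    """Go downwards: one-to-many, each residue expands to all its atoms."""
    return sorted(a for a, r in atom_to_residue.items() if r in residues)


ag = ["A2", "A3"]             # the AtomGroup from the diagram
up = residues_of(ag)          # ['R1', 'R2']
down_again = atoms_of(up)     # ['A1', 'A2', 'A3', 'A4']
```

Going up from the AtomGroup yields R1 and R2; coming back down from those residues returns all four atoms, which is how the original selection gets lost.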

orbeckst · 2016-12-08T00:31:59Z

#1112 (comment)

For example, the occupancy entries in the PDB are only accurate up to 2 decimal places. I thought having a single eps argument defaulting to 1e-8 (float 32bit) would be easier and be just as good 90% of the time.

I concede the point – let the user determine the level of equivalence. For XTC coordinates, 1e-3...

Note that float32 epsilon is

In [4]: numpy.finfo(dtype=numpy.float32).eps
Out[4]: 1.1920929e-07

In [6]: numpy.finfo(dtype=numpy.float32).precision
Out[6]: 6

so I'd be a bit more generous with the default, say 1e-07 or even 1e-06.
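A standard-library sketch of why 1e-8 is tight for float32 data. It assumes a plain absolute-difference comparison, which is an assumption about how an eps kwarg would be used; `numpy.isclose`, for instance, also applies a relative tolerance. `to_float32` is a hypothetical helper that round-trips a value through 32-bit storage.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Spacing between 1.0 and the next float32, i.e. float32 machine epsilon.
float32_eps = 2.0 ** -23

carbon_mass = 12.011
stored = to_float32(carbon_mass)      # what a float32 array would hold
abs_error = abs(stored - carbon_mass)

# Absolute-difference checks against two candidate defaults.
matches_at_1e8 = abs_error <= 1e-8
matches_at_1e6 = abs_error <= 1e-6
```

For this mass the float32 rounding error is a few times 1e-7, so the value fails an absolute 1e-8 check but passes at 1e-6.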

orbeckst · 2016-12-08T00:45:30Z

Regarding docs #1112 (comment):

Where should general docs for this go? I would say they belong to selections. There we can also add some tips about common misconceptions of the segment and residues selections on the python objects (not using a select_atoms)

The 0.15.0 docs had a pretty sizable narrative on using and working with the different levels in the container hierarchy under Fundamental building blocks — MDAnalysis.core.AtomGroup, which seems to have been removed from the 0.16.0 docs during the topology reboot. It might be worthwhile putting that back in a separate reST file in the sphinx doc near the beginning (not under the new Core objects: Containers — MDAnalysis.core.groups, which can stay lean and API-focused). This would then also be a good place for some of the docs discussed here.

richardjgowers · 2016-12-14T10:31:33Z

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/groups.py#L1250

I noticed we have an AG.split method which is just a specialised groupby. Might be a good idea to make split use groupby to avoid some repetition.

kain88-de · 2016-12-22T15:52:27Z

@dotsdl & @richardjgowers what still needs to be done here?

richardjgowers · 2016-12-22T15:54:03Z

Tests!

dotsdl · 2016-12-22T17:56:55Z

Sorry all, been traveling for the holiday. It's everything at once these days I'm afraid. :/

richardjgowers · 2017-01-10T20:43:46Z

Ok I've finished this up if someone else wants to review it

orbeckst · 2017-01-11T17:52:36Z

I think you need to rebase the branch against develop because it still contains "stupid merge commits" #1112 (comment)

richardjgowers · 2017-01-11T17:55:20Z

I think if you just hit the magic squash and merge button it'll make it all into a single @dotsdl commit which is fine

orbeckst · 2017-01-11T17:58:31Z

So this is the "minimal" groupby without eps – #1112 (comment) ; maybe we consider it as "experimental" (we also do not have more extensive docs yet #1112 (comment) ) and then we improve as we go along?

orbeckst · 2017-01-11T17:59:38Z

@richardjgowers or @kain88-de I leave it to you to squash (or do any last minute fixes if needed).

kain88-de · 2017-01-11T20:54:16Z

@richardjgowers You can decide. I don't have time now to review this.

dotsdl mentioned this pull request Dec 6, 2016

segments object isn't properly documented. #1105

Closed

richardjgowers requested changes Dec 6, 2016

richardjgowers self-assigned this Dec 6, 2016

orbeckst added Work in progress Component-Core labels Dec 8, 2016

richardjgowers added this to the 0.16.0 milestone Dec 19, 2016

kain88-de removed this from the 0.16.0 milestone Jan 2, 2017

richardjgowers added this to the 0.16.0 milestone Jan 10, 2017

richardjgowers removed the Work in progress label Jan 10, 2017

richardjgowers added a commit that referenced this pull request Jan 10, 2017

CHANGELOG for #1112

435ba94

richardjgowers changed the title ~~WIP: groupby method for GroupBase~~ groupby method for GroupBase Jan 10, 2017

richardjgowers approved these changes Jan 10, 2017

dotsdl and others added 4 commits January 12, 2017 12:46

Added prototype single-attribute groupby method for Groups.

a767900

Updated docs of Group.groupby

d493ea8

reduced number of getattr calls

Added tests for groupby

c7b8796

CHANGELOG for #1112

3deb8bd

richardjgowers force-pushed the feature-groupby branch from 435ba94 to 3deb8bd Compare January 12, 2017 12:47

richardjgowers merged commit 59af18f into develop Jan 12, 2017

richardjgowers deleted the feature-groupby branch January 12, 2017 12:48
