Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensemble container for Universes #2664

Closed
wants to merge 7 commits into from

Conversation

lilyminium
Copy link
Member

Fixes #2663

Changes made in this Pull Request:

  • added an Ensemble class to hold Universes

I wrote most of this a while ago to make encore analysis more tractable, but thought it might be useful in general. The Ensemble class is designed to:

  • hold and track the frames of separate Universes
  • enable easy atom selection with the select_atoms method only, such that Ensemble.atoms.positions contains the positions of the selected atoms at the current time frame, when iterating over combined trajectory.
    • This is alternative to creating a new Universe with mda.Merge and having to load the entire trajectory into memory to select a subset of coordinates, as the current ENCORE analysis does
  • easily get the relative frame (e.g. 2nd frame of 3rd universe) from an absolute frame (e.g. frame 100 of the combined trajectory), which is helpful for e.g. identifying which actual frame is the centroid of a cluster
  • mimic a trajectory to the extent that it can be used in AnalysisBase. It mostly behaves like a FrameIterator, I think -- you can slice it and pass boolean masks to select particular frames. Indexing frames is relative to the Ensemble rather than absolute trajectories.
>>> import MDAnalysis as mda
>>> from MDAnalysis.tests.datafiles import PSF, DCD, PDB, TPR, XTC
>>> u1 = mda.Universe(PSF, DCD)
>>> len(u1.atoms)
3341
>>> u2 = mda.Universe(PDB)
>>> len(u2.atoms)
47681
>>> u3 = mda.Universe(TPR, XTC)
>>> len(u3.atoms)
47681
>>> ens = mda.Ensemble([u1, u2, u3], select='name CA')
>>> len(ens.atoms)
214
>>> len(ens.trajectory)  # 98 + 1 + 10 frames
109
>>> u3.trajectory.frame
0
>>> ens[-1]
< Timestep 9 with unit cell dimensions [80.08523 80.08523 80.08523 60.      60.      90.     ] >
>>> u3.trajectory.frame  # changes persist
9
>>> ens2 = ens.select_atoms('resid 1:50')  # select subset
>>> len(ens2.atoms)
50
>>> len(ens2.trajectory)
109
>>> ens3 = ens2[[0, 3, 2, -19]]  # select whatever frames
>>> len(ens3.trajectory)
4

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@codecov
Copy link

codecov bot commented Apr 20, 2020

Codecov Report

Merging #2664 (230d748) into develop (87e71bb) will decrease coverage by 5.72%.
The diff coverage is 84.74%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    MDAnalysis/mdanalysis#2664      +/-   ##
===========================================
- Coverage    93.17%   87.44%   -5.73%     
===========================================
  Files          171      166       -5     
  Lines        22735    21461    -1274     
  Branches      3216        0    -3216     
===========================================
- Hits         21183    18767    -2416     
- Misses        1504     2694    +1190     
+ Partials        48        0      -48     
Impacted Files Coverage Δ
...sis/analysis/encore/clustering/ClusteringMethod.py 96.92% <ø> (ø)
package/MDAnalysis/analysis/encore/covariance.py 87.69% <ø> (ø)
package/MDAnalysis/analysis/clustering/clusters.py 65.00% <65.00%> (ø)
package/MDAnalysis/analysis/reencore/utils.py 82.66% <82.66%> (ø)
package/MDAnalysis/core/ensemble.py 84.95% <84.95%> (ø)
package/MDAnalysis/analysis/reencore/similarity.py 85.71% <85.71%> (ø)
package/MDAnalysis/analysis/diffusionmap.py 95.08% <90.00%> (-1.47%) ⬇️
package/MDAnalysis/__init__.py 92.10% <100.00%> (+0.21%) ⬆️
package/MDAnalysis/analysis/align.py 92.08% <100.00%> (-5.76%) ⬇️
package/MDAnalysis/analysis/base.py 95.65% <100.00%> (+0.19%) ⬆️
... and 78 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87e71bb...230d748. Read the comment docs.

@jbarnoud
Copy link
Contributor

It is an interesting concept, especially in term of semantics. It could lead to very exciting things about error estimates in analyses when the tools know what replicas are.

I find the interface confusing, though. Indexing an ensemble, for instance, can get you either a universe, a frame, or another ensemble. This is a surprising behaviour as most collections will return a single element, or a new collection of that element. I find this even more confusing because the Ensemble object tries to mimic a Universe, a Trajectory, and an AtomGroup, so it cannot fulfil the all the expectations.

What the ideal interface should look like is not obvious to me. I tried suggesting several things and was not fully convinced by any of them. The best I could come up is a combination of the chain reader and on-the-fly transformations to normalise the selections. I guess seeing how the object helps you with encore would help figuring things out.

@lilyminium
Copy link
Member Author

Thank you @jbarnoud for your thoughts!

Indexing an ensemble, for instance, can get you either a universe, a frame, or another ensemble.

I thought of the frame/ensemble duality similarly to the Atom/Group duality: AtomGroup[:1] returns an AtomGroup of length 1, AtomGroup[[0]] returns an AtomGroup, but AtomGroup[0] returns an Atom. So indexing should only give the item (frame), but slicing should yield the container (in this case, an ensemble).

Getting universe by name is more of a convenience thing. It's not possible to index an array with a string, so I didn't think it would be too strange to add it in. However, it could be more explicit: perhaps Ensemble.universe_by_name["universe_name"]?

the Ensemble object tries to mimic a Universe, a Trajectory, and an AtomGroup, so it cannot fulfil the all the expectations.

Yes -- it's functionally closest to a Trajectory, but to be useful in existing analysis classes it needed atom selection functionality. And most (all?) analysis classes pass u.trajectory instead of asking for the trajectory directly. It doesn't pretend to be a Universe/AtomGroup to users, just the analysis API, if that makes sense -- I envisioned people interacting with it just in an indexing/slicing way.

The best I could come up is a combination of the chain reader and on-the-fly transformations to normalise the selections. I guess seeing how the object helps you with encore would help figuring things out.

Mostly I wanted to run encore on a collection of Universes. Currently encore transfers them all to memory, presumably because it merges them all into the one Universe (this also breaks if any of the Universes have different topologies, because it merges into a single Universe before applying the atom selection). Limited memory meant I could not use encore as is. * If I therefore use external tools (e.g. cpptraj) and want to load in my pairwise RMSD matrices or PCA projections or cluster indices, the Ensemble class can tell encore which frames belong to one trajectory, and then it can proceed.

It also makes a task like visualising the centroid of a cluster essentially a single step -- instead of working out which trajectory it came from, you could nglview.show_mdanalysis(ensemble[cluster.centroid_index]). In addition, if we create a cluster class that contains indices, you could run further analysis on that selection of frames by creating cluster_ens = ensemble[cluster_1.indices]. Currently Analysis classes can't take FrameIterators or arbitrary frames, so this could be helpful, especially for aggregate methods such as AverageStructure.

*It would be nice to revamp encore to not require Universes at all, but simply be a collection of lightweight functions that operate on a matrix or dataframe of values. I do already have ces and dres functions that operate on a dataframe of labelled subspace projections / cluster indices and treat outliers as distinct clusters.

@pep8speaks
Copy link

pep8speaks commented Jan 5, 2021

Hello @lilyminium! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 207:1: E402 module level import not at top of file

Line 116:80: E501 line too long (87 > 79 characters)
Line 189:80: E501 line too long (84 > 79 characters)

Line 80:33: E128 continuation line under-indented for visual indent
Line 81:33: E128 continuation line under-indented for visual indent

Line 412:80: E501 line too long (83 > 79 characters)
Line 413:34: E128 continuation line under-indented for visual indent
Line 413:80: E501 line too long (107 > 79 characters)
Line 414:80: E501 line too long (92 > 79 characters)
Line 415:34: E128 continuation line under-indented for visual indent
Line 415:80: E501 line too long (109 > 79 characters)
Line 416:34: E128 continuation line under-indented for visual indent
Line 417:34: E128 continuation line under-indented for visual indent
Line 434:80: E501 line too long (82 > 79 characters)

Line 29:80: E501 line too long (82 > 79 characters)
Line 30:80: E501 line too long (100 > 79 characters)
Line 177:37: E128 continuation line under-indented for visual indent

Comment last updated at 2021-01-17 01:12:45 UTC

@lilyminium lilyminium force-pushed the ensemble branch 3 times, most recently from ad0a7f0 to 47561de Compare January 17, 2021 00:34
@lilyminium
Copy link
Member Author

lilyminium commented Jan 17, 2021

@jbarnoud I refactored encore into reencore. I hope some of the changes I made to the code here demonstrate how I wanted to make use of the Ensemble class. For example, being able to pass an arbitrary array of frame indices to analyse (which does not necessarily need to use an Ensemble, as I made changes to Universe also).

The main idea is to be able to plug and play an Ensemble into e.g. PCA or diffusionmap.DistanceMatrix without having to make multiple new Universes, which can be heavy on the memory and can lead to issues such as missing .timeseries methods or .filename attributes. (see: MDAnalysis/mdaencore#30, #2948, #2578, MDAnalysis/mdaencore#32). Moreover, the way encore is currently structured, it first merges the Universes and then applies select_atoms -- so I cannot, for example, analyse an ensemble of trajectories of proteins if one of the trajectories has water, due to mismatching numbers of atoms.

If I therefore use external tools (e.g. cpptraj) and want to load in my pairwise RMSD matrices or PCA projections or cluster indices, the Ensemble class can tell encore which frames belong to one trajectory, and then it can proceed.

The above still holds.

Apologies for mixing the encore refactor into this, I can split into two different PRs. It just relied on Ensemble features.

A comparison of timings and examples of usage can be found here: https://mdanalysisuserguide--132.org.readthedocs.build/en/132/examples/analysis/trajectory_similarity/encore.html. The ces and dres functions especially are considerably faster without merging into all the new Universes.

Edit: the failing test is test_written_remarks_property, it seems to be unrelated?

@orbeckst
Copy link
Member

@mnmelo as you were interested in overloading the ChainReader for ensemble analysis, have a look at this PR for an alternative idea.

@lilyminium
Copy link
Member Author

This is currently on hold because it's a new feature that no one asked for and does not solve any current problems, and as @jbarnoud pointed out it suffers from a lack of clarity as to what an ensemble should do. I can put it back on the to-do list if people think it should be in the core library (as opposed to an encore mdakit, for example).

@mnmelo
Copy link
Member

mnmelo commented May 30, 2021

I can have a look, but can already say that my idea leveraged what the ChainReader already does, ensemble-wise, rather than creating a new collection object.

@mnmelo
Copy link
Member

mnmelo commented May 30, 2021

So, from what I understand, this idea, when compared to simply using a ChainReader, allows the user to

  1. combine trajectories/Universes under the same collection, even if they come from incompatible topologies, and treat them as a single trajectory;
  2. explicitly break the reader abstraction and access individual trajectories/Universes when needed.

Also added is the capability to perform analysis runs over arbitrary sequences of frames, which could be considered as an addition on its own (on this aspect, I'd probably overload the start argument to take lists, rather than introducing a frames one, but I think this could be its own PR and discussed there).

Did I miss any other major features?

The downsides I see to this approach are:

  1. overall extensibility for use by other analyses (for replica considerations, for instance) probably requires their adaptation;
  2. Since the previous point might stymie adoption, Ensemble could end up becoming little-used API bloat.

If you still think this is a relevant addition, perhaps you could exemplify how this could benefit, say, an RMSD analysis? That way it'd be clearer how this would be advantageous over an overloading of ChainReader (I can already see that one of the advantages is use of incompatible Universes/topologies, but again, that likely requires analysis adaptation).

@lilyminium
Copy link
Member Author

lilyminium commented May 30, 2021

I suppose it would save you a little bit of work for RMSD from a single reference structure for different replicas?

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
from MDAnalysis.analysis.rms import RMSD

u1 = mda.Universe(PSF, DCD)
u2 = mda.Universe(PSF, DCD2)
u3 = mda.Universe(PSF, DCD)

# ensemble way
ens = mda.Ensemble([u1, u2, u3])
rmsd = RMSD(ens, u3, select="name CA").run(step=2)
## Create sub-ensemble just for convenience, could have also passed that into the RMSD analysis
ens2 = ens[::2]
## gives a list of arrays, one array per universe
rmsd_arrays_per_replica = list(ens2.split_array(rmsd.results.rmsd))

# chainreader way
u = mda.Universe(PSF, [DCD, DCD2, DCD])
rmsd = RMSD(u, u3, select="name CA").run(step=2)
rmsd_arrays_per_replica = []
current_replica = []
current_filename = u.trajectory.filename
for frame, rmsd_value in zip(rmsd.results.frames, rmsd.results.rmsd):
  u.trajectory[frame]
  if u.trajectory.filename != current_filename:
    rmsd_arrays_per_replica.append(current_replica)
    current_replica = []
  current_replica.append(rmsd_value)

Under the hood, the ensemble split_array function just matches the universe indices:

(Edit: the could be turned into a split_results function with the new Results object.)

    def split_array(self, arr):
        """Convenience function to split arrays of data by Universe
        into an iterable of data arrays"""
        for i in range(self.n_universes):
            ix = np.where(self._universe_frames[:, 0] == i)[0]
            yield arr[ix[0]:ix[-1]+1]

What I frequently like doing is pairwise RMSD comparisons between many different trajectories. The split_array function above is pretty convenient for chunking a huge square matrix into many smaller square trajectory-to-trajectory matrices.

However, all of these examples are contrived, because in reality it is unusual to work with replicas with different numbers of atoms and different numbers of frames. I suppose I got a little tired of writing the same one-liner over and over to figure out which actual trajectory and which actual frame correlated with a particular cluster index in the combined cluster analysis.

(on this aspect, I'd probably overload the start argument to take lists, rather than introducing a frames one, but I think this could be its own PR and discussed there).

This is a little off-topic but I think that would be very confusing to users, and I would ask what start=[0, 1, 2, 3], stop=10, step=2 would give me.

@mnmelo
Copy link
Member

mnmelo commented Jun 1, 2021

Thanks for the use-case @lilyminium. I don't think your example is contrived at all. I sometimes also do analyses in a similar spirit (also with incompatible numbers of atoms).


Begin Devil's advocacy

I think, however, that it's not necessarily easy to generalize the ensemble/replica case. What if you'd wanted each trajectory RMSD'ed to its own reference universe? (say, u1 with u1_ref, etc.) You wouldn't be able to use a single call for the ensemble.

There's also a simpler alternative to ensembles and ChainReaders:

u1 = mda.Universe(PSF, DCD)
u2 = mda.Universe(PSF, DCD2)
u3 = mda.Universe(PSF, DCD)

rmsd_arrays_per_replica = []
for u in [u1, u2, u3]:
    rmsd = RMSD(u, u3, select="name CA").run(step=2)
    rmsd_arrays_per_replica.append(rmsd.results.rmsd)

What I'm trying to highlight is that there are many possibilities of ensemble analyses, each perhaps too specialized to permit/warrant a common API. I can't tell if there's a specific type more common than others for which it'd be worth investing in API specialization.

End Devil's advocacy


That said, I can see myself using an Ensemble, mostly for statistics, as @jbarnoud pointed out, and perhaps we'd be able to adapt AnalysisBase in a way that doesn't involve rewriting every analysis.

This is a little off-topic but I think that would be very confusing to users, and I would ask what start=[0, 1, 2, 3], stop=10, step=2 would give me.

As to overloading start, I meant it in a way analogous to numpy's indexing, where you either specify [start:stop:step] or [iterable] (no stop/step allowed in this case) or [matching_bool_array] (also no stop/step allowed in this case). EDIT: Or perhaps implement AnalysisBase.__getitem__ so we can fancy-index an analysis, thus producing another analysis with its own set of frames to run on.

@lilyminium
Copy link
Member Author

lilyminium commented Jun 1, 2021

@mnmelo thank you for the devil's advocacy, it is easy to overcomplicate things :p In reply to that, I should mention the original use case I had for ensemble, which has gotten a bit lot now: it was for "analyses that take an ensemble of trajectory data to be one dataset, but then I want to split up the analysis to look at individual trajectories". (So RMSD with multiple references would not fall into that category.) For example (and it was indeed originally created for encore), using principal component analysis to construct a shared subspace between multiple trajectories, but then inspecting each trajectory's projection individually to calculate the dimension reduction ensemble similarity.

perhaps we'd be able to adapt AnalysisBase in a way that doesn't involve rewriting every analysis.

I would hope there wouldn't be much AnalysisBase adaptation at all -- what's in this PR is what I thought was necessary for the existing analyses to work. Primarily setting each frame individually in run, instead of iterating over the trajectory.

As to overloading start, I meant it in a way analogous to numpy's indexing,

At that stage it might be easier to just adapt run to accept a trajectory FrameIterator?

@mnmelo
Copy link
Member

mnmelo commented Jun 1, 2021

Ok, I see that Analysis adaptation is not very extensive, but analysis creators still have to bear in mind quirks such as using select_atoms(...).atoms.

Also, if this goes ahead, the time to implement this is now, so we can advertise this new API requirement without breaking people's analyses.

@IAlibay
Copy link
Member

IAlibay commented Jun 1, 2021

Also, if this goes ahead, the time to implement this is now, so we can advertise this new API requirement without breaking people's analyses.

Please don't scare me like this @mnmelo :P Can we make this is 2.1.0+ please?

@mnmelo
Copy link
Member

mnmelo commented Jun 1, 2021

Ok, but if we know we're gonna change the API soon, at least make some obvious warnings from 2.0.0.

@lilyminium
Copy link
Member Author

@mnmelo I'd think small changes that we make to our own classes (like u.select_atoms().atoms) would be irrelevant to users, and if they have written their own analysis, it is still usable with a normal Universe -- we could just advertise that if you want to use Ensemble these are the things you should be aware of. It's a new feature rather than changing an old one so it'll be backwards compatible.

@mnmelo
Copy link
Member

mnmelo commented Jun 1, 2021

That's an ok compromise, I guess: analyses don't have to be Ensemble compatible. If it catches on, we then get to be stricter in the API definition.

@hmacdope
Copy link
Member

hmacdope commented Nov 8, 2023

Closing as stale, feel free to re-open if you want to continue.

@hmacdope hmacdope closed this Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensemble of Universes
7 participants