-
Notifications
You must be signed in to change notification settings - Fork 658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensemble container for Universes #2664
Conversation
Codecov Report
@@ Coverage Diff @@
## develop MDAnalysis/mdanalysis#2664 +/- ##
===========================================
- Coverage 93.17% 87.44% -5.73%
===========================================
Files 171 166 -5
Lines 22735 21461 -1274
Branches 3216 0 -3216
===========================================
- Hits 21183 18767 -2416
- Misses 1504 2694 +1190
+ Partials 48 0 -48
Continue to review full report at Codecov.
|
It is an interesting concept, especially in term of semantics. It could lead to very exciting things about error estimates in analyses when the tools know what replicas are. I find the interface confusing, though. Indexing an ensemble, for instance, can get you either a universe, a frame, or another ensemble. This is a surprising behaviour as most collections will return a single element, or a new collection of that element. I find this even more confusing because the What the ideal interface should look like is not obvious to me. I tried suggesting several things and was not fully convinced by any of them. The best I could come up is a combination of the chain reader and on-the-fly transformations to normalise the selections. I guess seeing how the object helps you with encore would help figuring things out. |
Thank you @jbarnoud for your thoughts!
I thought of the frame/ensemble duality similarly to the Atom/Group duality: AtomGroup[:1] returns an AtomGroup of length 1, AtomGroup[[0]] returns an AtomGroup, but AtomGroup[0] returns an Atom. So indexing should only give the item (frame), but slicing should yield the container (in this case, an ensemble). Getting universe by name is more of a convenience thing. It's not possible to index an array with a string, so I didn't think it would be too strange to add it in. However, it could be more explicit: perhaps
Yes -- it's functionally closest to a Trajectory, but to be useful in existing analysis classes it needed atom selection functionality. And most (all?) analysis classes pass
Mostly I wanted to run encore on a collection of Universes. Currently It also makes a task like visualising the centroid of a cluster essentially a single step -- instead of working out which trajectory it came from, you could *It would be nice to revamp |
Hello @lilyminium! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-01-17 01:12:45 UTC |
ad0a7f0
to
47561de
Compare
@jbarnoud I refactored encore into The main idea is to be able to plug and play an Ensemble into e.g. PCA or diffusionmap.DistanceMatrix without having to make multiple new Universes, which can be heavy on the memory and can lead to issues such as missing
The above still holds. Apologies for mixing the encore refactor into this, I can split into two different PRs. It just relied on Ensemble features. A comparison of timings and examples of usage can be found here: https://mdanalysisuserguide--132.org.readthedocs.build/en/132/examples/analysis/trajectory_similarity/encore.html. The Edit: the failing test is |
@mnmelo as you were interested in overloading the ChainReader for ensemble analysis, have a look at this PR for an alternative idea. |
This is currently on hold because it's a new feature that no one asked for and does not solve any current problems, and as @jbarnoud pointed out it suffers from a lack of clarity as to what an ensemble should do. I can put it back on the to-do list if people think it should be in the core library (as opposed to an encore mdakit, for example). |
I can have a look, but can already say that my idea leveraged what the ChainReader already does, ensemble-wise, rather than creating a new collection object. |
So, from what I understand, this idea, when compared to simply using a ChainReader, allows the user to
Also added is the capability to perform analysis runs over arbitrary sequences of frames, which could be considered as an addition on its own (on this aspect, I'd probably overload the Did I miss any other major features? The downsides I see to this approach are:
If you still think this is a relevant addition, perhaps you could exemplify how this could benefit, say, an RMSD analysis? That way it'd be clearer how this would be advantageous over an overloading of ChainReader (I can already see that one of the advantages is use of incompatible Universes/topologies, but again, that likely requires analysis adaptation). |
I suppose it would save you a little bit of work for RMSD from a single reference structure for different replicas? import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
from MDAnalysis.analysis.rms import RMSD
u1 = mda.Universe(PSF, DCD)
u2 = mda.Universe(PSF, DCD2)
u3 = mda.Universe(PSF, DCD)
# ensemble way
ens = mda.Ensemble([u1, u2, u3])
rmsd = RMSD(ens, u3, select="name CA").run(step=2)
## Create sub-ensemble just for convenience, could have also passed that into the RMSD analysis
ens2 = ens[::2]
## gives a list of arrays, one array per universe
rmsd_arrays_per_replica = list(ens2.split_array(rmsd.results.rmsd))
# chainreader way
u = mda.Universe(PSF, [DCD, DCD2, DCD])
rmsd = RMSD(u, u3, select="name CA").run(step=2)
rmsd_arrays_per_replica = []
current_replica = []
current_filename = u.trajectory.filename
for frame, rmsd_value in zip(rmsd.results.frames, rmsd.results.rmsd):
u.trajectory[frame]
if u.trajectory.filename != current_filename:
rmsd_arrays_per_replica.append(current_replica)
current_replica = []
current_replica.append(rmsd_value) Under the hood, the ensemble (Edit: the could be turned into a split_results function with the new Results object.) def split_array(self, arr):
"""Convenience function to split arrays of data by Universe
into an iterable of data arrays"""
for i in range(self.n_universes):
ix = np.where(self._universe_frames[:, 0] == i)[0]
yield arr[ix[0]:ix[-1]+1] What I frequently like doing is pairwise RMSD comparisons between many different trajectories. The However, all of these examples are contrived, because in reality it is unusual to work with replicas with different numbers of atoms and different numbers of frames. I suppose I got a little tired of writing the same one-liner over and over to figure out which actual trajectory and which actual frame correlated with a particular cluster index in the combined cluster analysis.
This is a little off-topic but I think that would be very confusing to users, and I would ask what |
Thanks for the use-case @lilyminium. I don't think your example is contrived at all. I sometimes also do analyses in a similar spirit (also with incompatible numbers of atoms). Begin Devil's advocacy I think, however, that it's not necessarily easy to generalize the ensemble/replica case. What if you'd wanted each trajectory RMSD'ed to its own reference universe? (say, There's also a simpler alternative to ensembles and u1 = mda.Universe(PSF, DCD)
u2 = mda.Universe(PSF, DCD2)
u3 = mda.Universe(PSF, DCD)
rmsd_arrays_per_replica = []
for u in [u1, u2, u3]:
rmsd = RMSD(u, u3, select="name CA").run(step=2)
rmsd_arrays_per_replica.append(rmsd.results.rmsd) What I'm trying to highlight is that there are many possibilities of ensemble analyses, each perhaps too specialized to permit/warrant a common API. I can't tell if there's a specific type more common than others for which it'd be worth investing in API specialization. End Devil's advocacy That said, I can see myself using an
As to overloading |
@mnmelo thank you for the devil's advocacy, it is easy to overcomplicate things :p In reply to that, I should mention the original use case I had for
I would hope there wouldn't be much AnalysisBase adaptation at all -- what's in this PR is what I thought was necessary for the existing analyses to work. Primarily setting each frame individually in
At that stage it might be easier to just adapt |
Ok, I see that Analysis adaptation is not very extensive, but analysis creators still have to bear in mind quirks such as using Also, if this goes ahead, the time to implement this is now, so we can advertise this new API requirement without breaking people's analyses. |
Please don't scare me like this @mnmelo :P Can we make this is 2.1.0+ please? |
Ok, but if we know we're gonna change the API soon, at least make some obvious warnings from 2.0.0. |
@mnmelo I'd think small changes that we make to our own classes (like |
That's an ok compromise, I guess: analyses don't have to be Ensemble compatible. If it catches on, we then get to be stricter in the API definition. |
Closing as stale, feel free to re-open if you want to continue. |
Fixes #2663
Changes made in this Pull Request:
I wrote most of this a while ago to make
encore
analysis more tractable, but thought it might be useful in general. TheEnsemble
class is designed to:select_atoms
method only, such thatEnsemble.atoms.positions
contains the positions of the selected atoms at the current time frame, when iterating over combined trajectory.mda.Merge
and having to load the entire trajectory into memory to select a subset of coordinates, as the current ENCORE analysis doesPR Checklist