Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added new topology system post #41

Merged
merged 5 commits into from
Apr 3, 2017
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 108 additions & 0 deletions _posts/2017-04-03-new-topology-system.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
---
layout: post
title: A shiny, new and faster topology system
---

With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development.
In fall 2015, we (@richardjgowers and @dotsdl) set to work on redesigning the topology system from scratch.
This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (``Universe``, ``AtomGroup``, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems.
We hoped to resolve all of these issues with our new design.

The starting point of this work was (the now infamous) [issue 363](https://github.com/MDAnalysis/mdanalysis/issues/363), which floated the idea of holding all atom, residue, and segment attributes in arrays instead of lists of ``Atom``, ``Residue``, and ``Segment`` objects.
This approach turned the way topology data such as atom names, resids, masses, etc. are stored in a ``Universe`` on its head, going from an array of structs (list of ``Atom`` objects with individual attributes) to a struct of arrays (an array for each attribute, one entry per ``Atom``).

Now, over a year later, the finishing touches on this work are being prepared for release.
This post is meant to serve as a brief view to what has changed internally, what has changed externally, and what benefits this gives us looking forward to the future.

## Invisible changes to make working with MDAnalysis *faster*

Most of the changes are (or should be) invisible to the user. But they made some of the most fundamental operations in MDAnalysis quite a bit *faster*. Although this section is mostly of interest to developers, it is useful for all users to know the operations that MDAnalysis can now do much faster than before (and why).
In the new system, each atom is a member of exactly one residue, and each residue is a member of exactly one segment.
The new `Topology` object keeps an array giving the residue membership of each atom, and likewise an array giving segment membership of each residue.
Getting the resname of the residue of a group of atoms, then, is achieved by taking the indices of these atoms to [fancy-index](https://docs.scipy.org/doc/numpy/user/basics.indexing.html#index-arrays) the `Atoms->Residues` array, and then using the result of this to fancy-index the `Resnames` array.
For example, if the `Topology` has 5 atoms and 3 residues, with membership (`Atoms->Residues`) and `Resnames` arrays as below:

```
Atoms->Residues Resnames
index --------------- index --------
0 0 0 GLU
1 2 1 LYS
2 1 2 ALA
3 1
4 2
```

calling `AtomGroup.resnames` for an `AtomGroup` with atoms [2, 0, 1, 2] will yield (pseudocode):

```
"Atoms->Residues"[[2, 0, 1, 2]] --> [1, 0, 2, 1]
"Resnames"[[1, 0, 2, 1]] --> ['LYS', 'GLU', 'ALA', 'LYS']
```

This scheme only works if each atom is a member of one and only one residue, and likewise if residues are members of one and only one segment.
Furthermore, `AtomGroup`s, `ResidueGroup`s, and `SegmentGroup`s are very thin, storing only the indices of their members as a `numpy` array.
This gives a number of advantages:

1. **Performance**. We get up to an 8x speedup over the old scheme when accessing attributes. Setting attributes can give up to a 40x speedup.
2. **Memory**. We don't store, for example, a resname for each atom, but instead store attributes at the level they make sense for.
3. **Consistency**. Since attributes are stored in one place, we avoid cases where the topology is in an inconsistent state, e.g. two atoms in the same residue give a different resname.
4. **No staleness**. Because e.g. `ResidueGroup`s are only an array of indices, not a list of `Residue` objects generated upon creation of the group, changes of resiude-level properties by another `ResidueGroup` are always reflected consistently by every other one. Data is not duplicated anywhere in this scheme, and is all contained in the `Topology` object.
5. **Serialization**. Topologies become serializable and changes to topologies can be easily saved and communicated around. This is an important step towards implementing parallel algorithms in MDAnalysis.

For further performance comparisons, check out this [notebook](http://nbviewer.jupyter.org/gist/dotsdl/0e0fbd409e3e102d0458).

## External changes that may affect how you use MDAnalysis

Previously, every object except ``Atom`` subclassed from ``AtomGroup``.
This meant that calling `.positions` of would give you the positions of the ``Atom``s contained within that group.

Previous class structure:
```
Atom

AtomGroup -> Residue
-> ResidueGroup -> Segment
-> SegmentGroup
```
New class structure:
```
Group -> AtomGroup
-> ResidueGroup
-> SegmentGroup

Atom
Residue
Segment
```

Now each object only contains information pertaining to that particular object.
A ``Residue`` object only yields information about the residue; to get to the atoms, use ``Residue.atoms``.
Similarly, to get the atoms from a ``Segment`` or a ``SegmentGroup`` use ``Segment.atoms`` or ``SegmentGroup.atoms``.
As before, you can get all residues associated with a group with ``Group.residues`` (which returns a ``ResidueGroup``) and all segments with ``Group.segments`` (a ``SegmentGroup``).
Bottom line: you should now *always be explicit about what you want*.

### Why this was changed

Previously everything inheriting from ``AtomGroup`` made it unclear at what level of topology a given method or attribute was working on.
For example, does ``ResidueGroup.charges`` give the charge of the residues or the atoms?
Also, it was unclear what size a given output would be (see [issue 411](https://github.com/MDAnalysis/mdanalysis/issues/411)).

### How to work with the new system

To access atom-level information from anything that isn't an ``AtomGroup``, use the `.atoms` level accessor.
For example, changing all `.positions` calls on anything that isn't an `AtomGroup` to `.atoms.positions`.

## Going forward: what does this mean for MDAnalysis as a project?

A major benefit of the new topology system is that information about the topology of a ``Universe`` is now completely encapsulated in the ``Topology`` object.
This not only makes development and maintenance easier, but also opens the door to some exciting new possibilities as simulation systems grow larger.
A single ``Topology`` object can now be cleanly shared by multiple ``Universe`` instances, each with their own trajectory reader(s).
This could make common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more efficient for large systems.
The ``Topology`` object can also be serialized more easily.
This should enable parallelization on workers without shared memory (using libraries such as [``distributed``](http://distributed.readthedocs.io/en/latest/)) out-of-the-box.

Making these things work is an ongoing effort, but the MDAnalysis [coredevs](https://github.com/orgs/MDAnalysis/teams/coredevs) are working to take advantage of all these possibilities.
We look forward to the benefits this brings not only to the project, but also to all our users going forward.
We hope you like what we've done here.

-- @dotsdl and @richardjgowers