MDAnalysis · kain88-de · Apr 3, 2017 · Feb 27, 2017 · Mar 1, 2017 · Apr 2, 2017
diff --git a/_posts/2017-04-03-new-topology-system.md b/_posts/2017-04-03-new-topology-system.md
@@ -0,0 +1,108 @@
+---
+layout: post
+title: A shiny, new and faster topology system 
+---
+
+With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development.
+In fall 2015, we (@richardjgowers and @dotsdl) set to work on redesigning the topology system from scratch.
+This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (``Universe``, ``AtomGroup``, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems.
+We hoped to resolve all of these issues with our new design.
+
+The starting point of this work was (the now infamous) [issue 363](https://github.com/MDAnalysis/mdanalysis/issues/363), which floated the idea of holding all atom, residue, and segment attributes in arrays instead of lists of ``Atom``, ``Residue``, and ``Segment`` objects.
+This approach turned the way topology data such as atom names, resids, masses, etc. are stored in a ``Universe`` on its head, going from an array of structs (list of ``Atom`` objects with individual attributes) to a struct of arrays (an array for each attribute, one entry per ``Atom``).
+
+Now, over a year later, the finishing touches on this work are being prepared for release.
+This post is meant to serve as a brief view to what has changed internally, what has changed externally, and what benefits this gives us looking forward to the future.
+
+## Invisible changes to make working with MDAnalysis *faster*
+
+Most of the changes are (or should be) invisible to the user. But they made some of the most fundamental operations in MDAnalysis quite a bit *faster*. Although this section is mostly of interest to developers, it is useful for all users to know the operations that MDAnalysis can now do much faster than before (and why).
+In the new system, each atom is a member of exactly one residue, and each residue is a member of exactly one segment.
+The new `Topology` object keeps an array giving the residue membership of each atom, and likewise an array giving segment membership of each residue.
+Getting the resname of the residue of a group of atoms, then, is achieved by taking the indices of these atoms to [fancy-index](https://docs.scipy.org/doc/numpy/user/basics.indexing.html#index-arrays) the `Atoms->Residues` array, and then using the result of this to fancy-index the `Resnames` array.
+For example, if the  `Topology` has 5 atoms and 3 residues, with membership (`Atoms->Residues`) and `Resnames` arrays as below:
+
+```
+       Atoms->Residues           Resnames
+ index ---------------     index --------
+     0 0                       0 GLU
+     1 2                       1 LYS
+     2 1                       2 ALA
+     3 1
+     4 2
+```
+
+calling `AtomGroup.resnames` for an `AtomGroup` with atoms [2, 0, 1, 2] will yield (pseudocode):
+
+```
+"Atoms->Residues"[[2, 0, 1, 2]] --> [1, 0, 2, 1]
+"Resnames"[[1, 0, 2, 1]]        --> ['LYS', 'GLU', 'ALA', 'LYS']
+```
+
+This scheme only works if each atom is a member of one and only one residue, and likewise if residues are members of one and only one segment.
+Furthermore, `AtomGroup`s, `ResidueGroup`s, and `SegmentGroup`s are very thin, storing only the indices of their members as a `numpy` array.
+This gives a number of advantages:
+
+1. **Performance**. We get up to an 8x speedup over the old scheme when accessing attributes. Setting attributes can give up to a 40x speedup.
+2. **Memory**. We don't store, for example, a resname for each atom, but instead store attributes at the level they make sense for.
+3. **Consistency**. Since attributes are stored in one place, we avoid cases where the topology is in an inconsistent state, e.g. two atoms in the same residue give a different resname.
+4. **No staleness**. Because e.g. `ResidueGroup`s are only an array of indices, not a list of `Residue` objects generated upon creation of the group, changes of resiude-level properties by another `ResidueGroup` are always reflected consistently by every other one. Data is not duplicated anywhere in this scheme, and is all contained in the `Topology` object.
+5. **Serialization**. Topologies become serializable and changes to topologies can be easily saved and communicated around. This is an important step towards implementing parallel algorithms in MDAnalysis.
+
+For further performance comparisons, check out this [notebook](http://nbviewer.jupyter.org/gist/dotsdl/0e0fbd409e3e102d0458).
+
+## External changes that may affect how you use MDAnalysis
+
+Previously, every object except ``Atom`` subclassed from ``AtomGroup``.
+This meant that calling `.positions` of would give you the positions of the ``Atom``s contained within that group.
+
+Previous class structure:
+```
+Atom
+
+AtomGroup  -> Residue
+           -> ResidueGroup -> Segment
+                           -> SegmentGroup
+```
+New class structure:
+```
+Group    -> AtomGroup
+         -> ResidueGroup
+         -> SegmentGroup
+
+Atom
+Residue
+Segment
+```
+
+Now each object only contains information pertaining to that particular object.
+A ``Residue`` object only yields information about the residue; to get to the atoms, use ``Residue.atoms``.
+Similarly, to get the atoms from a ``Segment`` or a ``SegmentGroup`` use ``Segment.atoms`` or ``SegmentGroup.atoms``. 
+As before, you can get all residues associated with a group with ``Group.residues`` (which returns a ``ResidueGroup``) and all segments with ``Group.segments`` (a ``SegmentGroup``). 
+Bottom line: you should now *always be explicit about what you want*.
+
+### Why this was changed
+
+Previously everything inheriting from ``AtomGroup`` made it unclear at what level of topology a given method or attribute was working on.
+For example, does ``ResidueGroup.charges`` give the charge of the residues or the atoms?
+Also, it was unclear what size a given output would be (see [issue 411](https://github.com/MDAnalysis/mdanalysis/issues/411)).
+
+### How to work with the new system
+
+To access atom-level information from anything that isn't an ``AtomGroup``, use the `.atoms` level accessor.
+For example, changing all `.positions` calls on anything that isn't an `AtomGroup` to `.atoms.positions`.
+
+## Going forward: what does this mean for MDAnalysis as a project?
+
+A major benefit of the new topology system is that information about the topology of a ``Universe`` is now completely encapsulated in the ``Topology`` object.
+This not only makes development and maintenance easier, but also opens the door to some exciting new possibilities as simulation systems grow larger.
+A single ``Topology`` object can now be cleanly shared by multiple ``Universe`` instances, each with their own trajectory reader(s).
+This could  make common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more efficient for large systems.
+The ``Topology`` object can also be serialized more easily.
+This should enable parallelization on workers without shared memory (using libraries such as [``distributed``](http://distributed.readthedocs.io/en/latest/)) out-of-the-box.
+
+Making these things work is an ongoing effort, but the MDAnalysis [coredevs](https://github.com/orgs/MDAnalysis/teams/coredevs) are working to take advantage of all these possibilities.
+We look forward to the benefits this brings not only to the project, but also to all our users going forward.
+We hope you like what we've done here.
+
+-- @dotsdl and @richardjgowers