Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add draft post introducing distopia #269

Closed
wants to merge 10 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
358 changes: 358 additions & 0 deletions _posts/2023-03-03-distopia.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,358 @@
---
layout: post
title: Introducing Distopia
---

Calculating Euclidian distances under periodic boundary conditions is a core component in both running and analysing molecular simulations. For example, the calculation of radial distribution functions, hydrogen bonds, structure factors, and many other common analyses require calculation of one or more pairwise Euclidian distances.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps link to a wiki page on PBC or cite a paper ... making it more scholarly

comment on different "simulation boxes" would be helpful

Image illustrating PBC (at least in 2D) would be icing on the cake.


The Euclidian distance formula under periodic boundary conditions can be expressed as:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define R and A


$$
R = \sqrt{A}
$$

Many molecular mechanics engines have highly optimised code for calculating distances that often make use of specialised code paths and/or accelerators to achieve maximal performance. However this functionality is often heavily "baked-in" and is not able to be used as a standalone component.

We at MDAnalysis wanted to bring some of this blazing speed to our distance calculations without sacrificing the modularity of MDAnalysis by introducing heavyweight dependencies on a simulation engine.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say early on how to use it in MDA and from which release onwards it is available.

The detail below is really informative and it's great to have technical blog posts. I would just make sure that our general users get the message that (1) we have fast distance calculations and (2) you can use them right away.


## Enter Distopia ##

Instead we developed a lightweight C++ / Python package **distopia** that uses explicit x86 SIMD vectorisation to accelerate calculation of Euclidian distances under periodic boundary conditions. The package is small and aims to provide both a Python and C++ layer to interface with if you would like to quickly calculate some distances! It is available on [PyPi][Distopia PyPi] and [conda forge][Distopia conda-forge] for x86 architectures if you would like to try it out.

Try one of:

```bash
pip install distopia
```

or,

```bash
conda install -c conda-forge distopia
```

You can also see distopia under development at its home on [github][Distopia github].

## What advantages does Distopia provide? ##

### Distopia is fast! ###

A comparison of distopia with several different approaches used in different simulation analysis packages is provided in the graph below. We compare with [MDAnalysis's][MDAnalysis] current distance backend, [MDTraj's][MDTraj] distance backend and [freud's][freud] distance backend.




We can see that distopia is faster than the approaches taken in the other packages (especially MDAnalysis!). This is due to the extensive efforts taken to improve vectorisation, data alignment and use low-latency CPU operations. More details about how this is achieved can be found below.

### Distopia is easy to use! ###

Distopia is easy to use, with a simple, [well documented python layer][Distopia docs]. Calculating the pairwise distances between two NumPy arrays is as simple as.

```python
import distopia

dists = distopia.calc_bonds_ortho_float(coords1, coords2, box)
```

Distopia provides both `float` and `double` precision versions of functions to calculate distances and currently supports orthorhombic periodic boundary conditions, or no boundary conditions.

### Distopia is open to the community! ###

We don't intend the benchmarks in this blog post to draw attention to relative performance of one package versus another. Instead we aim to discuss the merits of different approaches and provide an easy to use layer that other packages can adopt if they so wish.

We also welcome all kinds of feedback about distopia on our [github][Distopia github] or on the [MDAnalysis Discord][MDAnalysis discord].

## How does distopia compare to existing approaches? ##

Currently, many simulation analysis packages offer some kind of compiled layer for the numerically intensive job of calculating distances. MDAnalysis currently achieves this with a [C header wrapped in Cython][mda backend].

Distopia uses **explicit SIMD vectorisation** along with a set of "array of structs" -> "struct of arrays" transforms to get the maximum speed out of the available CPU hardware.

Lets break down what each of these mean:

### SIMD vectorisation ###

Optimising compilers aim to use the best possible set of CPU instructions, data pipelines and branch predictions to run code as fast as possible. Part of this involves the automatic compilation of code to use **Single Instruction Multiple Data** instructions that broadcast an operation over multiple data elements using specialised registers. For example, an AVX SIMD enabled **multiply** may be able to work on a 8 32 bit floats at a time using a x86 `YMM` register introduced with the AVX SIMD instruction set.

The following C++ snippet compiled with `gcc 12` using the `-O3` and `-mavx2` flags shows this nicely. It computes the scalar product $c = a * b$

```c++
void sax(float *c, float *a, float *b, int n) {
for (size_t i=0; i<n; i++) {
c[i] = a[i] * b[i];
}
}

```
Resulting in the following assembly with only the core of the unrolled loop
shown:

```assembly
vmovups ymm1, YMMWORD PTR [rsi+rax] ; load data from a
vmulps ymm0, ymm1, YMMWORD PTR [rdx+rax] ; load data from b
vmovups YMMWORD PTR [rdi+rax], ymm0 ; multiple
... ; store the result
```


However the compiler cannot see all ends and is often incapable of identifying large scale data transformations that may result in better SIMD utilisation down the line. The MDAnalysis distance library is a good example of this, showing limited used of the SIMD registers that would result in better performance when compiled with SIMD compiler flags.


```c++
#include <cmath>
void VanillaCalcDistances(const float *r_i, const float *r_j,
unsigned int nvals, float *output)
{
for (unsigned int i = 0; i < nvals; ++i)
{
float r2 = 0.0;
for (unsigned char j = 0; j < 3; ++j)
{
float rij = r_i[i * 3 + j] - r_j[i * 3 + j];
r2 += rij * rij;
}
*output++ = std::sqrt(r2);
}
}

```

Which results in the following assembly with only one iteration of the core of the inner loop shown:

```assembly
vmovss xmm0, DWORD PTR [rdi+rax*4] ; load data from r_i
vsubss xmm0, xmm0, DWORD PTR [rsi+rax*4] ; subtract data from r_j
vmulss xmm0, xmm0, xmm0 ; multiply by self to get the square
vaddss xmm0, xmm0, xmm1 ; add to the running total
; repeat twice more (loop unrolled 3 times total)

vsqrtss xmm0, xmm0, xmm0 ; take the square root of r2
vmovss DWORD PTR [rcx-4], xmm0 ; store the result
```

Notably this failed to use the YMM registers at all. Additionally the compiler was forced to use the `vsqrtss` instruction which is a scalar instruction that only works on a single float at a time as it was unable to identify that the data **could** be aligned in a way that would allow it to use the vectorised sqrt instruction `vsqrtps`.`

So what can we do to remedy this situation? There are two main possible paths that vary in difficulty.

1. We can try and help the compiler by organising the data in a manner that might help it

2. We can use **explicit vectorisation** where we use C++ functions that directly embed the assembly instructions we want (or very close to) in the generated assembly. This allows us to have much more control over when and how data is transformed into SIMD constructs.

Packages like `freud` use option 1, while `MDTraj` and `MDAnalysis` use option 2 but in different ways.

Lets explore each of these in slightly more depth:


#### Option 1: Helping the compiler ####

Our first strategy is to try and make the compiler's job easier by laying out the data in a way that it can more readily identify opportunities for vectorisation.
The easiest way to show this is with an example. Consider the following code for computing distances taken from the `freud` simulation analysis package.

```c++

inline float computeDistance(vec3<float>& r_i, vec3<float>& r_j) {
const vec3<float> r_ij = wrap(r_j - r_i);
return std::sqrt(dot(r_ij, r_ij));
}

// where the vec3 datastructure is roughly

template<class Real> struct vec3
{
Real x {0};
Real y {0};
Real z {0};
};

template<class Real> inline vec3<Real> operator*(const vec3<Real>& a, const vec3<Real>& b) {

return vec3<Real>(a.x * b.x, a.y * b.y, a.z * b.z);
}

```

Using the `vec3` datastructure enables the compiler to **automatically** generate more optimally vectorised code, as the operations on the `x`, `y` and `z` elements of the vectors can be separated into registers as operations on coordinates are paired.

For example, `x` coordinates can easily be multiplied with `x` coordinates and `y` with `y` (in this case using operator overloading) resulting in more opportunities for vectorisation.


This is sometimes called a Struct Of Arrays (SOA) data layout rather than an Array of Structs (AOS) layout and is best demonstrated by comparing the two example classes.

```c++

class SOA {
// an array for each dimensions
float* x; // xxxxx...
float* y; // yyyyy...
float* z; // zzzzz...
};

class AOS {
// an array containing repeated coordinates
float* coordinates; //xyzxyzxyz.....
};
```

This approach is a fantastic first step for optimising compiler vectorisation. However, we can do more.


#### Option 2: Explicit vectorisation ####

Rather than relying on the compiler to do our vectorisation for us, we can use **SIMD intrinsics** to have a much finer grained control over vectorisation in our program. The general way SIMD intrinsics work is that a single C++ intrinsic function roughly corresponds to a single CPU instruction but does not require writing of assembly. For example we could write our `sax()` function from above using AVX2 intrinsics as such:

```c++
#include <immintrin.h>

void sax(float *c, float *a, float *b, int n) {
for (size_t i=0; i<n; i+=8) {
__m256 a_ = _mm256_loadu_ps(a+i); // load 8 values into a 256 bit YMM register
__m256 b_ = _mm256_loadu_ps(b+i); // load 8 values into a 256 bit YMM register
__m256 c_ = _mm256_mul_ps(a_, b_); // multiply 2x YMM register
_mm256_store_ps(c + i, c_); // store the result
}
}

```

This results in the very compact assembly (only core loop shown)


```assembly
vmovups ymm1, YMMWORD PTR [rsi+rax*4]
vmulps ymm0, ymm1, YMMWORD PTR [rdx+rax*4]
vmovups YMMWORD PTR [rdi+rax*4], ymm0
```


You might notice that this is almost same assembly as the compiler generated for the `sax()` function
above. This is because the compiler is able to recognise the pattern of operations and generate highly optimised assembly without help from us.

So what about the case where the compiler is unable to generate the optimal assembly? In particular how can we apply explicit vectorisation in our code for calculating distances? We saw above that the compiler was unable to generate optimal assembly for the `VanillaCalcDistances()` function above.

One way is to use SIMD intrinsics to perform our calculation using an **AOS** data layout. This is the approach taken in the `MDTraj` library.

Examine this annotated code snippet from the `MDTraj` library.

```c++
void dist_mic(const float* xyz, const int* pairs, const float* box_matrix,
float* distance_out, float* displacement_out,
const int n_frames, const int n_atoms, const int n_pairs) {

for (int i = 0; i < n_frames; i++) {
// Load the periodic box vectors etc ...
for (int j = 0; j < n_pairs; j++) {
// Compute the displacement.
int offset1 = 3*pairs[2*j + 0];
// load 3 values from an AOS xyz buffer
fvec4 pos1(xyz[offset1], xyz[offset1+1], xyz[offset1+2], 0);
int offset2 = 3*pairs[2*j + 1];
// load 3 values from an AOS xyz buffer.
fvec4 pos2(xyz[offset2], xyz[offset2+1], xyz[offset2+2], 0);
fvec4 r12 = pos2-pos1;
r12 -= round(r12*inv_box_size)*box_size;
// Store results....
}
}
}
```

The `fvec4` class handles the explicit SIMD vectorisation and holds the coordinate values in an **AOS format**.

```c++
class fvec4 {
public:
__m128 val; // holds xyz and one blank coordinate. (x, y, z, something)
// define all operators and functions

};
```

This also provides some speed improvements over the compiler generated code. However, it is still suboptimal as a slot for a blank coordinate is wasted. Additionally the operations over matched dimensions
are not separated into registers as they are in the **SOA** data layout. Additionally due to the loading of only three coordinate elements at a time (x,y,z) larger SIMD registers above SSE sizes (4 `float` coordinates) cannot be used to accelerate performance.

## So how does distopia work? ##

Distopia is a combination of the two approaches above. It uses the **SOA** data layout and explicitly
vectorised operations to generate highly optimised assembly. This is based on packing the coordinates in each dimension into SIMD registers. This is best demonstrated by the `SOASimd` class.

```c++
template <typename simdT>
class SOASimd {
// an array for each dimensions
simdT x; // xxxxx...
simdT y; // yyyyy...
simdT z; // zzzzz...


// define operations and operator overloads
operator*(const SOASimd<simdT>& a, const SOASimd<simdT>& b) {

return SOASimd<simdT>(a.x * b.x, a.y * b.y, a.z * b.z);
}

};

```

The `simdT` type is a template parameter that can be a SIMD register of any width.
The downside to this approach is the overhead required to pack the coordinates into the SIMD registers
in the correct order. This however is a one time cost and can be amortised by performing multiple
vectorised operations on the coordinates.

This comines the strengths of the two approaches above allowing vectorised SIMD operations to be applied accross each dimension and allows the Distopia system to benefit from wider SIMD registers.

In this way the Distopia distance calculation can be written as such:

```c++

template <typename simdT>
void DistopiaCalcDistances(const SOASimd<simdT> r_i, const SOASimd<simdT> r_j,
unsigned int nvals, float *output) {
// calculate the displacement
SOASimd<simdT> r_ij = r_j - r_i;
// calculate the squared distance
SOASimd<simdT> r2 = r_ij * r_ij;
// sum the squared distance
simdT r2_sum = r2.x + r2.y + r2.z;
// take the square root
simdT r = simd_sqrt(r2_sum);
// store the result
r.store(output);
```

Notice that the subtraction, multiplication and addition operations are all applied to each dimension of the coordinates. The sqrt and store operations are also applied to the entire SIMD register, resulting in
fantastic performance.


## Conclusion

In this post we have seen how we can use the `distopia` library to accelerate distance calculations. Distopia is available as a backend to some MDAnalysis distiance functions right now, with support for more to be added in the future.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distance



```python
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
# calculate pairwise distances distance between two atom groups
atomgroup1 = u.select_atoms("selection 1")
atomgroup2 = u.select_atoms("selection 2")

# calculate the distances using the distopia backend
distopia_distances = mda.lib.distances.calc_bonds(atomgroup1, atomgroup2, backend="distopia")

```

Come try it out on the develop branch of `MDAnalysis`! We welcome community feedback so if you have any questions or comments, please join us on the [MDAnalysis discord] or open an issue on the [Distopia github].

[Distopia github]: https://github.com/MDAnalysis/distopia
[Distopia PyPi]: https://pypi.org/project/distopia/
[Distopia conda-forge]: https://github.com/conda-forge/distopia-feedstock
[Distopia docs]: https://www.mdanalysis.org/distopia
[MDAnalysis]: https://www.mdanalysis.org
[MDAnalysis discord]: https://www.mdanalysis.org/2021/03/20/discord/
[MDTraj]: https://mdtraj.org/
[freud]: https://freud.readthedocs.io/en/stable/
[mda backend]: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/include/calc_distances.h