
Mixed precision MD #355

Closed
joaander opened this issue Mar 1, 2019 · 10 comments
Labels
breaking Changes that will break API. enhancement New feature or request essential Work that must be completed. md MD component

Comments

@joaander
Member

joaander commented Mar 1, 2019

Description

Implement mixed precision for MD simulations.

Motivation

Compared to a full double precision build, mixed precision may offer performance benefits - especially on GeForce cards. Compared to a full single precision build, mixed precision simulations will conserve energy and momentum significantly better.

Implementation details

The most reasonable mixed precision model for HOOMD is to maintain particle coordinates in double, compute forces in single, and accumulate forces in double. HOOMD is too general for fixed precision force calculations, and double-single implementations do not offer enough benefits given the implementation complexity.
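
Roughly, a force kernel under this model would take the shape of the sketch below. This is a minimal illustration only: the type and function names (LongReal, ShortReal, accumulatePairForces) are placeholders rather than existing HOOMD API, and the potential is a bare LJ with sigma = epsilon = 1.

```cpp
// Placeholder typedefs for illustration only.
typedef double LongReal;  // particle coordinates, force accumulators
typedef float ShortReal;  // per-pair force arithmetic

// Toy O(N^2) pair loop: coordinates stay in double, the per-pair math runs in
// single, and the running force sums are accumulated in double.
void accumulatePairForces(const LongReal* x, const LongReal* y, const LongReal* z,
                          unsigned int N, ShortReal rcutsq,
                          LongReal* fx, LongReal* fy, LongReal* fz)
    {
    for (unsigned int i = 0; i < N; ++i)
        for (unsigned int j = i + 1; j < N; ++j)
            {
            // delta r from double coordinates, then demoted to single
            ShortReal dx = ShortReal(x[i] - x[j]);
            ShortReal dy = ShortReal(y[i] - y[j]);
            ShortReal dz = ShortReal(z[i] - z[j]);
            ShortReal rsq = dx * dx + dy * dy + dz * dz;
            if (rsq >= rcutsq || rsq == ShortReal(0))
                continue;

            // single precision LJ evaluation (sigma = epsilon = 1)
            ShortReal r2inv = ShortReal(1.0) / rsq;
            ShortReal r6inv = r2inv * r2inv * r2inv;
            ShortReal force_divr = r2inv * r6inv * (ShortReal(48.0) * r6inv - ShortReal(24.0));

            // accumulate in double
            fx[i] += LongReal(force_divr * dx);
            fy[i] += LongReal(force_divr * dy);
            fz[i] += LongReal(force_divr * dz);
            fx[j] -= LongReal(force_divr * dx);
            fy[j] -= LongReal(force_divr * dy);
            fz[j] -= LongReal(force_divr * dz);
            }
    }
```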

Care must be taken when testing a new mixed precision implementation: evaluate the energy and momentum conservation of a number of test systems that exercise the relevant code paths (pair potentials, bond potentials, ...).

The existing HPMC mixed precision type and the new MD mixed precision type should be merged and controlled by the same compile time option.

Question for debate

Should we continue to support single, double, and mixed precision builds? Certainly this will be helpful in developer testing, but it may require significant maintenance overhead in the future. If we decide to keep supporting more than one combination, which do we test and validate for users? Only mixed?

@joaander joaander added the enhancement New feature or request label Mar 1, 2019
@joaander joaander added this to the v3.0 milestone Mar 1, 2019
@mphoward
Collaborator

mphoward commented Mar 1, 2019

The most reasonable mixed precision model for HOOMD is to maintain particle coordinates in double, compute forces in single, and accumulate forces in double. HOOMD is too general for fixed precision force calculations, and double-single implementations do not offer enough benefits given the implementation complexity.

Agreed, we can't reasonably assume a certain number of decimal places are needed, and double-single is really only beneficial on something like GeForce where fp64 is crippled. In contrast, even Tesla cards might still benefit from using a mixture of native fp32 and fp64 ops where appropriate.

Should we continue to support single, double, and mixed precision builds? Certainly this will be helpful in developer testing, but it may require significant maintenance overhead in the future. If we decide to keep supporting more than one combination, which do we test and validate for users? Only mixed?

I had thought about this a little bit as a potential project for someone. One flexible solution might be to replace the current Scalar by two floating point types: Double and Float (need better names so someone doesn't fat-finger the shift key and get the native type by mistake). Double would refer to all values that we believe should be fp64 in a mixed precision build (particle coordinates, forces, accumulation, etc.), while Float would be for anything we believe could be fp32 in a mixed precision build without sacrificing accuracy (result of dr for pair forces, internal force compute, etc.). It would still be valid to hard code (native) double where it is absolutely unacceptable to use fp32, and there should more or less never be a spot that uses (native) float.

Then, the default build could be mixed precision with Double = double and Float = float. We can force true double precision by setting Double = double and Float = double for applications that really need it, or someone can use full single precision if they really insist on it with Double = float, Float = float. I would vote to test full double and mixed precision, but I rarely use true single precision builds anymore, even on GeForce cards.
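
Wired up to the build system, that could look roughly like the following sketch. The macro names are illustrative placeholders, not existing HOOMD build options.

```cpp
// Illustrative only: macro names are placeholders, not existing HOOMD build options.
#if defined(HOOMD_FULL_SINGLE)
typedef float Double;       // full single precision build
typedef float Float;
#elif defined(HOOMD_FULL_DOUBLE)
typedef double Double;      // full double precision build
typedef double Float;
#else
typedef double Double;      // default: mixed precision
typedef float Float;
#endif
```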

I think it's an open question how much performance is really available to be gained from doing this (we could try to guess by comparing double to single in the current version for some representative benchmarks). It has always sounded like an awful lot of effort because someone will really need to scrape line-by-line and make the needed changes very carefully.

@joaander joaander added the md MD component label May 9, 2019
@joaander joaander removed this from the v3.0.beta.1 milestone Oct 7, 2020
@joaander
Member Author

This has come up again. We don't need to necessarily take on this entire project in one PR. If we implement the base data types and compilation options, then we can slowly introduce mixed precision to portions of the codebase as key bottlenecks are identified.

With this in mind: While an eventual goal might be to remove Scalar entirely, let's for now keep Scalar as a typedef to the high precision data type.

As discussed by @mphoward above, the first thing we need is a name for the new data types. Double and Float are an option, though potentially too easy to confuse with double and float which we will hard code in some places on purpose. Real seems to be in common usage for floating point typedefs. I haven't come to a conclusion on what names to use, here are a few ideas to consider:

  • HPReal / LPReal (high-precision, low precision)
  • WideReal / NarrowReal
  • Real64 / Real32 - misnomers as these can be typedefed both to 64-bit or both to 32-bit.

The alternative is to stop considering supporting full single precision builds and only support full double and mixed modes. In this case Real is either double or float (default float). The existing Scalar would always be double and as I said above, it will probably continue to exist for a while only because there is little point in modifying the entire codebase at this time (though it could be done with global search and replace tools).

@mphoward
Collaborator

I think we should decide how we want to approach this problem in general, as it might affect whether it makes sense to keep supporting single-precision builds. If we only want to support double and mixed modes, we would need to go through the code and identify points where doubles could safely be converted to floats. If we want to support single precision, there could still be parts where doubles are required (this is the case in MPCD, and I could imagine it is also the case in HPMC), so we would need to split all the doubles into "should-be" (i.e., typedef double) vs. "must-be" (i.e., native double). These points are rare, but could make automating this slow.

In any case, I'm in favor of typenames like the first option, but I might propose an alternative of LongReal/Real. I think this would keep the names readable (a little hard with the abbreviations in option 1) and short (which the names in option 2 are not), and I agree that "64" and "32" in option 3 imply a fixed number of bits in the type, which could cause confusion. I think that keeping Scalar around for now is a good idea, but I would favor removing it eventually because I know I would get confused between Scalar and Real.
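
To illustrate the "should-be" vs. "must-be" distinction with the proposed name, a sketch (the functions here are hypothetical, not existing code):

```cpp
// LongReal tracks "should-be" doubles; a hypothetical full single precision
// build could demote this typedef to float.
typedef double LongReal;

// "Should-be" double: goes through the typedef, so a single precision build may demote it.
LongReal meanPosition(const LongReal* x, unsigned int N)
    {
    LongReal sum(0);
    for (unsigned int i = 0; i < N; ++i)
        sum += x[i];
    return sum / LongReal(N);
    }

// "Must-be" double: hard-coded native double because fp32 accumulation over a
// large system is never acceptable here, regardless of the build configuration.
double totalEnergy(const float* per_particle_energy, unsigned int N)
    {
    double sum = 0.0;
    for (unsigned int i = 0; i < N; ++i)
        sum += double(per_particle_energy[i]);
    return sum;
    }
```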

@joaander
Member Author

I like the LongReal / Real suggestion. I too would like to get rid of Scalar eventually.

The general wisdom from other projects that have implemented mixed precision MD is that the particle coordinates and integration steps need to be in double precision, while individual force computations can be performed with floats. The summation of net forces and energies either needs to be double or use something like Kahan summation, or both (easiest to just use doubles).
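
For reference, Kahan (compensated) summation keeps the accumulator in single precision but carries a correction term for the lost low-order bits. A minimal sketch follows; note it only works if the compiler is not allowed to reassociate floating point math (e.g. no -ffast-math).

```cpp
// Minimal Kahan summation sketch: accumulates float terms while carrying a
// running compensation for the low-order bits lost in each addition.
float kahanSum(const float* terms, unsigned int N)
    {
    float sum = 0.0f;
    float c = 0.0f; // running compensation
    for (unsigned int i = 0; i < N; ++i)
        {
        float y = terms[i] - c;
        float t = sum + y;
        c = (t - sum) - y; // recovers what was lost in (sum + y)
        sum = t;
        }
    return sum;
    }
```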

HPMC already has a complete mixed precision mode. We would only need to replace OverlapReal with the new nominally low precision Real type.

I agree that fully automating this process is challenging. That's why I'm proposing a process for gradually adding in support. E.g. we could convert MD pair potentials in one PR, bond potentials in another, MPCD in another, and so on.

We will need to get our MD validation framework enabled for the v3 builds in order to verify that the changes don't break simulation correctness. We will also need to put together some energy conservation test scripts to further validate the changes beyond what the automated tests can check.
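
As a rough idea of what such a script would check (the slope-fit metric below is only an assumption, not an established HOOMD criterion): compute the relative energy drift per timestep from a logged time series of the total energy and require it to stay below a tolerance.

```cpp
#include <cmath>
#include <cstddef>

// Sketch: relative energy drift per timestep from a logged total-energy series,
// estimated as the least-squares slope of E vs. sample index.
double relativeEnergyDrift(const double* total_energy, std::size_t n_samples,
                           double steps_per_sample)
    {
    double mean_i = 0.0, mean_E = 0.0;
    for (std::size_t i = 0; i < n_samples; ++i)
        {
        mean_i += double(i);
        mean_E += total_energy[i];
        }
    mean_i /= double(n_samples);
    mean_E /= double(n_samples);

    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < n_samples; ++i)
        {
        num += (double(i) - mean_i) * (total_energy[i] - mean_E);
        den += (double(i) - mean_i) * (double(i) - mean_i);
        }

    // slope per sample -> per timestep, relative to the mean energy magnitude
    return (num / den) / steps_per_sample / std::fabs(mean_E);
    }
```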

@mphoward
Collaborator

Excellent, I support this plan. MPCD already works in mixed precision as well, although in a different sense: some steps require double. I will think about whether the particle data needs to be single or double and address that in a later PR.

One other core change we might need to make is in HOOMDMath. I think most of the vector operations are defined using Scalar. Will we support mixed-precision vector operations for LongReal and Real (and then the compiler would promote each element internally), or will we require that the user explicitly do the typecast (math between same type only)? The first is more convenient, but the second prevents possible issues of someone accidentally mixing in fp64 math.
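
To make the two alternatives concrete, here is a sketch with a toy vec3 (not the actual HOOMDMath declarations):

```cpp
// Toy vec3 for illustration; not the actual HOOMDMath type.
template<class Real>
struct vec3
    {
    Real x, y, z;
    };

// Alternative 1: a mixed-type operator; each float element is promoted to double.
// Convenient, but silently introduces fp64 math wherever the two types meet.
inline vec3<double> operator+(const vec3<double>& a, const vec3<float>& b)
    {
    return vec3<double>{a.x + b.x, a.y + b.y, a.z + b.z};
    }

// Alternative 2: same-type operators only; mixing requires an explicit cast at the
// call site, which keeps any accidental promotion to fp64 visible in the code.
template<class Real>
inline vec3<Real> operator+(const vec3<Real>& a, const vec3<Real>& b)
    {
    return vec3<Real>{a.x + b.x, a.y + b.y, a.z + b.z};
    }
```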

@vyasr
Contributor

vyasr commented Mar 29, 2021

@mphoward's proposal (two separate typedefs) is how I imagined this would be implemented, so I also support this plan. I don't have a strong opinion on the naming; I agree that we should avoid names that could be confused with built-in types (Double) or that imply a certain width (Real64), but either LongReal or HPReal would be fine with me. Gradually adding support seems like a good idea to me. In addition to the benefits already discussed, it would also allow us to develop a robust testing/validation suite to convince ourselves of the conservation properties of the mixed precision model that we could then apply to other potentials.

@joaander
Member Author

While I would like to get this started, I'm going to focus on completing the 3.0 release first. Will start work on mixed precision in 3.1+

@joaander
Member Author

When adding mixed precision MD, consider adding single-precision optimized GPU code to the fat binary: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list
This would increase compilation time and binary size, but would reduce application startup time.

@joaander joaander added the essential Work that must be completed. label Feb 4, 2022
@joaander joaander removed this from the future milestone Mar 8, 2022
@joaander
Member Author

Also:

Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.

https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

@joaander joaander added the breaking Changes that will break API. label Aug 12, 2022
@joaander
Member Author

The current single precision optimized GPUs have a 1:64 double:single ALU ratio. This makes double operations so expensive that even a small number of double ops per particle pair drastically drops performance. I implemented the mixed precision mode discussed here and found that the main bottleneck kernels are 1) the neighbor list build and 2) the pair force evaluation. The net force summation, integration, cell list, and other kernels on A40 were comparable to the timings on A100, as the time to run these kernels is dominated by launch latency.

To support large and/or dilute systems, we must compute the delta r and the box minimum image convention for particle pairs in double. Otherwise, particles near each other in large boxes will lose precision: for example, single precision coordinates in the 1000s leave only about 2 significant figures in the delta: 1000.XY - 1000.ZW. Comparing A40 to A100, performing just the delta r in double (in both the neighbor list and pair force kernels) reduces performance to 1/2. Performing both the delta r and the minimum image convention in double reduces performance to ~1/10. The minimum image convention optimizes down to only a few SASS instructions and I was not able to find a way to make it any faster. One alternative is a LAMMPS-style approach using ghost particles even in serial, but that would be a massive restructuring of HOOMD that I do not wish to implement.
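
A tiny standalone illustration of that cancellation (the coordinate values are arbitrary):

```cpp
#include <cstdio>

// Two nearby particles in a large box: when the coordinates themselves are stored
// in fp32, only a few significant digits of the true separation survive.
int main()
    {
    double xa = 1000.12, xb = 1000.34;       // coordinates kept in double
    float xaf = float(xa), xbf = float(xb);  // the same coordinates stored as float

    double dr_double = xa - xb;              // -0.22 to full precision
    float dr_float = xaf - xbf;              // error ~1e-4 from rounding the inputs

    std::printf("double delta: %.10f\n", dr_double);
    std::printf("float  delta: %.10f\n", double(dr_float));
    return 0;
    }
```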

1/10 is far below usable performance, so I am not going to pursue this work further at this time. I will open a PR on trunk-major that implements the ShortReal and LongReal types so that other places in the codebase are able to implement build time configurable mixed precision, but I will not be including an implementation for MD.
