
Profiling #34

Open · chrisiacovella opened this issue Feb 22, 2024 · 2 comments

Comments

@chrisiacovella (Member)

This issue is meant to serve as a placeholder for optimization and profiling of the code, and for how performance may be impacted by our design choices (e.g., the use of deepcopy of SamplerState in places).

@chrisiacovella (Member, Author)

Reproducing John Chodera's comment from the PR:

While "premature optimization is the root of all evil",
we should think ahead to the very small number of key methods we will need to ensure can be efficiently jitted to perform well on GPU hardware and pay attention to how we structure these methods.

My current understanding is that we will really want the following methods to be very fast:

  1. The inner loop of MD integration
  2. The computation of the potential energy gradient (from grad applied to the potential energy function) within this MD step loop
  3. The pairlist/displacement vector (re)generation that is also called within the MD step loop

It's worth making sure we clear any irregular computation out of these loops, and potentially building some profiling harnesses so we can monitor what kernel execution looks like in these inner loops and make sure we are keeping the GPU busy.

We can always systematically improve this later, but the key idea here is that we want to make sure our API and code structure will permit us to work hard on those three methods, since we expect the overwhelmingly large fraction of our compute time to be spent there.

I think those are the three parts that will need the most attention.
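To make items 1 and 2 concrete, here is a minimal sketch of the pattern I have in mind: the forces come from jax.grad of the potential, and the whole inner loop runs on-device as a single jitted jax.lax.fori_loop. Everything in this sketch (the harmonic stand-in potential, the constants, the function names) is illustrative, not our actual code:

```python
import jax
import jax.numpy as jnp

dt = 1.0e-3   # illustrative time step
mass = 1.0    # illustrative particle mass

def potential(positions):
    # Harmonic well as a stand-in; the real code would call the LJ potential.
    return 0.5 * jnp.sum(positions ** 2)

# Item 2: the potential energy gradient comes straight from autodiff.
grad_fn = jax.grad(potential)

def md_step(_, state):
    # One velocity-Verlet step; plain jnp math like this jits cleanly.
    positions, velocities = state
    velocities = velocities + 0.5 * dt * (-grad_fn(positions)) / mass
    positions = positions + dt * velocities
    velocities = velocities + 0.5 * dt * (-grad_fn(positions)) / mass
    return positions, velocities

@jax.jit
def run_md(positions, velocities, n_steps):
    # Item 1: fori_loop keeps the entire inner loop in one compiled program,
    # so the host dispatches once per trajectory segment, not once per step.
    return jax.lax.fori_loop(0, n_steps, md_step, (positions, velocities))

positions = jax.random.normal(jax.random.PRNGKey(0), (1000, 3))
positions, velocities = run_md(positions, jnp.zeros_like(positions), 1000)
```

Structured this way, the hot paths stay inside a single compiled region, which is exactly the "keep the GPU busy" property described in the quoted comment.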

Right now, the pair computation (currently limited to Lennard-Jones, as that is the only potential implemented) and the neighbor/pair list functions are already written with the appropriate functions jitted for performance. Many of the other routines will not benefit from jitting (or in many cases cannot really be jitted).
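For reference, the basic shape of that pattern looks something like the sketch below (again illustrative, not our actual implementation): the pair list is a fixed-shape integer array, so regenerating it changes values rather than shapes, and the jitted energy function compiles only once.

```python
import jax
import jax.numpy as jnp

@jax.jit
def lj_energy(positions, pair_indices, sigma=1.0, epsilon=1.0):
    # pair_indices has fixed shape (n_pairs, 2); in practice, padding/masking
    # would keep that shape static as the neighbor list is regenerated.
    r_ij = positions[pair_indices[:, 0]] - positions[pair_indices[:, 1]]
    r2 = jnp.sum(r_ij ** 2, axis=-1)
    sr6 = (sigma ** 2 / r2) ** 3                      # (sigma / r)^6
    return jnp.sum(4.0 * epsilon * (sr6 ** 2 - sr6))  # 4*eps*[(s/r)^12 - (s/r)^6]

# The forces for the MD loop then come for free from autodiff:
# forces = -jax.grad(lj_energy)(positions, pair_indices)
```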

I had done some "offline" (i.e., outside of this PR) benchmarking of the Langevin integrator, looking at different approaches to how we jit the routine (e.g., there are a few consistent patterns that can be used to minimize jit overhead). Once this is all merged, I was going to create a few benchmarks of different approaches in an issue so we can figure out, discuss, and brainstorm the most efficient implementation.

Preliminarily, it seems that jax.numpy itself is already quite efficient even without jitting the integration, which I suspect is because the amount of "work" done to integrate is small compared to, say, computing the potential. That is, any speed gains are offset by jit overhead and by moving data between the CPU and GPU.
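When I write those benchmarks up, the harness will look roughly like this (a sketch, with made-up sizes and a toy update standing in for the real Langevin step): the important details are excluding the first call, which includes compilation, and calling block_until_ready so JAX's asynchronous dispatch doesn't make the timings meaningless.

```python
import time
import jax
import jax.numpy as jnp

def bare_step(x, v):
    # Toy update standing in for one Langevin integrator step.
    v = 0.99 * v - 1.0e-3 * x
    return x + 1.0e-3 * v, v

jitted_step = jax.jit(bare_step)

def benchmark(step_fn, n_steps=1000):
    x = jax.random.normal(jax.random.PRNGKey(0), (10_000, 3))
    v = jnp.zeros_like(x)
    x, v = step_fn(x, v)       # warm-up: triggers compilation for the jitted case
    x.block_until_ready()
    start = time.perf_counter()
    for _ in range(n_steps):
        x, v = step_fn(x, v)
    x.block_until_ready()      # flush the async dispatch queue before stopping the clock
    return time.perf_counter() - start

print("unjitted:", benchmark(bare_step))
print("jitted:  ", benchmark(jitted_step))
```

For the kernel-level monitoring mentioned in the quoted comment, the timed region can also be wrapped in `with jax.profiler.trace(log_dir):` and inspected in TensorBoard, which would show whether the GPU actually stays busy between steps.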

