speed up step_boundaries #1710

stevengj · 2021-07-28T16:55:22Z

step_boundaries copies all of the chunk's boundary points into a buffer, sends them via MPI (if the destination is a different process), and then copies the received buffer to the destination boundary. This code was all written many years ago under the assumption that the computational cost was not significant (since it is proportional to the surface area rather than the volume). However, in practice it appears to be a substantial portion of the time in some cases.

Fortunately, the fact that it was never seriously optimized also means that there are probably many ways to speed it up. Here are three ideas:

1. The copy to/from buffer implementation uses an array of pointers to the source/destination locations, which means that there is a lot of pointer chasing in the inner loops. One simple optimization would be to (a) sort the pointers by address and (b) replace arithmetic sequences (runs of n addresses separated by a constant offset s) with a (pointer, s, n) triplet so that each sequence can be read using a simple strided loop.
2. Similarly, the phase factors are stored and computed individually per boundary point, when in fact many of these phases are the same, so they could be compressed with run-length encoding.
3. When the source and destination processes are the same (e.g. copying from a chunk to a neighboring PML chunk on the same process), we can presumably skip the buffer and copy directly to the destination.
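
To make the first two ideas concrete, here is a rough sketch of what the compressed representations might look like. It is not taken from the Meep source: `StridedRun`, `compress_runs`, `gather`, `PhaseRun`, and `rle_phases` are hypothetical names, plain `double` stands in for whatever field type the real code uses, and all pointers are assumed to point into a single array so that sorting and subtracting them is well defined. The grouping is a simple greedy pass, so it will not always find the smallest number of runs.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical descriptor: `count` values starting at `base`, spaced
// `stride` elements (not bytes) apart.
struct StridedRun {
  const double *base;
  std::ptrdiff_t stride;
  std::size_t count;
};

// (a) Sort the per-point pointers by address, then (b) greedily merge
// neighbors separated by a constant offset into (base, stride, count) runs.
// Assumes every pointer refers to an element of the same array.
std::vector<StridedRun> compress_runs(std::vector<const double *> ptrs) {
  std::sort(ptrs.begin(), ptrs.end(), std::less<const double *>());
  std::vector<StridedRun> runs;
  std::size_t i = 0;
  while (i < ptrs.size()) {
    StridedRun run{ptrs[i], 1, 1}; // a lone point is a run of length 1
    if (i + 1 < ptrs.size()) {
      run.stride = ptrs[i + 1] - ptrs[i];
      run.count = 2;
      while (i + run.count < ptrs.size() &&
             ptrs[i + run.count] - ptrs[i + run.count - 1] == run.stride)
        ++run.count;
    }
    runs.push_back(run);
    i += run.count;
  }
  return runs;
}

// Copy every point described by `runs` into a contiguous buffer and return
// the number of values written. The inner loop is a plain strided read.
std::size_t gather(const std::vector<StridedRun> &runs, double *buffer) {
  assert(buffer != nullptr || runs.empty());
  std::size_t n = 0;
  for (const StridedRun &run : runs)
    for (std::size_t k = 0; k < run.count; ++k)
      buffer[n++] = run.base[static_cast<std::ptrdiff_t>(k) * run.stride];
  return n;
}

// Hypothetical run-length-encoded phase: `count` consecutive boundary
// points that share the same phase factor.
struct PhaseRun {
  std::complex<double> phase;
  std::size_t count;
};

// Collapse consecutive equal phases into (phase, count) pairs.
std::vector<PhaseRun> rle_phases(const std::vector<std::complex<double>> &phases) {
  std::vector<PhaseRun> out;
  for (const std::complex<double> &ph : phases) {
    if (!out.empty() && out.back().phase == ph)
      ++out.back().count;
    else
      out.push_back({ph, 1});
  }
  return out;
}
```

One caveat with this kind of scheme: sorting changes the order in which points land in the buffer, so the source and destination pointer lists (and the per-point phases) would have to be permuted consistently on both sides of each connection.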


* Replace `connections` with separate maps for incoming and outgoing connections.
* Maintain separate connections for each chunk pair.

These changes unblock some of the improvements discussed on #1710 and #1670.

Co-authored-by: Andreas Hoenselaar

stevengj · 2021-08-19T19:24:38Z

Would be good to have some profiling data to know what fraction of the time is typically spent on copying to/from these buffers, to know if it is worth optimizing further.


stevengj added the enhancement label Jul 28, 2021

ahoenselaar mentioned this issue Aug 6, 2021

Restructure connections #1721

Merged

ahoenselaar mentioned this issue Aug 12, 2021

Process incoming chunk data immediately upon receipt. #1730

Merged
