
Feature/madd #233

Merged · 3 commits merged into mumax:master on Aug 7, 2020

Conversation

peytondmurray
Contributor

The RK23, RK4, RK45, and RK56 solvers make use of the Madd2, Madd3, Madd4, Madd5, Madd6, and Madd7 functions, but only Madd2 and Madd3 are implemented as CUDA kernels; the rest essentially just call nested combinations of Madd2 and Madd3. The solvers therefore launch more CUDA kernels than necessary at each timestep whenever Madd4, Madd5, Madd6, or Madd7 is called. As I understand it, the overhead of launching a CUDA kernel can be large, and to a lesser extent there is also overhead for each Go function call.

| Solver | Kernel launches per step (current) | Kernel launches needed per step |
|--------|------------------------------------|---------------------------------|
| RK23   | 6                                  | 4                               |
| RK4    | 5                                  | 4                               |
| RK45   | 15                                 | 7                               |
| RK56   | 21                                 | 9                               |
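The launch-count savings can be sketched on the CPU. Below, madd2 is a serial stand-in for the Madd2 kernel, where each call represents one kernel launch: madd4Composed builds a four-term multiply-add out of three madd2 passes (three "launches" plus a temporary buffer), while madd4Fused does the same work in a single pass. The helper names and the particular composition are illustrative, not mumax3's actual cuda package code.

```go
package main

import "fmt"

// madd2 computes dst[i] = f1*a[i] + f2*b[i], a CPU analogue of the
// Madd2 CUDA kernel. Each call stands in for one kernel launch.
func madd2(dst, a, b []float32, f1, f2 float32) {
	for i := range dst {
		dst[i] = f1*a[i] + f2*b[i]
	}
}

// madd4Composed builds f1*a + f2*b + f3*c + f4*d out of three madd2
// passes: three "launches", a temporary buffer, and three trips
// through memory.
func madd4Composed(dst, a, b, c, d []float32, f1, f2, f3, f4 float32) {
	tmp := make([]float32, len(dst))
	madd2(tmp, a, b, f1, f2)   // tmp = f1*a + f2*b
	madd2(dst, c, d, f3, f4)   // dst = f3*c + f4*d
	madd2(dst, tmp, dst, 1, 1) // dst = tmp + dst
}

// madd4Fused computes the same result in one pass: one "launch",
// no temporary.
func madd4Fused(dst, a, b, c, d []float32, f1, f2, f3, f4 float32) {
	for i := range dst {
		dst[i] = f1*a[i] + f2*b[i] + f3*c[i] + f4*d[i]
	}
}

func main() {
	a := []float32{1, 2}
	b := []float32{3, 4}
	c := []float32{5, 6}
	d := []float32{7, 8}
	u := make([]float32, 2)
	v := make([]float32, 2)
	madd4Composed(u, a, b, c, d, 1, 2, 3, 4)
	madd4Fused(v, a, b, c, d, 1, 2, 3, 4)
	fmt.Println(u, v) // both [50 60]
}
```

On a GPU the fused version saves not only launch overhead but also the extra global-memory traffic for the temporary.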

I implemented CUDA versions of Madd4, Madd5, Madd6, and Madd7, and modified the solvers to use them. The simple benchmark included in the test folder (sp4_madd_bench.mx3) shows essentially no improvement for RK23 and RK4, but a few percent (~5%) improvement in the time run() takes to finish on my machine for RK45 and RK56.

@godsic
Contributor

godsic commented Jul 3, 2019

@peytondmurray Have you run the unit tests (test/run.bash)? Have you tried our benchmark script (bench/bench.mx3)?

@peytondmurray
Contributor Author

peytondmurray commented Jul 3, 2019

The first time I ran test/run.bash, the tests failed at minimizer.mx3, but the expected value was within 0.00002 of the test value, very close to the tolerance. Interestingly, the second time I ran test/run.bash, all tests passed. I wonder what causes this variability?

Here is the benchmark result: benchmark.txt

@godsic
Contributor

godsic commented Jul 4, 2019

@peytondmurray minimizer is not a core module, but a contributed mumax3 module. We never felt it was deterministic enough to replace relax, due to numerical stability issues like the one you see. I believe the issue is partly due to non-deterministic execution order on NVIDIA GPUs: the IEEE 754 standard does not require floating-point operations to be associative, so any reshuffling of calculations by the driver or hardware schedulers to maximize GPU occupancy will generally produce different numerical noise and may lead minimizer to a different state.

The bottom line is: don't bother with the minimizer unit test unless it gives a terribly wrong result.

What I would like to ask you to do is run bench without your patches, to figure out whether there are any performance benefits.
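The non-associativity of float32 addition is easy to demonstrate; this small Go program (purely illustrative, not part of mumax3) shows two groupings of the same three-term sum producing different results:

```go
package main

import "fmt"

func main() {
	// At magnitude 1e8 the spacing between adjacent float32 values
	// is 8, so adding 1 to -1e8 is lost to rounding entirely.
	a, b, c := float32(1e8), float32(-1e8), float32(1)
	left := (a + b) + c  // 0 + 1 = 1
	right := a + (b + c) // 1e8 + (-1e8) = 0
	fmt.Println(left, right) // 1 0
}
```

A scheduler that reorders a long reduction is effectively choosing between such groupings on every run, which is why bitwise-identical results are not guaranteed.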

@peytondmurray
Contributor Author

@godsic The benchmark uses the Heun solver, which is unaffected by this change, so I instead ran the benchmark for each of the four solvers above on both the master and feature/madd branches. I didn't see much of a difference at first; I suspected this was because the benchmark only runs 100 steps of the solver, so I set the benchmark script to run 1000 steps instead. The results still don't make much sense to me: while my initial tests above showed some modest improvement in performance, the story now is much less clear:

result

@godsic
Contributor

godsic commented Jul 4, 2020

@peytondmurray Would it be possible for you to run the benchmark again? We are about to release 3.10, and I am happy to merge this one if it provides performance benefits.

@peytondmurray
Contributor Author

peytondmurray commented Jul 9, 2020

@godsic Sorry to be slow - I'm preparing the benchmarks to run overnight, and will post results tomorrow.

@peytondmurray
Contributor Author

peytondmurray commented Jul 12, 2020

@godsic I rebased feature/madd onto develop and reran a benchmark on both branches. The benchmark was modified from standard problem #4:

  1. The magnetization was initialized and relaxed before being saved in an .ovf.
  2. For each of the RK23BS, RK4, RK45DP, and RK56 solvers, the magnetization was loaded from the same .ovf, the same B_ext from SP4 was applied, and the time the solver took to run for 1000 steps was recorded. This procedure was repeated 10 times for each solver. Here's the script I used to test the solvers:
// Benchmark script, modified from standard problem #4: for each of six
// doubling grid sizes, load the relaxed magnetization, then time 1000
// steps of each solver, repeating 10 times per solver.
nx := 128
ny := 32
nz := 1

// Timestamps, initialized here so they can be reassigned inside the loops.
t0 := now()
t1 := now()
t2 := now()
t3 := now()
t4 := now()
t5 := now()
t6 := now()
t7 := now()
nsteps := 1000

// Material parameters and applied field from standard problem #4.
Msat = 800e3
Aex = 13e-12
alpha = 0.02
B_ext = vector(-24.6E-3, 4.3E-3, 0)
setcellsize(1e-9, 1e-9, 3e-9)

for i := 0; i < 6; i += 1 {
        t_start := now()
        setgridsize(nx, ny, nz)

        for j := 0; j < 10; j += 1 {
                mfile := sprint("/home/pdmurray/go/src/github.com/mumax/3/test/make_relaxed_configs.out/m_", nx, "-", ny, "-", nz, ".ovf")

                SetSolver(3) // RK23 (Bogacki-Shampine)
                m.LoadFile(mfile)
                t0 = now()
                steps(nsteps)
                t1 = now()

                SetSolver(4) // RK4 (classical Runge-Kutta)
                m.LoadFile(mfile)
                t2 = now()
                steps(nsteps)
                t3 = now()

                SetSolver(5) // RK45 (Dormand-Prince)
                m.LoadFile(mfile)
                t4 = now()
                steps(nsteps)
                t5 = now()

                SetSolver(6) // RK56 (Fehlberg)
                m.LoadFile(mfile)
                t6 = now()
                steps(nsteps)
                t7 = now()
                print(sprintf("%d, %d, %d, %6.6E, %6.6E, %6.6E, %6.6E", nx, ny, nz, t1.sub(t0).Seconds(), t3.sub(t2).Seconds(), t5.sub(t4).Seconds(), t7.sub(t6).Seconds()))
        }
        t_end := now()
        print(t_end.sub(t_start).Seconds())

        // Double the in-plane grid dimensions for the next pass.
        nx = 2 * nx
        ny = 2 * ny
}

The benchmarks were run on a GTX 1080 Ti. After running the simulations, I plotted the execution times via matplotlib; the first row is a scatter plot of all the raw execution times as a function of the number of cells N; the second row is the mean execution time as a function of N; and the final row is the ratio of the mean execution time on the feature/madd branch to the mean execution time using the master branch. In all cases, black points correspond to the master branch and red points correspond to feature/madd. Here are the results:

bench

And here are the raw execution times I recorded, formatted into csv files:
csv_data.zip

Overall it looks like feature/madd is maybe a bit faster, depending on the solver.
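For reference, the ratio plotted in the bottom row is just the mean feature/madd execution time divided by the mean master execution time for each solver and grid size. A minimal sketch of that reduction, using made-up numbers rather than the recorded times from csv_data.zip:

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Hypothetical per-repeat wall times in seconds for one solver
	// and grid size; not the actual measurements.
	master := []float64{10.2, 10.4, 10.3, 10.1}
	madd := []float64{9.8, 9.9, 9.7, 10.0}
	ratio := mean(madd) / mean(master)
	fmt.Printf("ratio = %.3f (%.1f%% faster)\n", ratio, (1-ratio)*100)
}
```

A ratio below 1 means the feature branch is faster for that configuration.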

@godsic godsic merged commit b3ce215 into mumax:master Aug 7, 2020
JeroenMulkers added a commit that referenced this pull request Aug 10, 2020
This script is used for the benchmarks mentioned in pull request #233.

This script does not actively check anything, it takes a long time to run, and it no longer has any purpose. There is no reason to keep it as a 'unit test' in the test directory, so the file is removed.