
Feature/madd #233

Merged · 3 commits merged into mumax:master on Aug 7, 2020

Conversation

peytondmurray
Contributor

The RK23, RK4, RK45, and RK56 solvers make use of the Madd2, Madd3, Madd4, Madd5, Madd6, and Madd7 functions, but only Madd2 and Madd3 are implemented as CUDA kernels; the rest essentially just call nested combinations of Madd2 and Madd3. The solvers therefore launch more CUDA kernels than necessary at each timestep whenever Madd4, Madd5, Madd6, or Madd7 is called. As I understand it, the overhead of launching a CUDA kernel can be large, and to a lesser extent there is also overhead for each Go function call.

| Solver | Kernel launches per step (current) | Kernel launches needed per step |
|--------|------------------------------------|---------------------------------|
| RK23   | 6                                  | 4                               |
| RK4    | 5                                  | 4                               |
| RK45   | 15                                 | 7                               |
| RK56   | 21                                 | 9                               |
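The launch-count savings can be sketched on the CPU. Below, madd2 is a serial stand-in for the Madd2 kernel, where each call represents one kernel launch: madd4Composed builds a four-term multiply-add out of three madd2 passes (three "launches" plus a temporary buffer), while madd4Fused does the same work in a single pass. The helper names and the particular composition are illustrative, not mumax3's actual cuda package code.

```go
package main

import "fmt"

// madd2 computes dst[i] = f1*a[i] + f2*b[i], a CPU analogue of the
// Madd2 CUDA kernel. Each call stands in for one kernel launch.
func madd2(dst, a, b []float32, f1, f2 float32) {
	for i := range dst {
		dst[i] = f1*a[i] + f2*b[i]
	}
}

// madd4Composed builds f1*a + f2*b + f3*c + f4*d out of three madd2
// passes: three "launches", a temporary buffer, and three trips
// through memory.
func madd4Composed(dst, a, b, c, d []float32, f1, f2, f3, f4 float32) {
	tmp := make([]float32, len(dst))
	madd2(tmp, a, b, f1, f2)   // tmp = f1*a + f2*b
	madd2(dst, c, d, f3, f4)   // dst = f3*c + f4*d
	madd2(dst, tmp, dst, 1, 1) // dst = tmp + dst
}

// madd4Fused computes the same result in one pass: one "launch",
// no temporary.
func madd4Fused(dst, a, b, c, d []float32, f1, f2, f3, f4 float32) {
	for i := range dst {
		dst[i] = f1*a[i] + f2*b[i] + f3*c[i] + f4*d[i]
	}
}

func main() {
	a := []float32{1, 2}
	b := []float32{3, 4}
	c := []float32{5, 6}
	d := []float32{7, 8}
	u := make([]float32, 2)
	v := make([]float32, 2)
	madd4Composed(u, a, b, c, d, 1, 2, 3, 4)
	madd4Fused(v, a, b, c, d, 1, 2, 3, 4)
	fmt.Println(u, v) // both [50 60]
}
```

On a GPU the fused version saves not only launch overhead but also the extra global-memory traffic for the temporary.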

I implemented CUDA versions of Madd4, Madd5, Madd6, and Madd7, and modified the solvers to use them. The simple benchmark included in the test folder (sp4_madd_bench.mx3) shows essentially no improvement for RK23 and RK4, but a few percent (~5%) improvement in the time run() takes to finish on my machine for RK45 and RK56.

@godsic
Contributor

godsic commented Jul 3, 2019

@peytondmurray Have you run the unit tests (test/run.bash)? Have you tried our benchmark script (bench/bench.mx3)?

@peytondmurray
Contributor Author

peytondmurray commented Jul 3, 2019

The first time I ran test/run.bash, the tests failed at minimizer.mx3, but the expected value was within 0.00002 of the test value, very close to the tolerance. Interestingly, the second time I ran test/run.bash, all tests passed. I wonder what causes this variability?

Here is the benchmark result: benchmark.txt

@godsic
Contributor

godsic commented Jul 4, 2019

@peytondmurray minimizer is not a core module, but a contributed mumax3 module. We never felt it was deterministic enough to replace relax, due to numerical stability issues like the one you see. I believe the issue is partly due to non-deterministic execution order on NVIDIA GPUs: the IEEE 754 standard does not require floating-point operations to be associative, so any reshuffling of calculations by the driver or hardware schedulers to maximize GPU occupancy will generally produce different numerical noise and may lead minimizer to a different state.

The bottom line is: don't bother with the minimizer unit test unless it gives a terribly wrong result.

What I would like to ask you to do is run bench without your patches, to figure out whether there are any performance benefits.
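The non-associativity of float32 addition is easy to demonstrate; this small Go program (purely illustrative, not part of mumax3) shows two groupings of the same three-term sum producing different results:

```go
package main

import "fmt"

func main() {
	// At magnitude 1e8 the spacing between adjacent float32 values
	// is 8, so adding 1 to -1e8 is lost to rounding entirely.
	a, b, c := float32(1e8), float32(-1e8), float32(1)
	left := (a + b) + c  // 0 + 1 = 1
	right := a + (b + c) // 1e8 + (-1e8) = 0
	fmt.Println(left, right) // 1 0
}
```

A scheduler that reorders a long reduction is effectively choosing between such groupings on every run, which is why bitwise-identical results are not guaranteed.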

@peytondmurray
Contributor Author

@godsic The benchmark uses the Heun solver, which is unaffected by this change, so I instead ran the benchmark for each of the four solvers above on both the master and feature/madd branches. I didn't see much of a difference at first; I suspected this was because the benchmark only runs 100 steps of the solver, so I set the benchmark script to run 1000 steps instead. The results still don't make much sense to me: while my initial tests above showed some modest improvement in performance, the story now is much less clear:

result

@godsic
Contributor

godsic commented Jul 4, 2020

@peytondmurray Would it be possible for you to run the benchmark again? We are about to release 3.10, and I am happy to merge this one if it provides performance benefits.

@peytondmurray
Contributor Author

peytondmurray commented Jul 9, 2020

@godsic Sorry to be slow - I'm preparing the benchmarks to run overnight, and will post results tomorrow.

@peytondmurray
Contributor Author

peytondmurray commented Jul 12, 2020

@godsic I rebased feature/madd onto develop and reran a benchmark on both branches. The benchmark was modified from standard problem #4:

  1. The magnetization was initialized and relaxed before being saved in an .ovf.
  2. For each of the RK23BS, RK4, RK45DP, and RK56 solvers, the magnetization was loaded from the same .ovf, the same B_ext from SP4 was applied, and the time the solver took to run for 1000 steps was recorded. This procedure was repeated 10 times for each solver. Here's the script I used to test the solvers:
// Benchmark script, modified from standard problem #4: for each of six
// doubling grid sizes, load the relaxed magnetization, then time 1000
// steps of each solver, repeating 10 times per solver.
nx := 128
ny := 32
nz := 1

// Timestamps, initialized here so they can be reassigned inside the loops.
t0 := now()
t1 := now()
t2 := now()
t3 := now()
t4 := now()
t5 := now()
t6 := now()
t7 := now()
nsteps := 1000

// Material parameters and applied field from standard problem #4.
Msat = 800e3
Aex = 13e-12
alpha = 0.02
B_ext = vector(-24.6E-3, 4.3E-3, 0)
setcellsize(1e-9, 1e-9, 3e-9)

for i := 0; i < 6; i += 1 {
        t_start := now()
        setgridsize(nx, ny, nz)

        for j := 0; j < 10; j += 1 {
                mfile := sprint("/home/pdmurray/go/src/github.com/mumax/3/test/make_relaxed_configs.out/m_", nx, "-", ny, "-", nz, ".ovf")

                SetSolver(3) // RK23 (Bogacki-Shampine)
                m.LoadFile(mfile)
                t0 = now()
                steps(nsteps)
                t1 = now()

                SetSolver(4) // RK4 (classical Runge-Kutta)
                m.LoadFile(mfile)
                t2 = now()
                steps(nsteps)
                t3 = now()

                SetSolver(5) // RK45 (Dormand-Prince)
                m.LoadFile(mfile)
                t4 = now()
                steps(nsteps)
                t5 = now()

                SetSolver(6) // RK56 (Fehlberg)
                m.LoadFile(mfile)
                t6 = now()
                steps(nsteps)
                t7 = now()
                print(sprintf("%d, %d, %d, %6.6E, %6.6E, %6.6E, %6.6E", nx, ny, nz, t1.sub(t0).Seconds(), t3.sub(t2).Seconds(), t5.sub(t4).Seconds(), t7.sub(t6).Seconds()))
        }
        t_end := now()
        print(t_end.sub(t_start).Seconds())

        // Double the in-plane grid dimensions for the next pass.
        nx = 2 * nx
        ny = 2 * ny
}

The benchmarks were run on a GTX 1080 Ti. After running the simulations, I plotted the execution times via matplotlib; the first row is a scatter plot of all the raw execution times as a function of the number of cells N; the second row is the mean execution time as a function of N; and the final row is the ratio of the mean execution time on the feature/madd branch to the mean execution time using the master branch. In all cases, black points correspond to the master branch and red points correspond to feature/madd. Here are the results:

bench

And here are the raw execution times I recorded, formatted into csv files:
csv_data.zip

Overall it looks like feature/madd is maybe a bit faster, depending on the solver.
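For reference, the ratio plotted in the bottom row is just the mean feature/madd execution time divided by the mean master execution time for each solver and grid size. A minimal sketch of that reduction, using made-up numbers rather than the recorded times from csv_data.zip:

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Hypothetical per-repeat wall times in seconds for one solver
	// and grid size; not the actual measurements.
	master := []float64{10.2, 10.4, 10.3, 10.1}
	madd := []float64{9.8, 9.9, 9.7, 10.0}
	ratio := mean(madd) / mean(master)
	fmt.Printf("ratio = %.3f (%.1f%% faster)\n", ratio, (1-ratio)*100)
}
```

A ratio below 1 means the feature branch is faster for that configuration.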

@godsic godsic merged commit b3ce215 into mumax:master Aug 7, 2020
JeroenMulkers added a commit that referenced this pull request Aug 10, 2020
This script is used for the benchmarks mentioned in pull request #233.

This script does not actively check anything, it takes a long time to run, and it no longer has any purpose. There is no reason to keep it as a 'unit test' in the test directory, so the file is removed.