Vertically stretched grids for all topologies on CPUs and GPUs #1403

Merged 19 commits on Feb 26, 2021


@ali-ramadhan ali-ramadhan commented Feb 26, 2021

This PR completes the implementation of the vertically stretched grid (PR #1348) by supporting all topologies on the CPU and GPU.

Well, I think a batched tridiagonal solve in the vertical is only possible if the z dimension is Bounded, so "all topologies" only includes the four topologies with z being Bounded.

If the stretched dimension is periodic, then I think the system of linear equations is no longer tridiagonal (you get non-zero elements in the corner of the matrix). I vaguely recall @christophernhill mentioning that there may be a way around this?
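
To make the corner-element point concrete, here is a minimal sketch in Julia. It uses a simplified second-difference stencil with unit spacing, not the actual stretched-grid coefficients or boundary rows used in Oceananigans; `second_difference_matrix` and `is_tridiagonal` are hypothetical helper names introduced only for this illustration.

```julia
using LinearAlgebra

# Simplified second-difference operator on N points with unit spacing.
# With `periodic = true` the stencil wraps around, adding two corner entries.
function second_difference_matrix(N; periodic=false)
    A = zeros(N, N)
    for i in 1:N
        A[i, i] = -2
        i > 1 && (A[i, i-1] = 1)
        i < N && (A[i, i+1] = 1)
    end
    if periodic
        A[1, N] = 1   # top-right corner
        A[N, 1] = 1   # bottom-left corner
    end
    return A
end

# A matrix is tridiagonal if it equals the matrix built from its three central diagonals.
is_tridiagonal(A) = A == Tridiagonal(A)

A_bounded  = second_difference_matrix(6)
A_periodic = second_difference_matrix(6, periodic=true)

println(is_tridiagonal(A_bounded))   # true
println(is_tridiagonal(A_periodic))  # false
```

The two corner entries are the only difference between the matrices, but they are enough to take the periodic case outside what a plain tridiagonal solve handles.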

I'm also going to bump the version to v0.51.0 in this PR.

X-Ref: #586

src/Grids/vertically_stretched_rectilinear_grid.jl (outdated)
(Periodic, Bounded, Bounded),
(Bounded, Periodic, Bounded),
(Bounded, Bounded, Bounded)
]

🚀

ali-ramadhan commented Feb 26, 2021

Benchmarks!

As mentioned in the docs, the Fourier-tridiagonal solver is theoretically faster for large problems.

Raw numbers in all their glory are below, but to summarize:

  1. The Fourier-tridiagonal Poisson solver is indeed faster! But only on the GPU. About 1.5x faster for 256³ grids (15 -> 10 ms/solve).
  2. As a result, GPU incompressible models are faster with FourierTridiagonalPoissonSolver! But only by ~1.15x.
  3. More features with a faster solver though, seems like a win-win.
  4. I should clarify the speedup is only enjoyed by (Periodic, Periodic, Bounded) on the GPU. For channel topologies, FourierTridiagonalPoissonSolver is slower.
  5. Oceananigans.jl is roughly as fast as it was before HydrostaticFreeSurfaceModel.
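
As a rough check on the two headline ratios, the sketch below recomputes them from median times in the raw tables further down in this comment. Which table entries the summary was based on is an assumption here: the GPU medians at 256³ are used, with Float64 for the model comparison.

```julia
# Median times (ms) at 256³ on the GPU, copied from the raw benchmark tables.
# Assumption: these are the entries behind the ~1.5x and ~1.15x figures.
fft_ppb_ms         = 15.907  # FFT-based solver, (Periodic, Periodic, Bounded)
tridiagonal_ppb_ms = 10.689  # Fourier-tridiagonal solver, same topology
regular_model_ms   = 36.834  # incompressible model, regular grid, Float64
stretched_model_ms = 32.091  # incompressible model, vertically stretched grid, Float64

solver_speedup = fft_ppb_ms / tridiagonal_ppb_ms
model_speedup  = regular_model_ms / stretched_model_ms

println("solver speedup: ", round(solver_speedup, digits=2))
println("model speedup:  ", round(model_speedup, digits=2))
```

Under that assumption the ratios come out at roughly 1.49 and 1.15, in line with the summary above.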

Vertically stretched incompressible model benchmarks

Raw benchmarks

┌───────────────┬─────────────┬─────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────┐
│ Architectures │ Float_types │  Ns │        min │     median │       mean │        max │     memory │ allocs │
├───────────────┼─────────────┼─────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────┤
│           CPU │     Float32 │  32 │   6.370 ms │   7.323 ms │   7.089 ms │   7.649 ms │ 139.61 KiB │   1832 │
│           CPU │     Float32 │  64 │  48.449 ms │  49.064 ms │  49.345 ms │  51.689 ms │ 139.61 KiB │   1832 │
│           CPU │     Float32 │ 128 │ 400.112 ms │ 402.124 ms │ 409.183 ms │ 469.727 ms │ 139.61 KiB │   1832 │
│           CPU │     Float32 │ 256 │    4.036 s │    4.074 s │    4.074 s │    4.112 s │ 139.61 KiB │   1832 │
│           CPU │     Float64 │  32 │   6.343 ms │   6.425 ms │   6.573 ms │   7.301 ms │ 139.84 KiB │   1832 │
│           CPU │     Float64 │  64 │  47.988 ms │  48.355 ms │  48.603 ms │  50.857 ms │ 139.84 KiB │   1832 │
│           CPU │     Float64 │ 128 │ 405.724 ms │ 408.699 ms │ 409.013 ms │ 414.735 ms │ 139.84 KiB │   1832 │
│           CPU │     Float64 │ 256 │    4.333 s │    4.333 s │    4.333 s │    4.334 s │ 139.84 KiB │   1832 │
│           GPU │     Float32 │  32 │   2.452 ms │   2.493 ms │   2.593 ms │   3.379 ms │ 506.63 KiB │   6656 │
│           GPU │     Float32 │  64 │   2.752 ms │   2.813 ms │   2.952 ms │   4.094 ms │ 546.64 KiB │   6657 │
│           GPU │     Float32 │ 128 │   4.599 ms │   4.652 ms │   4.848 ms │   6.647 ms │ 631.61 KiB │   6655 │
│           GPU │     Float32 │ 256 │  25.930 ms │  31.884 ms │  31.279 ms │  32.044 ms │ 799.17 KiB │   6659 │
│           GPU │     Float64 │  32 │   2.484 ms │   2.551 ms │   2.660 ms │   3.582 ms │ 507.69 KiB │   6656 │
│           GPU │     Float64 │  64 │   2.793 ms │   2.828 ms │   2.972 ms │   3.709 ms │ 547.70 KiB │   6657 │
│           GPU │     Float64 │ 128 │   4.682 ms │   4.725 ms │   4.880 ms │   5.654 ms │ 632.67 KiB │   6655 │
│           GPU │     Float64 │ 256 │  25.907 ms │  32.091 ms │  31.487 ms │  32.483 ms │ 801.86 KiB │   6763 │
└───────────────┴─────────────┴─────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────┘

CPU to GPU speedup

┌─────────────┬─────┬─────────┬─────────┬─────────┐
│ Float_types │  Ns │ speedup │  memory │  allocs │
├─────────────┼─────┼─────────┼─────────┼─────────┤
│     Float32 │  32 │ 2.93732 │ 3.62888 │ 3.63319 │
│     Float32 │  64 │  17.443 │  3.9155 │ 3.63373 │
│     Float32 │ 128 │  86.446 │ 4.52412 │ 3.63264 │
│     Float32 │ 256 │ 127.777 │ 5.72434 │ 3.63483 │
│     Float64 │  32 │ 2.51868 │ 3.63039 │ 3.63319 │
│     Float64 │  64 │ 17.1005 │ 3.91654 │ 3.63373 │
│     Float64 │ 128 │ 86.4912 │ 4.52413 │ 3.63264 │
│     Float64 │ 256 │ 135.031 │ 5.73397 │ 3.69159 │
└─────────────┴─────┴─────────┴─────────┴─────────┘

Incompressible model benchmarks (regular Rectilinear grid)

Raw benchmarks

                                        Incompressible model benchmarks
┌───────────────┬─────────────┬─────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────┐
│ Architectures │ Float_types │  Ns │        min │     median │       mean │        max │     memory │ allocs │
├───────────────┼─────────────┼─────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────┤
│           CPU │     Float32 │  32 │   5.408 ms │   5.713 ms │   5.871 ms │   6.634 ms │ 287.98 KiB │   2136 │
│           CPU │     Float32 │  64 │  36.120 ms │  38.174 ms │  38.435 ms │  41.795 ms │ 287.98 KiB │   2136 │
│           CPU │     Float32 │ 128 │ 304.741 ms │ 311.332 ms │ 311.085 ms │ 315.204 ms │ 287.98 KiB │   2136 │
│           CPU │     Float32 │ 256 │    2.598 s │    2.598 s │    2.598 s │    2.599 s │ 287.98 KiB │   2136 │
│           CPU │     Float64 │  32 │   6.419 ms │   6.647 ms │   6.733 ms │   7.657 ms │ 350.52 KiB │   2136 │
│           CPU │     Float64 │  64 │  42.856 ms │  46.229 ms │  45.719 ms │  47.103 ms │ 350.52 KiB │   2136 │
│           CPU │     Float64 │ 128 │ 369.043 ms │ 380.330 ms │ 380.214 ms │ 385.820 ms │ 350.52 KiB │   2136 │
│           CPU │     Float64 │ 256 │    3.934 s │    3.943 s │    3.943 s │    3.953 s │ 350.52 KiB │   2136 │
│           GPU │     Float32 │  32 │   2.520 ms │   2.588 ms │   2.663 ms │   3.287 ms │ 685.19 KiB │   7043 │
│           GPU │     Float32 │  64 │   2.649 ms │   2.739 ms │   2.862 ms │   3.859 ms │ 724.81 KiB │   7035 │
│           GPU │     Float32 │ 128 │   3.376 ms │   3.440 ms │   3.644 ms │   5.526 ms │ 810.00 KiB │   7047 │
│           GPU │     Float32 │ 256 │  14.211 ms │  21.322 ms │  20.605 ms │  21.411 ms │ 977.47 KiB │   7045 │
│           GPU │     Float64 │  32 │   2.756 ms │   2.836 ms │   2.952 ms │   3.881 ms │ 791.30 KiB │   7043 │
│           GPU │     Float64 │  64 │   2.714 ms │   2.841 ms │   2.894 ms │   3.376 ms │ 830.92 KiB │   7035 │
│           GPU │     Float64 │ 128 │   4.960 ms │   5.049 ms │   5.117 ms │   5.493 ms │ 916.33 KiB │   7061 │
│           GPU │     Float64 │ 256 │  22.908 ms │  36.834 ms │  35.414 ms │  36.918 ms │   1.06 MiB │   7045 │
└───────────────┴─────────────┴─────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────┘

CPU to GPU speedup

      Incompressible model CPU -> GPU speedup
┌─────────────┬─────┬─────────┬─────────┬─────────┐
│ Float_types │  Ns │ speedup │  memory │  allocs │
├─────────────┼─────┼─────────┼─────────┼─────────┤
│     Float32 │  32 │ 2.20733 │ 2.37925 │ 3.29728 │
│     Float32 │  64 │ 13.9392 │ 2.51685 │ 3.29354 │
│     Float32 │ 128 │ 90.5074 │ 2.81265 │ 3.29916 │
│     Float32 │ 256 │ 121.854 │ 3.39417 │ 3.29822 │
│     Float64 │  32 │ 2.34399 │ 2.25752 │ 3.29728 │
│     Float64 │  64 │ 16.2694 │ 2.37057 │ 3.29354 │
│     Float64 │ 128 │ 75.3331 │ 2.61423 │ 3.30571 │
│     Float64 │ 256 │ 107.062 │ 3.09138 │ 3.29822 │
└─────────────┴─────┴─────────┴─────────┴─────────┘

Fourier-tridiagonal Poisson solver benchmarks

Raw benchmarks

                                       Fourier-tridiagonal Poisson solver benchmarks                         
┌───────────────┬─────┬───────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┬────────┐
│ Architectures │  Ns │                    Topologies │       min │    median │      mean │       max │    memory │ allocs │
├───────────────┼─────┼───────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼────────┤
│           CPU │ 256 │   (Bounded, Bounded, Bounded) │   1.679 s │   1.681 s │   1.703 s │   1.747 s │  2.02 KiB │     27 │
│           CPU │ 256 │  (Bounded, Periodic, Bounded) │   1.319 s │   1.324 s │   1.332 s │   1.363 s │  1.86 KiB │     27 │
│           CPU │ 256 │  (Periodic, Bounded, Bounded) │   1.349 s │   1.351 s │   1.374 s │   1.444 s │  1.86 KiB │     27 │
│           CPU │ 256 │ (Periodic, Periodic, Bounded) │   1.052 s │   1.063 s │   1.062 s │   1.068 s │  2.02 KiB │     27 │
│           GPU │ 256 │   (Bounded, Bounded, Bounded) │ 32.863 ms │ 33.356 ms │ 33.347 ms │ 33.543 ms │ 43.38 KiB │    876 │
│           GPU │ 256 │  (Bounded, Periodic, Bounded) │ 25.173 ms │ 25.849 ms │ 25.794 ms │ 25.928 ms │ 29.56 KiB │    629 │
│           GPU │ 256 │  (Periodic, Bounded, Bounded) │ 25.185 ms │ 25.761 ms │ 25.702 ms │ 25.865 ms │ 29.56 KiB │    629 │
│           GPU │ 256 │ (Periodic, Periodic, Bounded) │  9.832 ms │ 10.689 ms │ 10.631 ms │ 10.849 ms │ 13.06 KiB │    290 │
└───────────────┴─────┴───────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘

CPU to GPU speedup

        Fourier-tridiagonal Poisson solver CPU -> GPU speedup
┌─────┬───────────────────────────────┬─────────┬─────────┬─────────┐
│  Ns │                    Topologies │ speedup │  memory │  allocs │
├─────┼───────────────────────────────┼─────────┼─────────┼─────────┤
│ 256 │   (Bounded, Bounded, Bounded) │ 50.4045 │ 21.5194 │ 32.4444 │
│ 256 │  (Bounded, Periodic, Bounded) │ 51.2039 │ 15.8992 │ 23.2963 │
│ 256 │  (Periodic, Bounded, Bounded) │ 52.4472 │ 15.8992 │ 23.2963 │
│ 256 │ (Periodic, Periodic, Bounded) │ 99.4371 │ 6.48062 │ 10.7407 │
└─────┴───────────────────────────────┴─────────┴─────────┴─────────┘

Relative performance on the CPU

            Fourier-tridiagonal Poisson solver relative performance (CPU)
┌───────────────┬─────┬───────────────────────────────┬──────────┬──────────┬────────┐
│ Architectures │  Ns │                    Topologies │ slowdown │   memory │ allocs │
├───────────────┼─────┼───────────────────────────────┼──────────┼──────────┼────────┤
│           CPU │ 256 │   (Bounded, Bounded, Bounded) │  1.58185 │      1.0 │    1.0 │
│           CPU │ 256 │  (Bounded, Periodic, Bounded) │  1.24529 │ 0.922481 │    1.0 │
│           CPU │ 256 │  (Periodic, Bounded, Bounded) │  1.27117 │ 0.922481 │    1.0 │
│           CPU │ 256 │ (Periodic, Periodic, Bounded) │      1.0 │      1.0 │    1.0 │
└───────────────┴─────┴───────────────────────────────┴──────────┴──────────┴────────┘

Relative performance on the GPU

            Fourier-tridiagonal Poisson solver relative performance (GPU)
┌───────────────┬─────┬───────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │  Ns │                    Topologies │ slowdown │  memory │  allocs │
├───────────────┼─────┼───────────────────────────────┼──────────┼─────────┼─────────┤
│           GPU │ 256 │   (Bounded, Bounded, Bounded) │  3.12065 │ 3.32057 │ 3.02069 │
│           GPU │ 256 │  (Bounded, Periodic, Bounded) │  2.41833 │ 2.26316 │ 2.16897 │
│           GPU │ 256 │  (Periodic, Bounded, Bounded) │  2.41007 │ 2.26316 │ 2.16897 │
│           GPU │ 256 │ (Periodic, Periodic, Bounded) │      1.0 │     1.0 │     1.0 │
└───────────────┴─────┴───────────────────────────────┴──────────┴─────────┴─────────┘
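
The slowdown column appears to be each topology's median time divided by the median of the fastest topology. The sketch below, which assumes that definition and uses the rounded GPU medians from the raw Fourier-tridiagonal table, reproduces the column to about three decimal places.

```julia
# GPU median times (ms) for the Fourier-tridiagonal solver at 256³, from the raw table.
medians_ms = [
    ("(Bounded, Bounded, Bounded)",   33.356),
    ("(Bounded, Periodic, Bounded)",  25.849),
    ("(Periodic, Bounded, Bounded)",  25.761),
    ("(Periodic, Periodic, Bounded)", 10.689),
]

# Assumption: the baseline is the (Periodic, Periodic, Bounded) row, whose slowdown is 1.0.
baseline_ms = medians_ms[end][2]

# Slowdown of each topology relative to the baseline.
slowdowns = [(topology, t / baseline_ms) for (topology, t) in medians_ms]

for (topology, s) in slowdowns
    println(rpad(topology, 32), round(s, digits=3))
end
```

Small differences in the last digits are expected because the table was presumably computed from unrounded timings.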

FFT-based Poisson solver

Raw benchmarks

                                              FFT-based Poisson solver benchmarks                  
┌───────────────┬─────┬───────────────────────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │                    Topologies │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼───────────────────────────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           CPU │ 256 │   (Bounded, Bounded, Bounded) │    1.366 s │    1.370 s │    1.369 s │    1.373 s │ 192 bytes │      4 │
│           CPU │ 256 │  (Bounded, Periodic, Bounded) │    1.138 s │    1.146 s │    1.148 s │    1.157 s │ 160 bytes │      2 │
│           CPU │ 256 │  (Periodic, Bounded, Bounded) │    1.147 s │    1.148 s │    1.152 s │    1.161 s │ 160 bytes │      2 │
│           CPU │ 256 │ (Periodic, Periodic, Bounded) │ 843.212 ms │ 849.492 ms │ 849.080 ms │ 853.401 ms │ 160 bytes │      2 │
│           GPU │ 256 │   (Bounded, Bounded, Bounded) │  17.252 ms │  38.642 ms │  36.505 ms │  38.756 ms │ 84.38 KiB │    898 │
│           GPU │ 256 │  (Bounded, Periodic, Bounded) │  13.979 ms │  31.085 ms │  29.365 ms │  31.110 ms │ 57.56 KiB │    641 │
│           GPU │ 256 │  (Periodic, Bounded, Bounded) │  13.975 ms │  30.948 ms │  29.250 ms │  30.985 ms │ 57.75 KiB │    647 │
│           GPU │ 256 │ (Periodic, Periodic, Bounded) │   7.257 ms │  15.907 ms │  15.044 ms │  15.927 ms │ 27.97 KiB │    292 │
└───────────────┴─────┴───────────────────────────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

CPU to GPU speedup

            FFT-based Poisson solver CPU -> GPU speedup
┌─────┬───────────────────────────────┬─────────┬────────┬────────┐
│  Ns │                    Topologies │ speedup │ memory │ allocs │
├─────┼───────────────────────────────┼─────────┼────────┼────────┤
│ 256 │   (Bounded, Bounded, Bounded) │ 35.4428 │  450.0 │  224.5 │
│ 256 │  (Bounded, Periodic, Bounded) │ 36.8697 │  368.4 │  320.5 │
│ 256 │  (Periodic, Bounded, Bounded) │ 37.0953 │  369.6 │  323.5 │
│ 256 │ (Periodic, Periodic, Bounded) │ 53.4034 │  179.0 │  146.0 │
└─────┴───────────────────────────────┴─────────┴────────┴────────┘

CPU relative performance

                FFT-based Poisson solver relative performance (CPU)
┌───────────────┬─────┬───────────────────────────────┬──────────┬────────┬────────┐
│ Architectures │  Ns │                    Topologies │ slowdown │ memory │ allocs │
├───────────────┼─────┼───────────────────────────────┼──────────┼────────┼────────┤
│           CPU │ 256 │   (Bounded, Bounded, Bounded) │  1.61224 │    1.2 │    2.0 │
│           CPU │ 256 │  (Bounded, Periodic, Bounded) │  1.34917 │    1.0 │    1.0 │
│           CPU │ 256 │  (Periodic, Bounded, Bounded) │  1.35141 │    1.0 │    1.0 │
│           CPU │ 256 │ (Periodic, Periodic, Bounded) │      1.0 │    1.0 │    1.0 │
└───────────────┴─────┴───────────────────────────────┴──────────┴────────┴────────┘

GPU relative performance

                 FFT-based Poisson solver relative performance (GPU)
┌───────────────┬─────┬───────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │  Ns │                    Topologies │ slowdown │  memory │  allocs │
├───────────────┼─────┼───────────────────────────────┼──────────┼─────────┼─────────┤
│           GPU │ 256 │   (Bounded, Bounded, Bounded) │  2.42923 │ 3.01676 │ 3.07534 │
│           GPU │ 256 │  (Bounded, Periodic, Bounded) │  1.95419 │  2.0581 │ 2.19521 │
│           GPU │ 256 │  (Periodic, Bounded, Bounded) │  1.94553 │  2.0648 │ 2.21575 │
│           GPU │ 256 │ (Periodic, Periodic, Bounded) │      1.0 │     1.0 │     1.0 │
└───────────────┴─────┴───────────────────────────────┴──────────┴─────────┴─────────┘

System info

Oceananigans v0.50.0
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
  GPU: TITAN V
