Adapt Field, AveragedField, and ComputedField for GPU, round 2 #1057
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1057      +/-   ##
==========================================
- Coverage   57.70%   57.49%   -0.22%
==========================================
  Files         158      161       +3
  Lines        3807     3807
==========================================
- Hits         2197     2189       -8
- Misses       1610     1618       +8
Continue to review full report at Codecov.
Static ocean benchmarks (see below) show no performance regression on GPUs. In fact, CPU models seem to be ~30% faster now 🎉

Side note: some potential performance regressions may not be caught by `benchmark_static_ocean.jl`. I still think we should merge this PR, as the static ocean benchmarks do test whether adapting `Field` itself introduces performance regressions.

I'm hoping to refactor the benchmarks to reduce boilerplate and produce more useful statistics/tables. As part of that I'll add a more comprehensive benchmark that exercises an LES closure, output writing, time averaging, etc.
Environment:
Oceananigans v0.42.0 (DEVELOPMENT BRANCH)
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
GPU: TITAN V
Static ocean benchmarks from master branch:
Static ocean benchmarks Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 448s / 28.2% 31.2GiB / 0.40%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────────────────
16× 16× 16 [CPU, Float32] 10 25.9ms 0.02% 2.59ms 3.40MiB 2.68% 348KiB
16× 16× 16 [CPU, Float64] 10 32.1ms 0.03% 3.21ms 3.40MiB 2.68% 348KiB
16× 16× 16 [GPU, Float32] 10 41.2ms 0.03% 4.12ms 9.28MiB 7.32% 950KiB
16× 16× 16 [GPU, Float64] 10 45.3ms 0.04% 4.53ms 9.28MiB 7.32% 950KiB
32× 32× 32 [CPU, Float32] 10 120ms 0.09% 12.0ms 3.40MiB 2.68% 348KiB
32× 32× 32 [CPU, Float64] 10 117ms 0.09% 11.7ms 3.40MiB 2.68% 348KiB
32× 32× 32 [GPU, Float32] 10 63.0ms 0.05% 6.30ms 9.28MiB 7.32% 950KiB
32× 32× 32 [GPU, Float64] 10 41.1ms 0.03% 4.11ms 9.29MiB 7.32% 951KiB
64× 64× 64 [CPU, Float32] 10 675ms 0.53% 67.5ms 3.40MiB 2.68% 348KiB
64× 64× 64 [CPU, Float64] 10 705ms 0.56% 70.5ms 3.40MiB 2.68% 348KiB
64× 64× 64 [GPU, Float32] 10 42.7ms 0.03% 4.27ms 9.28MiB 7.32% 950KiB
64× 64× 64 [GPU, Float64] 10 43.7ms 0.03% 4.37ms 9.29MiB 7.32% 951KiB
128×128×128 [CPU, Float32] 10 5.85s 4.64% 585ms 3.40MiB 2.68% 348KiB
128×128×128 [CPU, Float64] 10 5.23s 4.14% 523ms 3.40MiB 2.68% 348KiB
128×128×128 [GPU, Float32] 10 57.8ms 0.05% 5.78ms 9.28MiB 7.32% 951KiB
128×128×128 [GPU, Float64] 10 53.5ms 0.04% 5.35ms 9.29MiB 7.32% 951KiB
256×256×256 [CPU, Float32] 10 58.5s 46.4% 5.85s 3.40MiB 2.68% 348KiB
256×256×256 [CPU, Float64] 10 53.9s 42.7% 5.39s 3.40MiB 2.68% 348KiB
256×256×256 [GPU, Float32] 10 317ms 0.25% 31.7ms 9.32MiB 7.35% 955KiB
256×256×256 [GPU, Float64] 10 321ms 0.25% 32.1ms 9.29MiB 7.32% 951KiB
──────────────────────────────────────────────────────────────────────────────────────
Static ocean benchmarks from glw/adapt-field-round-2 branch:
──────────────────────────────────────────────────────────────────────────────────────
Static ocean benchmarks Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 369s / 25.7% 31.0GiB / 0.36%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────────────────
16× 16× 16 [CPU, Float32] 10 24.4ms 0.03% 2.44ms 2.87MiB 2.49% 293KiB
16× 16× 16 [CPU, Float64] 10 25.3ms 0.03% 2.53ms 2.87MiB 2.49% 293KiB
16× 16× 16 [GPU, Float32] 10 40.3ms 0.04% 4.03ms 8.63MiB 7.50% 884KiB
16× 16× 16 [GPU, Float64] 10 38.3ms 0.04% 3.83ms 8.63MiB 7.50% 884KiB
32× 32× 32 [CPU, Float32] 10 74.6ms 0.08% 7.46ms 2.87MiB 2.49% 293KiB
32× 32× 32 [CPU, Float64] 10 72.4ms 0.08% 7.24ms 2.87MiB 2.49% 293KiB
32× 32× 32 [GPU, Float32] 10 63.5ms 0.07% 6.35ms 8.64MiB 7.50% 884KiB
32× 32× 32 [GPU, Float64] 10 44.6ms 0.05% 4.46ms 8.64MiB 7.51% 885KiB
64× 64× 64 [CPU, Float32] 10 527ms 0.56% 52.7ms 2.87MiB 2.49% 293KiB
64× 64× 64 [CPU, Float64] 10 648ms 0.68% 64.8ms 2.87MiB 2.49% 293KiB
64× 64× 64 [GPU, Float32] 10 40.5ms 0.04% 4.05ms 8.64MiB 7.50% 884KiB
64× 64× 64 [GPU, Float64] 10 50.8ms 0.05% 5.08ms 8.64MiB 7.51% 885KiB
128×128×128 [CPU, Float32] 10 4.86s 5.13% 486ms 2.87MiB 2.49% 293KiB
128×128×128 [CPU, Float64] 10 3.93s 4.15% 393ms 2.87MiB 2.49% 293KiB
128×128×128 [GPU, Float32] 10 128ms 0.13% 12.8ms 8.65MiB 7.52% 886KiB
128×128×128 [GPU, Float64] 10 46.8ms 0.05% 4.68ms 8.64MiB 7.51% 885KiB
256×256×256 [CPU, Float32] 10 43.0s 45.3% 4.30s 2.87MiB 2.49% 293KiB
256×256×256 [CPU, Float64] 10 40.6s 42.8% 4.06s 2.87MiB 2.49% 293KiB
256×256×256 [GPU, Float32] 10 317ms 0.33% 31.7ms 8.68MiB 7.54% 889KiB
256×256×256 [GPU, Float64] 10 322ms 0.34% 32.2ms 8.65MiB 7.51% 885KiB
──────────────────────────────────────────────────────────────────────────────────────
│ │ └── OffsetArrays.OffsetArray{Float64,3,Array{Float64,3}}
│ └── / at (Cell, Cell, Cell) via Oceananigans.AbstractOperations.identity
* at (Cell, Cell, Cell) via identity
├── 0.3333333333333333
Just an idea: we could pretty-print rational numbers that show up in abstract operations,

```julia
julia> rationalize(0.3333333333333333)
1//3
```

but perhaps this is misleading, as Julia is actually multiplying by `0.3333333333333333` and not `1//3`. So probably the best thing to do is just print with `eltype(model)`.
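For illustration (my own REPL example, not from the thread), the rational and the stored `Float64` compare unequal, which is why printing `1//3` could mislead:

```julia
julia> 1//3 == 0.3333333333333333  # Rational-vs-Float comparison is exact in Julia
false
```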
Hmm, we can also truncate floating point numbers to fewer significant digits by redefining `tree_show(a::Number, depth, nesting)`:

```julia
tree_show(a::Union{Number, Function}, depth, nesting) = string(a)
```
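A sketch of that idea (the `%.4g` format and the 4-digit choice are my own, assuming the `tree_show` signature quoted above):

```julia
using Printf

# Hypothetical: print numbers in operation trees with 4 significant digits;
# a separate method keeps the existing string(a) behavior for functions.
tree_show(a::Number, depth, nesting) = @sprintf("%.4g", a)
tree_show(a::Function, depth, nesting) = string(a)
```

With this, `0.3333333333333333` would print as `0.3333`.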
This PR writes new `Adapt.adapt_structure` methods for `Field`, `AveragedField`, and `ComputedField`:

- `Field` and `ComputedField` are adapted to their data (thus shedding location information, the grid, and boundary conditions). This is fine because we don't reference location information or boundary conditions inside GPU kernels.
- `AveragedField` sheds `operand` and `grid` when adapted to the GPU. `AveragedField` still needs location information for `getindex` to work correctly.

This obviates the need for `datatuple` (we still keep the function around, however, because it's useful for tests). It also obviates the need for `gpufriendly`.
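To illustrate the pattern (a minimal sketch with hypothetical type names, not the actual Oceananigans definitions):

```julia
using Adapt

# Hypothetical field type: location (X, Y, Z) lives in the type parameters;
# data, grid, and boundary conditions are stored as fields.
struct ExampleField{X, Y, Z, D, G, B}
    data :: D
    grid :: G
    boundary_conditions :: B
end

# Adapt a field to its data: GPU kernels receive a bare array, shedding the
# location, grid, and boundary conditions that kernels never reference.
Adapt.adapt_structure(to, f::ExampleField) = Adapt.adapt(to, f.data)

# A hypothetical averaged field keeps its location parameters (which getindex
# needs) but sheds operand and grid when adapted.
struct ExampleAveragedField{X, Y, Z, D, O, G}
    data :: D
    operand :: O
    grid :: G
end

ExampleAveragedField{X, Y, Z}(data::D, operand::O, grid::G) where {X, Y, Z, D, O, G} =
    ExampleAveragedField{X, Y, Z, D, O, G}(data, operand, grid)

Adapt.adapt_structure(to, a::ExampleAveragedField{X, Y, Z}) where {X, Y, Z} =
    ExampleAveragedField{X, Y, Z}(Adapt.adapt(to, a.data), nothing, nothing)
```

On kernel launch, CUDA's `cudaconvert` calls `Adapt.adapt` on every argument, applying these methods recursively, which is what lets fields be passed to kernels directly instead of being unwrapped with `datatuple` first.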
~~We can now use `AveragedField` and `ComputedField` inside kernels.~~ This still doesn't work; we need to open an issue once this PR is merged.

This PR supersedes #746.

Finally, we can dramatically simplify the time-stepping routine since we don't need to "unwrap" fields anymore.

It's probably worthwhile running a benchmark before merging, but hopefully there's no issue.

Resolves #722.