Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complex AbstractOperations cannot be computed on GPU #1241

Closed
tomchor opened this issue Dec 2, 2020 · 52 comments · Fixed by #1595
Closed

Complex AbstractOperations cannot be computed on GPU #1241

tomchor opened this issue Dec 2, 2020 · 52 comments · Fixed by #1595

Comments

@tomchor
Copy link
Collaborator

tomchor commented Dec 2, 2020

I have the following output writer set-up in my simulation:

import Oceananigans.Fields: ComputedField
u, v, w = model.velocities.u, model.velocities.v, model.velocities.w
b, pHY′, pNHS = model.tracers.b, model.pressures.pHY′, model.pressures.pNHS
ν, νₑ = model.closure.ν, model.diffusivities.νₑ
#-----

#-----
import Oceananigans.AbstractOperations: ∂x, ∂y, ∂z

ddx = ∂x(u)^2 + ∂x(v)^2 + ∂x(w)^2
ddy = ∂y(u)^2 + ∂y(v)^2 + ∂y(w)^2
ddz = ∂z(u)^2 + ∂z(v)^2 + ∂z(w)^2
#avg_ε = AveragedField(ComputedField(ddx + ddy + ddz), dims=(1,))
avg_ε = AveragedField*(ddx + ddy + ddz), dims=(1,))

outputs == avg_ε,)
simulation.output_writers[:avg_field_writer] =
    NetCDFOutputWriter(model, outputs,
                       filepath = "avg.jd15_3dbounded.nc",
                       schedule = TimeInterval(2minutes),
                       mode = "c")

This works successfully on CPUs, but running on GPUs I get a huge amount of error lines with some

 [16] compute! at /glade/u/home/tomasc/.julia/packages/Oceananigans/6JcUu/src/Fields/averaged_field.jl:86 [inlined]

Running the simulation without that output works for both GPUs and CPUs.

Am I doing something wrong here?

@glwagner
Copy link
Member

glwagner commented Dec 2, 2020

Ah, you may be running into some hidden limitations of AbstractOperations --- they seem to fail when the operations are too complex.

It's a shifty problem, because it appears to depend on the julia compiler. It's good that you opened this issue because it would be nice to document our efforts for solving this tricky problem.

I don't think this is a problem of AveragedFields specifically. Rather it's an issue with ComputedFields, are more specifically, evaluating complex AbstractOperations on the GPU.

Over at LESbrary, we are circumventing this issue by hand-writing particularly important complicated kernels. I think ViscousDissipation may be the object you're looking for:

https://github.com/CliMA/LESbrary.jl/blob/master/src/TurbulenceStatistics/viscous_dissipation.jl

@tomchor
Copy link
Collaborator Author

tomchor commented Dec 2, 2020

Thanks! Yeah, that's what I was trying to do. I see you already have some SGS stuff there as well.

Do you have an example of how to use LESbrary? I'm not that familiar with Julia and just from reading the README I can' quite understand how it's supposed to work.

Thanks!

@glwagner
Copy link
Member

glwagner commented Dec 2, 2020

There's some LESbrary examples but the package definitely needs some love (wink wink):

https://github.com/CliMA/LESbrary.jl/blob/6cfe60cf44031f9357936058f08eed0f1bb5f9fc/examples/langmuir_turbulence.jl#L240

To get LESbrary.jl into your environment you use the package manager, eg:

using Pkg
Pkg.add("https://github.com/CliMA/LESbrary.jl.git")

using LESbrary.TurbulenceStatistics: ViscousDissipation

@glwagner
Copy link
Member

glwagner commented Dec 2, 2020

One strategy that might help with complex AbstractOperations is to define intermediate ComputedFields, and then reuse these ComputedFields in higher-order AbstractOperations. This incurs extra computation but might be a stopgap if a low-level kernel like the one that's been written for ViscousDissipation is not available.

@glwagner glwagner added the abstractions 🎨 Whatever that means label Dec 2, 2020
@glwagner glwagner changed the title Calculation of averaged field fails for GPUs but is successful or CPUs Complex AbstractOperations cannot be computed on GPU Dec 2, 2020
@glwagner
Copy link
Member

glwagner commented Dec 2, 2020

This is probably related to some of the struggles in #870

@tomchor
Copy link
Collaborator Author

tomchor commented Dec 2, 2020

One strategy that might help with complex AbstractOperations is to define intermediate ComputedFields, and then reuse these ComputedFields in higher-order AbstractOperations. This incurs extra computation but might be a stopgap if a low-level kernel like the one that's been written for ViscousDissipation is not available.

I've tried a version of that already and it didn't work, but there are probably more things I could try. For now it's best not to reinvent the wheel and use LESbrary! But I'll keep this in mind when doing stuff like this in the future.

@glwagner
Copy link
Member

glwagner commented Dec 2, 2020

Ultimately @tomchor we would definitely rather use AbstractOperations to diagnose ViscousDissipation rather than hard-coding it, as we had to resort to in LESbrary.jl. But this is tough issue to solve I think so yes, I think LESbrary.jl stuff is good for now. It'd be nice also to add other diagnostics that cannot be specified by AbstractOperations to LESbrary.jl. And we need documentation and tests, etc, etc...

@tomchor tomchor mentioned this issue Dec 3, 2020
8 tasks
@glwagner
Copy link
Member

glwagner commented Dec 3, 2020

Another issue is that I believe one cannot embed AveragedFields into AbstractOperations on the GPU, despite that we have developed the abstraction and machinery so that everything does work as expected on the CPU. Embedding AveragedFields in AbstractOperations is needed to calculate things like turbulent kinetic energy and so is important for LES. Currently there's a custom field TurbulentKineticEnergy in the LESbrary.jl for this purpose:

https://github.com/CliMA/LESbrary.jl/blob/master/src/TurbulenceStatistics/turbulent_kinetic_energy.jl

It might be good to come up with a list of AbstractOperations that we would like to work on the GPU eventually, and then write tests for them and add them in a PR. We can then @test_skip them until either we figure out how to solve the issue or, perhaps, the julia compiler changes in a way that makes the tests pass... :-D

@francispoulin
Copy link
Collaborator

I have not used GPUs and don't appreciate the difficulty here at all but would be happy to discuss this sometime if people wanted to have a brainstorming session. Certainly starting simple is what I would recommend.

@ali-ramadhan
Copy link
Member

ali-ramadhan commented Dec 4, 2020

Yeah I think it's just an unknown GPU compilation problem/failure. Unclear to me whether it's the fault of Oceananigans.AbstractOperations or something in CUDA.jl (or even KernelAbstractions.jl).

Certainly starting simple is what I would recommend.

Maybe it makes sense to try and condense some abstract operations into an isolated minimal working example which we can use to open an issue on CUDA.jl if that turns out to be the problem? Might help isolate the problem, and would certainly be much easier to debug.

I recall we encounter a limitation of the GPU compiler when trying to construct a GPU model with too many arbitrary tracers. I think in this case the type information was too large to even fit into the argument of a CUDA kernel or something so maybe this was a hard GPU limitation? Would be unfortunate if something similar is happening for complex abstract operations.

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

Here's a starting list:

@ali-ramadhan not sure if this is what you mean but when trying to adapt Field to the GPU we did encounter an error that was something like "ptx arguments consume too much parameter space". I believe this is an issue compiling very large objects (Field is large because it has boundary conditions). I don't think that this is the error we get for too-complex AbstractOperations though. I think for abstract operations its a type-inference issue. We should dump the whole error into this issue so that we know. I also think the error might change as the julia compiler evolves, or when we upgrade CUDA.jl.

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

Here's the specific error we got when we tried to get Field, including all its glorious boundary conditions, to compile on the GPU:

Entry function 'ptxcall_calculate_Gu__66' uses too much parameter space (0x16c8 bytes, 0x1100 max)

dredged up from #746 . Some workarounds were suggested there, but I think our solution is actually better / simpler (adapt fields by unwrapping the underlying data and throwing away boundary conditions, rather than wrestling to get all the field info onto the poor GPU).

@francispoulin
Copy link
Collaborator

Does someone have a minimal example that reproduces this error? I'm just curious to learn more about the problem.

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

using Oceananigans
using Oceananigans.AbstractOperations
using Oceananigans.Fields

grid = RegularCartesianGrid(size=(1, 1, 1), extent=(1, 1, 1)) 

model = IncompressibleModel(architecture=GPU(), grid=grid)

u, v, w = model.velocities

Σˣˣ = ∂x(u)
Σʸʸ = ∂y(v)
Σᶻᶻ = ∂z(w)

Σˣʸ = (∂y(u) + ∂x(v)) / 2 
Σˣᶻ = (∂z(u) + ∂x(w)) / 2 
Σʸᶻ = (∂z(v) + ∂y(w)) / 2 

ϵ = model.closure.ν * 2 * (Σˣˣ^2 + Σʸʸ^2 + Σᶻᶻ^2 + 2 * (Σˣʸ^2 + Σˣᶻ^2 + Σʸᶻ^2))

ϵ_field = ComputedField(ϵ)

compute!(ϵ_field)

produces

julia> include("complex_output.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
[ Info: Oceananigans will use 24 threads
ERROR: LoadError: InvalidIRError: compiling kernel gpu__compute!(Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},KernelAbstractions.NDIteration.DynamicCheck,Nothing,Nothing,KernelAbstractions.NDIteration.NDRange{3,KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},Nothing,Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks}, typeof(Oceananigans.Fields.gpu__compute!), OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}}, Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(*),Float64,Oceananigans.AbstractOperations.MultiaryOperation{Cell,Cell,Cell,4,typeof(+),Tuple{Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂xᶜᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂yᵃᶜᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂zᵃᵃᶜ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(*),Int64,Oceananigans.AbstractOperations.MultiaryOperation{Face,Face,Cell,3,typeof(+),Tuple{Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(+),Oceananigans.AbstractOperations.Derivative{Face,Face,Cell,typeof(Oceananigans.Operators.∂yᵃᶠᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Face,Face,Cell,typeof(Oceananigans.Operators.∂xᶠᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(+),Oceananigans.AbstractOperations.Derivative{Face,Cell,Face,typeof(Oceananigans.Operators.∂zᵃᵃᶠ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Face,Cell,Face,typeof(Oceananigans.Operators.∂xᶠᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(+),Oceananigans.AbstractOperations.Derivative{Cell,Face,Face,typeof(Oceananigans.Operators.∂zᵃᵃᶠ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Cell,Face,Face,typeof(Oceananigans.Operators.∂yᵃᶠᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}},Tuple{typeof(identity),typeof(Oceananigans.Operators.ℑyzᵃᶠᶜ),typeof(Oceananigans.Operators.ℑxzᶠᵃᶜ)},RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}},Tuple{typeof(identity),typeof(identity),typeof(identity),typeof(Oceananigans.Operators.ℑxyᶜᶜᵃ)},RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub(overdub_context::Cassette.Context, overdub_arguments...) in Cassette at /home/glwagner/.julia/packages/Cassette/158rp/src/overdub.jl:586)
Stacktrace:
 [1] getindex at /home/glwagner/.julia/packages/Oceananigans/cLFd3/src/AbstractOperations/binary_operations.jl:34
 [2] macro expansion at /home/glwagner/.julia/packages/Oceananigans/cLFd3/src/Fields/computed_field.jl:86
 [3] gpu__compute! at /home/glwagner/.julia/packages/KernelAbstractions/jAutM/src/macros.jl:80
 [4] overdub at /home/glwagner/.julia/packages/Cassette/158rp/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] macro expansion at /home/glwagner/.julia/packages/Oceananigans/cLFd3/src/Fields/computed_field.jl:86
 [2] gpu__compute! at /home/glwagner/.julia/packages/KernelAbstractions/jAutM/src/macros.jl:80
 [3] overdub at /home/glwagner/.julia/packages/Cassette/158rp/src/overdub.jl:0
Stacktrace:
 [1] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at /home/glwagner/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:123
 [2] macro expansion at /home/glwagner/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:239 [inlined]
 [3] macro expansion at /home/glwagner/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/glwagner/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/glwagner/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /home/glwagner/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/glwagner/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{typeof(Cassette.overdub),Tuple{Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},KernelAbstractions.NDIteration.DynamicCheck,Nothing,Nothing,KernelAbstractions.NDIteration.NDRange{3,KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},Nothing,Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(Oceananigans.Fields.gpu__compute!),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(*),Float64,Oceananigans.AbstractOperations.MultiaryOperation{Cell,Cell,Cell,4,typeof(+),Tuple{Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂xᶜᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂yᵃᶜᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂zᵃᵃᶜ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(*),Int64,Oceananigans.AbstractOperations.MultiaryOperation{Face,Face,Cell,3,typeof(+),Tuple{Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(+),Oceananigans.AbstractOperations.Derivative{Face,Face,Cell,typeof(Oceananigans.Operators.∂yᵃᶠᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Face,Face,Cell,typeof(Oceananigans.Operators.∂xᶠᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(+),Oceananigans.AbstractOperations.Derivative{Face,Cell,Face,typeof(Oceananigans.Operators.∂zᵃᵃᶠ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Face,Cell,Face,typeof(Oceananigans.Operators.∂xᶠᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(+),Oceananigans.AbstractOperations.Derivative{Cell,Face,Face,typeof(Oceananigans.Operators.∂zᵃᵃᶠ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Cell,Face,Face,typeof(Oceananigans.Operators.∂yᵃᶠᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}},Tuple{typeof(identity),typeof(Oceananigans.Operators.ℑyzᵃᶠᶜ),typeof(Oceananigans.Operators.ℑxzᶠᵃᶜ)},RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}},Tuple{typeof(identity),typeof(identity),typeof(identity),typeof(Oceananigans.Operators.ℑxyᶜᶜᵃ)},RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/glwagner/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [9] gpu__compute! at ./array.jl:0 [inlined]
 [10] cufunction(::typeof(Cassette.overdub), ::Type{Tuple{Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},KernelAbstractions.NDIteration.DynamicCheck,Nothing,Nothing,KernelAbstractions.NDIteration.NDRange{3,KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},Nothing,Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(Oceananigans.Fields.gpu__compute!),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(*),Float64,Oceananigans.AbstractOperations.MultiaryOperation{Cell,Cell,Cell,4,typeof(+),Tuple{Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂xᶜᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂yᵃᶜᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Cell,Cell,typeof(^),Oceananigans.AbstractOperations.Derivative{Cell,Cell,Cell,typeof(Oceananigans.Operators.∂zᵃᵃᶜ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(*),Int64,Oceananigans.AbstractOperations.MultiaryOperation{Face,Face,Cell,3,typeof(+),Tuple{Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Face,Face,Cell,typeof(+),Oceananigans.AbstractOperations.Derivative{Face,Face,Cell,typeof(Oceananigans.Operators.∂yᵃᶠᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Face,Face,Cell,typeof(Oceananigans.Operators.∂xᶠᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Face,Cell,Face,typeof(+),Oceananigans.AbstractOperations.Derivative{Face,Cell,Face,typeof(Oceananigans.Operators.∂zᵃᵃᶠ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Face,Cell,Face,typeof(Oceananigans.Operators.∂xᶠᵃᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(^),Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(/),Oceananigans.AbstractOperations.BinaryOperation{Cell,Face,Face,typeof(+),Oceananigans.AbstractOperations.Derivative{Cell,Face,Face,typeof(Oceananigans.Operators.∂zᵃᵃᶠ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Oceananigans.AbstractOperations.Derivative{Cell,Face,Face,typeof(Oceananigans.Operators.∂yᵃᶠᵃ),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},Int64,typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}},Tuple{typeof(identity),typeof(Oceananigans.Operators.ℑyzᵃᶠᶜ),typeof(Oceananigans.Operators.ℑxzᶠᵃᶜ)},RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}},Tuple{typeof(identity),typeof(identity),typeof(identity),typeof(Oceananigans.Operators.ℑxyᶜᶜᵃ)},RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularCartesianGrid{Float64,Periodic,Periodic,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}}}; name::String, kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/glwagner/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [11] macro expansion at /home/glwagner/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:109 [inlined]
 [12] (::KernelAbstractions.Kernel{KernelAbstractions.CUDADevice,KernelAbstractions.NDIteration.StaticSize{(1, 1)},KernelAbstractions.NDIteration.StaticSize{(1, 1, 1)},typeof(Oceananigans.Fields.gpu__compute!)})(::OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, ::Vararg{Any,N} where N; ndrange::Nothing, dependencies::KernelAbstractions.CudaEvent, workgroupsize::Nothing, progress::Function) at /home/glwagner/.julia/packages/KernelAbstractions/jAutM/src/backends/cuda.jl:185
 [13] compute!(::ComputedField{Cell,Cell,Cell

@ali-ramadhan
Copy link
Member

Hmmm, looks like a Cassette issue. Maybe it's just being overly sensitive to the contents of the kernel (or kernel arguments) like was the case with #828?

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

A "dynamic function invocation" means that the compiler thinks a function is being called whose scope can change "dynamically" (I think). This is the error one gets when a function depends on a global variable that is not const (for example). In this case, the error tells us that the types of the objects involved in calling getindex defined on BinaryOperation:

@inline Base.getindex::BinaryOperation, i, j, k) = β.▶op(i, j, k, β.grid, β.op, β.▶a, β.▶b, β.a, β.b)

are not correctly inferred.

The way getindex comes into play is in the kernel function _compute! that evaluates the AbstractOperation:

@kernel function _compute!(data, operand)
i, j, k = @index(Global, NTuple)
@inbounds data[i, j, k] = operand[i, j, k]
end

calling getindex(operand, i, j, k) (or equivalently operand[i, j, k]) triggers recursive getindex calls that traverse the AbstractOperation tree. It seems that when the tree is too large, this traversal cannot be entirely compiled.

Perhaps there are tricks we might use to help the compiler parse this kind of operation, like putting some type annotations / hints into getindex(b::BinaryOperation, ...). Not sure. Another possibility is to figure out how to simplify the object BinaryOperation, MultiaryOperation, so that the compiler is less stressed trying to compile them... ?

@ali-ramadhan
Copy link
Member

Hmmm, if it's indeed incapable of compiling the entire operation might be helpful to be able to find out exactly at which size it fails.

I'm still thinking that it might be something to be fixed/improved in CUDA.jl.

I can try to do some test as I update #870.

@francispoulin
Copy link
Collaborator

I tried the code above and do get the same error, so I can confirm that.

I tried to trim it down and found that the following, slightly more minimal example, produces the same error. It seems that squaring and multiplying togther is too much. But if you remove one or the other, it seems to work fine.

using Oceananigans
using Oceananigans.AbstractOperations
using Oceananigans.Fields

grid = RegularCartesianGrid(size=(1, 1, 1), extent=(1, 1, 1)) 

model = IncompressibleModel(architecture=GPU(), grid=grid)

u, v, w = model.velocities

ϵ = 2 * (∂x(u)^2)

ϵ_field = ComputedField(ϵ)

compute!(ϵ_field)

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

Hmm that's good to know the limit of complexity. I can confirm that behavior, eg:

julia> compute!(ComputedField(2 * Σˣˣ))

julia> compute!(ComputedField(Σˣˣ^2))

julia> compute!(ComputedField(2 * Σˣˣ^2))
ERROR: InvalidIRError: compiling kernel gpu__compute!...

Here's the tree for 2 * Σˣˣ^2:

julia> 2 * Σˣˣ^2
BinaryOperation at (Cell, Cell, Cell)
├── grid: RegularCartesianGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    * at (Cell, Cell, Cell) via identity
    ├── 2
    └── ^ at (Cell, Cell, Cell) via identity
        ├── ∂xᶜᵃᵃ at (Cell, Cell, Cell) via identity
        │   └── Field located at (Face, Cell, Cell)
        └── 2

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

Also

julia> compute!(ComputedField(Σˣˣ^2 + Σʸʸ^2))
ERROR: InvalidIRError: compiling kernel gpu__compute!...

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

It's not strictly a nesting issue, since:

julia> compute!(ComputedField(∂x(∂x(u))))

julia> compute!(ComputedField(∂x(∂x(∂x(u)))))

julia> compute!(ComputedField(∂x(∂x(∂x(∂x(u))))))

is fine.

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

Whoa. Here's a hint.

julia> compute!(ComputedField(u^2 + v^2 + w^2))

julia> compute!(ComputedField(u^2 + v^2))
ERROR: InvalidIRError: compiling kernel gpu__compute!

!!

Note:

julia> u^2 + v^2
BinaryOperation at (Face, Cell, Cell)
├── grid: RegularCartesianGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    + at (Face, Cell, Cell) via identity
    ├── ^ at (Face, Cell, Cell) via identity
    │   ├── Field located at (Face, Cell, Cell)
    │   └── 2
    └── ^ at (Cell, Face, Cell) via identity
        ├── Field located at (Cell, Face, Cell)
        └── 2

julia> u^2 + v^2 + w^2
MultiaryOperation at (Face, Cell, Cell)
├── grid: RegularCartesianGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    + at (Face, Cell, Cell)
    ├── ^ at (Face, Cell, Cell) via identity
    │   ├── Field located at (Face, Cell, Cell)
    │   └── 2
    ├── ^ at (Cell, Face, Cell) via identity
    │   ├── Field located at (Cell, Face, Cell)
    │   └── 2
    └── ^ at (Cell, Cell, Face) via identity
        ├── Field located at (Cell, Cell, Face)
        └── 2

Surprisingly, MultiaryOperations are better behaved.

@ali-ramadhan
Copy link
Member

ali-ramadhan commented Dec 4, 2020

I can confirm that I was able to get away with √(u^2 + v^2 + w^2) on GPUs in one of the particle tracking tests (PR #1091).

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

While getindex for a MultiaryOperation is complicated (unrolls a loop!):

@inline Base.getindex::MultiaryOperation{X, Y, Z, N}, i, j, k) where {X, Y, Z, N} =
Π.op(ntuple-> Π.▶[γ](i, j, k, Π.grid, Π.args[γ]), Val(N))...)

the MultiaryOperation object is simpler than a BinaryOperation because it stores fewer interpolation functions (we only support the case that every member of a multiary operation is first interpolated to a common location before being op'd on):

struct MultiaryOperation{X, Y, Z, N, O, A, I, G} <: AbstractOperation{X, Y, Z, G}
op :: O
args :: A
:: I
grid :: G
function MultiaryOperation{X, Y, Z}(op, args, ▶, grid) where {X, Y, Z}
return new{X, Y, Z, length(args), typeof(op), typeof(args), typeof(▶), typeof(grid)}(op, args, ▶, grid)
end
end

compare with

struct BinaryOperation{X, Y, Z, O, A, B, IA, IB, IΩ, G} <: AbstractOperation{X, Y, Z, G}
op :: O
a :: A
b :: B
▶a :: IA
▶b :: IB
▶op :: IΩ
grid :: G
"""
BinaryOperation{X, Y, Z}(op, a, b, ▶a, ▶b, ▶op, grid)
Returns an abstract representation of the binary operation `op(▶a(a), ▶b(b))`,
followed by interpolation by `▶op` to `(X, Y, Z)`, where `▶a` and `▶b` interpolate
`a` and `b` to a common location.
"""
function BinaryOperation{X, Y, Z}(op, a, b, ▶a, ▶b, ▶op, grid) where {X, Y, Z}
any((X, Y, Z) .=== Nothing) && throw(ArgumentError("Nothing locations are invalid! " *
"Cannot construct BinaryOperation at ($X, $Y, $Z)."))
return new{X, Y, Z, typeof(op), typeof(a), typeof(b), typeof(▶a), typeof(▶b),
typeof(▶op), typeof(grid)}(op, a, b, ▶a, ▶b, ▶op, grid)
end
end

@glwagner
Copy link
Member

glwagner commented Dec 4, 2020

On problem 1: meeting with @vchuravy we think there is a "recursive call cycle with levels of indirection". In other words, calling getindex on a BinaryOperation:

@inline Base.getindex::BinaryOperation, i, j, k) = β.▶op(i, j, k, β.grid, β.op, β.▶a, β.▶b, β.a, β.b)

calls β.▶op = identity:

@inline identity(i, j, k, grid, F::TF, args...) where TF<:Function = F(i, j, k, grid, args...)

which calls op:

@inline $op(i, j, k, grid::AbstractGrid, ▶a, ▶b, a, b) =
@inbounds $op(▶a(i, j, k, grid, a), ▶b(i, j, k, grid, b))

which may invoke another call to either ▶a=identity or ▶b=identity (which might subsequently go back to getindex...) Due to some aspect of this process, the compiler throws up its hands because of something like "you wouldn't want unbound recursion in your compiler, right?"

A possible solution, which is also a hilarious hack, is to define multiple identity functions and use an internal counter to use the different yet identical copies of this function when constructing BinaryOperations. Then we wouldn't be recursing over identity (from the compiler's point of view), since we'd be calling identity1, identity2, etc. If we also need different flavors of BinaryOperation, we can do that too...

@vchuravy
Copy link
Collaborator

vchuravy commented Dec 4, 2020

Alternatively we could think about linearizing the recursion before we go off and compile it.

@tomchor
Copy link
Collaborator Author

tomchor commented Dec 16, 2020

I'm not sure if it matters too much, but I'm going over some examples posted here and I can't reproduce some of the examples by @glwagner. I can compute some stuff that he can't, while the opposite is true for others

julia> compute!(ComputedField(∂x(∂x(u)))) # works as expected

julia> compute!(ComputedField(∂x(∂x(∂x(u))))) # works as expected

julia> compute!(ComputedField(∂x(∂x(∂x(∂x(u)))))) # doesn't work (not expected)

julia> compute!(ComputedField(u^2 + v^2)) # works (unexpected)

julia> compute!(ComputedField(u^2 + v^2 + w^2)) # works (expected)

Hopefully that'll give you guys more hints on what's going on. I'm running this on a Tesla V100 at NCAR's Casper cluster btw.

  • Julia version 1.5.2
  • Oceananigans version 0.45.2.
  • For some reason CUDA doesn't appear when I type ]st, but I think it's 2.2.1 (sorry for the unfamiliarity with Julia)

EDIT:

The results above were obtained initializing a minimum model as posted above. I tried the same examples I just posted initializing a model differently and many more things are failing now (for example, compute!(ComputedField(u^2 + v^2)) fails).

@glwagner
Copy link
Member

Can you please post the errors that you obtain in each case?

@tomchor
Copy link
Collaborator Author

tomchor commented Dec 16, 2020

Hmm, I can't reproduce the same results exactly. All I did before was honestly open a Julia session and just paste the examples you guys posted one by one.

Here's a pastebin with my whole session testing the commands I got in the previous post. (The comments of course don't reflect the outcome anymore.)

@glwagner
Copy link
Member

Awesome, thank you, that's helpful. Stochastic errors are troubling.

@tomchor
Copy link
Collaborator Author

tomchor commented Feb 26, 2021

I'm not sure if people are still thinking about this, but I may have some relevant information (Good news!) that I'd appreciate some feedback on. Consider the following MWE:

using Oceananigans
using Oceananigans.Utils
using Oceananigans.Fields

Lx = 150; Ly = 6000; Lz = 80
topology = (Periodic, Bounded, Bounded)
grid = RegularRectilinearGrid(size=(1, 512, 8), x=(0, Lx), y=(0, Ly), z=(-Lz, 0),
                              topology=(Periodic, Bounded, Bounded))


model = IncompressibleModel(architecture = GPU(),
                            grid = grid,
                            )

w_ic(x, y, z) = 0.01*y
v_ic(x, y, z) = 0.01*x
set!(model, w=w_ic, v=v_ic)


import Oceananigans.Fields: ComputedField, KernelComputedField
using Oceananigans.AbstractOperations: @at, ∂x, ∂y, ∂z
using Oceananigans.Grids: Center, Face


u, v, w = model.velocities

function naive_calc()
    p = sum(model.pressures)
    wp = @at (Center, Center, Face) w*p
    dwpdz = (1/1024) * ∂z(wp)
    println(dwpdz)
    return ComputedField(dwpdz)
end

function nested_calc()
    p = ComputedField(sum(model.pressures))
    wp = ComputedField(@at (Center, Center, Face) w*p)
    dwpdz = (1/1024) * ∂z(wp)
    println(dwpdz)
    return ComputedField(dwpdz)
end

I can include this script in the REPL after which I get the following results. First, when trying to compute the naive calculation using a GPU I get an error, which is expected at this point:

julia> dwpdz_naive = naive_calc()
BinaryOperation at (Center, Center, Center)
├── grid: RegularRectilinearGrid{Float64, Periodic, Bounded, Bounded}(Nx=1, Ny=512, Nz=8)
│   └── domain: x  [0.0, 150.0], y  [0.0, 6000.0], z  [-80.0, 0.0]
└── tree: 
    * at (Center, Center, Center) via identity
    ├── 0.0009765625
    └── ∂zᵃᵃᶜ at (Center, Center, Center) via identity
        └── * at (Center, Center, Face) via identity
            ├── Field located at (Center, Center, Face)
            └── + at (Center, Center, Center) via identity
                ├── Field located at (Center, Center, Center)
                └── Field located at (Center, Center, Center)
ComputedField located at (Center, Center, Center) of BinaryOperation at (Center, Center, Center)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (3, 514, 10)
├── grid: RegularRectilinearGrid{Float64, Periodic, Bounded, Bounded}(Nx=1, Ny=512, Nz=8)
├── operand: BinaryOperation at (Center, Center, Center)
└── status: time=0.0


julia> compute!(dwpdz_naive)
ERROR: InvalidIRError: compiling kernel gpu__compute!(Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1, 512, 8)},KernelAbstractions.NDIteration.DynamicCheck,Nothing,Nothing,KernelAbstractions.NDIteration.NDRange{3,KernelAbstractions.NDIteration.StaticSize{(1, 2, 8)},KernelAbstractions.NDIteration.StaticSize{(1, 256, 1)},Nothing,Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks}, typeof(Oceananigans.Fields.gpu__compute!), OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}}, Oceananigans.AbstractOperations.BinaryOperation{Center,Center,Center,typeof(*),Float64,Oceananigans.AbstractOperations.Derivative{Center,Center,Center,typeof(Oceananigans.Operators.∂zᵃᵃᶜ),Oceananigans.AbstractOperations.BinaryOperation{Center,Center,Face,typeof(*),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},Oceananigans.AbstractOperations.BinaryOperation{Center,Center,Center,typeof(+),OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},OffsetArrays.OffsetArray{Float64,3,CUDA.CuDeviceArray{Float64,3,1}},typeof(identity),typeof(identity),typeof(identity),RegularRectilinearGrid{Float64,Periodic,Bounded,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(Oceananigans.Operators.ℑzᵃᵃᶠ),typeof(identity),RegularRectilinearGrid{Float64,Periodic,Bounded,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),RegularRectilinearGrid{Float64,Periodic,Bounded,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}},typeof(identity),typeof(identity),typeof(identity),RegularRectilinearGrid{Float64,Periodic,Bounded,Bounded,OffsetArrays.OffsetArray{Float64,1,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub(overdub_context::Cassette.Context, overdub_arguments...) in Cassette at /glade/u/home/tomasc/.julia/packages/Cassette/Wjztv/src/overdub.jl:595)
Stacktrace:
 [1] getindex at /glade/u/home/tomasc/.julia/packages/Oceananigans/WSSHu/src/AbstractOperations/binary_operations.jl:34
 [2] macro expansion at /glade/u/home/tomasc/.julia/packages/Oceananigans/WSSHu/src/Fields/computed_field.jl:114
 [3] gpu__compute! at /glade/u/home/tomasc/.julia/packages/KernelAbstractions/mKsXc/src/macros.jl:80
 [4] overdub at /glade/u/home/tomasc/.julia/packages/Cassette/Wjztv/src/overdub.jl:0
# I truncated the huge error message here

However, the nested calculation appears to work!:

julia> dwpdz_nested = nested_calc()
BinaryOperation at (Center, Center, Center)
├── grid: RegularRectilinearGrid{Float64, Periodic, Bounded, Bounded}(Nx=1, Ny=512, Nz=8)
│   └── domain: x  [0.0, 150.0], y  [0.0, 6000.0], z  [-80.0, 0.0]
└── tree: 
    * at (Center, Center, Center) via identity
    ├── 0.0009765625
    └── ∂zᵃᵃᶜ at (Center, Center, Center) via identity
        └── ComputedField located at (Center, Center, Face) of BinaryOperation at (Center, Center, Face)
ComputedField located at (Center, Center, Center) of BinaryOperation at (Center, Center, Center)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (3, 514, 10)
├── grid: RegularRectilinearGrid{Float64, Periodic, Bounded, Bounded}(Nx=1, Ny=512, Nz=8)
├── operand: BinaryOperation at (Center, Center, Center)
└── status: time=0.0


julia> compute!(dwpdz_nested)

julia> using Adapt

julia> adapt(Array, interior(dwpdz_nested))
1×512×8 view(OffsetArray(::Array{Float64,3}, 0:2, 0:513, 0:9), [1], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10    503, 504, 505, 506, 507, 508, 509, 510, 511, 512], [1, 2, 3, 4, 5, 6, 7, 8]) with eltype Float64:
[:, :, 1] =
 0.0262775  0.014948  0.00902569  0.00559831  0.00351427  0.00221792  0.00140318  0.000888719    0.000263472  0.000420913  0.000674128  0.00108572  0.00177041  0.00296775  0.00528696

[:, :, 2] =
 0.0156731  0.0110235  0.00720451  0.00461103  0.00293129  0.00185902  0.00117792  0.00074609    0.000915116  0.00144732  0.00229007  0.00362638  0.0057484  0.00911035  0.0143023

[:, :, 3] =
 0.00844536  0.00652392  0.00451369  0.00297425  0.00191676  0.00122263  0.000776137  0.000491594    0.00126331  0.00199147  0.00313408  0.00491472  0.00764863  0.0117137  0.0173591

[:, :, 4] =
 0.00263363  0.0021138  0.00150181  0.00100316  0.000649705  0.000414395  0.000262323  0.00016544    0.00117766  0.00185426  0.00291374  0.00455993  0.00707827  0.0108175  0.0160812

[:, :, 5] =
 -0.00276553  -0.00220933  -0.00157462  -0.00105926  -0.000692536  -0.000446557  -0.00028606    0.00062928  0.000993307  0.0015715  0.00249751  0.00400184  0.0064944  0.0106821

[:, :, 6] =
 -0.00852329  -0.00657626  -0.00455259  -0.00300401  -0.00193944  -0.00123966  -0.000788713    -0.000470821  -0.000722127  -0.00106354  -0.00141766  -0.00138651  0.000390487

[:, :, 7] =
 -0.0156445  -0.0109965  -0.0071831  -0.00459454  -0.00291877  -0.00184967  -0.00117105    -0.00143385  -0.00226136  -0.00355998  -0.00557919  -0.00863921  -0.0129097  -0.0170153

[:, :, 8] =
 -0.0260963  -0.0148272  -0.00893539  -0.00552894  -0.00346128  -0.00217808  -0.00137374    -0.00251345  -0.00397508  -0.00630142  -0.0100415  -0.0161907  -0.0268075  -0.0470868

(Btw, the example above obviously works fine with architecture=CPU(), where I can check that dwpdz_nested actually produces the correct result.)

I even ran some other tests with even increased complexity. And they all appear to work on GPUs. For example this one:

function crazy_calc()
    p = ComputedField(sum(model.pressures))
    wp = ComputedField(@at (Center, Center, Face) w*p)
    dwpdz = (1/1024) * ∂z(wp)
    println(dwpdz)
    dwpdz = ComputedField(dwpdz)
    dwpdz2 = ComputedField(dwpdz^2)
    return ComputedField(dwpdz2+dwpdz)
end

I'd appreciate if some of you could try to reproduce this result on other machines. I ran this in one of NCAR's Tesla V100s. If you can reproduce this behavior, then this kinda makes KernelComputedFields obsolete, no?

@glwagner
Copy link
Member

I think there's still a use for KernelComputedField! It's more expensive to construct a calculation by nesting ComputedFields inside it. The difference is in the number of kernel launches: when expressing a calculation with a single kernel, the computation is computed with a single loop over all grid points. When expressing a calculation using intermediate ComputedField, then first we launch kernels to calculate the intermediate ComputedFields, and launch a final kernel to compute the quantity of interest. It can also be more memory intensive since the intermediate ComputedFields need to be stored.

For some applications, the "optimization" of avoiding intermediate kernel launches / calculations may be unimportant (for example, if plenty of memory is available and computations are made very rarely). I think nesting ComputedFields is a nice solution that avoids having to hand-write kernels for those cases. But sometimes I do think that users want to optimize diagnostics calculations.

There are also some other nice applications of KernelComputedField; for example I think it would be nice to make the LES eddy diffusivities into KernelComputedField (which are computed every tendency evaluation so performance is crucial), and perhaps some other auxiliary variables.

I think it would be fun to try the compiler hack-around that I suggested in my comment above (defining multiple identity functions and using an internal counter to scroll through them during compilation). It's just a matter of finding the time to do it.

@tomchor
Copy link
Collaborator Author

tomchor commented Apr 17, 2021

@glwagner Should we keep this open though, since we still haven't solved all the issues? (I know it's unlikely we'll ever solve all of them, but we might be able to make more progress).

@glwagner
Copy link
Member

I'm fine with that. It makes sense if we think there's Oceananigans.jl development we might to do to solve the issue. I'm not entirely sure that's true right now. But here's where we stand on master currently:

julia> using Oceananigans
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
[ Info: Oceananigans will use 24 threads

julia> grid = RegularRectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1)); model = IncompressibleModel(architecture=GPU(), grid=grid)
IncompressibleModel{GPU, Float64}(time = 0 seconds, iteration = 0) 
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
├── tracers: (:T, :S)
├── closure: IsotropicDiffusivity{Float64,NamedTuple{(:T, :S),Tuple{Float64,Float64}}}
├── buoyancy: SeawaterBuoyancy{Float64,LinearEquationOfState{Float64},Nothing,Nothing}
└── coriolis: Nothing

julia> u, v, w = model.velocities
(u = Field located at (Face, Center, Center)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (3, 3, 3)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
└── boundary conditions: x=(west=Periodic, east=Periodic), y=(south=Periodic, north=Periodic), z=(bottom=ZeroFlux, top=ZeroFlux), v = Field located at (Center, Face, Center)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (3, 3, 3)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
└── boundary conditions: x=(west=Periodic, east=Periodic), y=(south=Periodic, north=Periodic), z=(bottom=ZeroFlux, top=ZeroFlux), w = Field located at (Center, Center, Face)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (3, 3, 4)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
└── boundary conditions: x=(west=Periodic, east=Periodic), y=(south=Periodic, north=Periodic), z=(bottom=NormalFlow, top=NormalFlow))

julia> compute!(ComputedField(u + v - w)) # Now this works

julia> compute!(ComputedField(∂x(u)^2 + ∂y(v)^2 + ∂z(w)^2 + ∂x(w)^2)) # still doesnt work
ERROR: CUDA error: device kernel image is invalid (code 200, ERROR_INVALID_IMAGE)

julia> compute!(ComputedField(∂x(u)^2 + ∂y(v)^2 + ∂z(w)^2 + ∂x(w)^2 + ∂y(w)^2)) # still doesn't work
ERROR: CUDA error: a PTX JIT compilation failed (code 218, ERROR_INVALID_PTX)
ptxas application ptx input, line 802; error   : Entry function '_Z19julia_gpu__compute_7ContextI14__CUDACtx_Name16CompilerMetadataI10StaticSizeI9_1__1__1_E12DynamicCheckvv7NDRangeILi3ES2_I9_1__1__1_ES2_I9_1__1__1_EvvEEv14__PassType_253v12DisableHooksE14_gpu__compute_11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17MultiaryOperationI6CenterS12_S12_Li5E2__5TupleI15BinaryOperationIS12_S12_S12_S13_10DerivativeIS12_S12_S12_6__x___S8_IS9_Li3ES10_IS9_Li3ELi1EEE10_identity222RegularRectilinearGridIS9_8PeriodicS20_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES23_IS9_EEEEE5Int6410_identity310_identity410_identity5S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES15_IS12_S12_S12_S13_S16_IS12_S12_S12_6__y___S8_IS9_Li3ES10_IS9_Li3ELi1EEE10_identity1S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES24_S18_S25_S26_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES15_IS12_S12_S12_S13_S16_IS12_S12_S12_6__z___S8_IS9_Li3ES10_IS9_Li3ELi1EEES27_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES24_S29_S18_S25_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES15_I4FaceS12_S31_S13_S16_IS31_S12_S31_S17_S8_IS9_Li3ES10_IS9_Li3ELi1EEES26_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES24_S27_S29_S18_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES15_IS12_S31_S31_S13_S16_IS12_S31_S31_S28_S8_IS9_Li3ES10_IS9_Li3ELi1EEES25_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEES24_S26_S27_S29_S19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEEES14_IS18_S25_S26_7__xz___7__yz___ES19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEEE' uses too much parameter space (0x1408 bytes, 0x1100 max).
ptxas fatal   : Ptx assembly aborted due to errors

We haven't discussed the problem with AveragedField on this thread. I think we need a new issue for that (can't remember if there already is one). It actually looks like some, but not all of the issue there have been solved. A minimal example is:

julia> U = AveragedField(u, dims=(1, 2))
AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Face, Center, Center)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (1, 1, 3)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
├── dims: (1, 2)
├── operand: Field located at (Face, Center, Center)
└── status: time=0.0

julia> compute!(ComputedField(u - U))

julia> compute!(ComputedField((u - U)^2))

julia> V = AveragedField(v, dims=(1, 2))
AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Center, Face, Center)
├── data: OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}, size: (1, 1, 3)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
├── dims: (1, 2)
├── operand: Field located at (Center, Face, Center)
└── status: time=0.0

julia> tke = 1/2 * ((u - U)^2 + (v - V)^2 + w^2)
BinaryOperation at (Face, Center, Center)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    * at (Face, Center, Center) via identity
    ├── 0.5
    └── + at (Face, Center, Center)
        ├── ^ at (Face, Center, Center) via identity
        │   ├── - at (Face, Center, Center) via identity
        │   │   ├── Field located at (Face, Center, Center)
        │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Face, Center, Center)
        │   └── 2
        ├── ^ at (Center, Face, Center) via identity
        │   ├── - at (Center, Face, Center) via identity
        │   │   ├── Field located at (Center, Face, Center)
        │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Center, Face, Center)
        │   └── 2
        └── ^ at (Center, Center, Face) via identity
            ├── Field located at (Center, Center, Face)
            └── 2

julia> compute!(ComputedField(tke))
ERROR: InvalidIRError: compiling kernel gpu__compute!

Interestingly, this works:

julia> tke = ((u - U)^2 + (v - V)^2 + w^2) / 2
BinaryOperation at (Face, Center, Center)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    / at (Face, Center, Center) via identity
    ├── + at (Face, Center, Center)
    │   ├── ^ at (Face, Center, Center) via identity
    │   │   ├── - at (Face, Center, Center) via identity
    │   │   │   ├── Field located at (Face, Center, Center)
    │   │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Face, Center, Center)
    │   │   └── 2
    │   ├── ^ at (Center, Face, Center) via identity
    │   │   ├── - at (Center, Face, Center) via identity
    │   │   │   ├── Field located at (Center, Face, Center)
    │   │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Center, Face, Center)
    │   │   └── 2
    │   └── ^ at (Center, Center, Face) via identity
        │   ├── Field located at (Center, Center, Face)
        │   └── 2
    └── 2

julia> compute!(ComputedField(tke))

So I guess we are almost there with AveragedField. I think the order or arguments could actually matter for compilation due to heuristics that are invoked during compilation... ?

We might update the test for computations with AveragedField because they might pass with a different ordering. We could also change how AbstractOperations are built and put Number as the second argument regardless of how the expression is written.

@glwagner
Copy link
Member

glwagner commented Apr 17, 2021

This doesn't work sadly:

julia> tke = @at (Center, Center, Center) ((u - U)^2 + (v - V)^2 + w^2) / 2
BinaryOperation at (Center, Center, Center)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    / at (Center, Center, Center) via identity
    ├── + at (Center, Center, Center)
    │   ├── ^ at (Center, Center, Center) via identity
    │   │   ├── - at (Center, Center, Center) via identity
    │   │   │   ├── Field located at (Face, Center, Center)
    │   │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Face, Center, Center)
    │   │   └── 2
    │   ├── ^ at (Center, Center, Center) via identity
    │   │   ├── - at (Center, Center, Center) via identity
    │   │   │   ├── Field located at (Center, Face, Center)
    │   │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Center, Face, Center)
    │   │   └── 2
    │   └── ^ at (Center, Center, Center) via ℑzᵃᵃᶜ
        │   ├── Field located at (Center, Center, Face)
        │   └── 2
    └── 2

julia> compute!(ComputedField(tke))
ERROR: InvalidIRError: compiling kernel gpu__compute!

There's something fishy there, worth looking into. We can change the topic of this issue to focus on AveragedField or open a new one. Happy to do either.

EDIT: I see the issue. While we ensure that binary operations always occur at the common location for fields, we haven't ensured that operations between fields and ReducedFields occur at the location of the 3D field. We can fix this by defining a few more functions.

@tomchor
Copy link
Collaborator Author

tomchor commented Apr 17, 2021

There's something fishy there, worth looking into. We can change the topic of this issue to focus on AveragedField or open a new one. Happy to do either.

Thanks for all the work! I think I'll reopen this thread then since the issue is not resolved and we may still be able to make progress. Also there's a chance that someone will see this thread and have some good ideas now that we specifically link this thread in the GPU docs.

EDIT: I see the issue. While we ensure that binary operations always occur at the common location for fields, we haven't ensured that operations between fields and ReducedFields occur at the location of the 3D field. We can fix this by defining a few more functions.

That's awesome! I vote for another separate issue for the averaged fields in the name of organization. But I'll let you decide that one since you already have insights into how to solve it.

@tomchor tomchor reopened this Apr 17, 2021
@tomchor
Copy link
Collaborator Author

tomchor commented Apr 20, 2021

@glwagner Just FYI, some of the things that did not work for you, actually worked for me. Most notably:

julia> tke = @at (Center, Center, Center) ((u - U)^2 + (v - V)^2 + w^2) / 2
BinaryOperation at (Center, Center, Center)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
│   └── domain: x  [0.0, 1.0], y  [0.0, 1.0], z  [-1.0, 0.0]
└── tree: 
    / at (Center, Center, Center) via identity
    ├── + at (Center, Center, Center)
    │   ├── ^ at (Center, Center, Center) via identity
    │   │   ├── - at (Center, Center, Center) via identity
    │   │   │   ├── Field located at (Face, Center, Center)
    │   │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Face, Center, Center)
    │   │   └── 2
    │   ├── ^ at (Center, Center, Center) via identity
    │   │   ├── - at (Center, Center, Center) via identity
    │   │   │   ├── Field located at (Center, Face, Center)
    │   │   │   └── AveragedField over dims=(1, 2) located at (, , Center) of Field located at (Center, Face, Center)
    │   │   └── 2
    │   └── ^ at (Center, Center, Center) via ℑzᵃᵃᶜ
        │   ├── Field located at (Center, Center, Face)
        │   └── 2
    └── 2

julia> compute!(tke)

julia> 

So it appears to be machine-dependent at least to some extent.

@glwagner
Copy link
Member

The last lingering issue mentioned in #1241 (comment) was closed by #1599 ! See

#1599 (comment)

@tomchor
Copy link
Collaborator Author

tomchor commented Jul 22, 2021

The last lingering issue mentioned in #1241 (comment) was closed by #1599 ! See

#1599 (comment)

It did, but the overall issue still remains, no?

I'd vote for us to keep this issue open until we can compile at least dissipation rate calculations with AbstractOperations. It's a a high bar I think, but having the issue open (which we actually reference in the docs) might motivate someone to take a look.

@glwagner
Copy link
Member

Interesting. I will open a new issue that's specific to the current issue with a concise summary. It's hard to parse the conversation in this long issue. I guess we will also have to update the docs for this.

@tomchor
Copy link
Collaborator Author

tomchor commented Jul 22, 2021

Sounds good. I'll change the docs after you open the new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants