
Fix advice on how to run code on multiple GPUs #1506

Closed
wants to merge 1 commit

Conversation

MasonProtter
Contributor

The old version caused segfaults on my multi-GPU setup, but this one works fine.
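For reference, a minimal sketch of the do-block pattern this PR recommends (illustration only, assuming using CUDA and a machine with at least two GPUs; the per-device work is made up):

using CUDA

# one task per device; the do-block form of device! binds the task to that
# device for the duration of the block and restores the previous device after
@sync begin
    for dev in devices()
        @async device!(dev) do
            a = CUDA.rand(1024)   # allocated on `dev`
            s = sum(a)            # reduction runs on `dev`
            @info "done" dev s
        end
    end
end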

@vchuravy
Member

The old way should not cause segfaults (but we should probably use @spawn).
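(For illustration only, a hedged sketch of the @spawn-based variant being alluded to, assuming using CUDA, a Julia session started with multiple threads, and at least two GPUs; this is not the wording the PR changes:)

using CUDA

@sync begin
    for dev in devices()
        Threads.@spawn begin
            device!(dev)                    # bind this task to its device first
            x = CUDA.zeros(Float32, 1024)   # lives on `dev`
            # ... per-device work here ...
        end
    end
end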

@MasonProtter
Contributor Author

MasonProtter commented May 13, 2022

Interesting. With the old way, it reliably segfaults for me on a workload I'm testing with ITensorGPU.jl; switching to @spawn seems to introduce a different error related to undefined references, but the do-block pattern works fine.

This is probably some weird thing about ITensorGPU.jl then. The only real difference I can find between these various forms is this:

julia> @sync begin
           @async device!(0) do
               outer_dev0 = device()
               @async begin
                   sleep(0.5)
                   inner_dev0 = device()
                   @show outer_dev0 inner_dev0
               end
           end
           @async device!(1) do
               outer_dev1 = device()
               @async begin
                   sleep(1)
                   inner_dev1 = device()
                   @show outer_dev1 inner_dev1
               end
           end
       end;
outer_dev0 = CuDevice(0)
inner_dev0 = CuDevice(0)
outer_dev1 = CuDevice(1)
inner_dev1 = CuDevice(0)

versus

julia> @sync begin
           @async begin 
               device!(0)
               outer_dev0 = device()
               @async begin
                   sleep(0.5)
                   inner_dev0 = device()
                   @show outer_dev0 inner_dev0
               end
           end
           @async begin
               device!(1)
               outer_dev1 = device()
               @async begin
                   sleep(1)
                   inner_dev1 = device()
                   @show outer_dev1 inner_dev1
               end
           end
       end;
outer_dev0 = CuDevice(0)
inner_dev0 = CuDevice(1)
outer_dev1 = CuDevice(1)
inner_dev1 = CuDevice(1)

I would argue that both behaviours are worryingly incorrect, but maybe the way the incorrect behaviour manifests in the second case interacts with ITensorGPU.jl in a more dangerous way? I'm not sure.

@MasonProtter
Contributor Author

MasonProtter commented May 13, 2022

Maybe this is an argument in favour of switching to ContextVariablesX.jl while we wait for JuliaLang/julia#35833?

julia> using ContextVariablesX

julia> @contextvar dev = 0;

julia> @sync begin
           @async with_context(dev => 0) do
               outer_dev0 = dev[]
               @async begin
                   sleep(0.5)
                   inner_dev0 = dev[]
                   @show outer_dev0 inner_dev0
               end
           end
           @async with_context(dev => 1) do
               outer_dev1 = dev[]
               @async begin
                   sleep(1)
                   inner_dev1 = dev[]
                   @show outer_dev1 inner_dev1
               end
           end
       end
outer_dev0 = 0
inner_dev0 = 0
outer_dev1 = 1
inner_dev1 = 1

@codecov

codecov bot commented May 13, 2022

Codecov Report

Merging #1506 (9e0f4c7) into master (e762285) will decrease coverage by 0.67%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1506      +/-   ##
==========================================
- Coverage   73.35%   72.68%   -0.68%     
==========================================
  Files         131      131              
  Lines        9825     9825              
==========================================
- Hits         7207     7141      -66     
- Misses       2618     2684      +66     
Impacted Files Coverage Δ
lib/cudnn/CUDNN.jl 37.50% <0.00%> (-35.94%) ⬇️
lib/cublas/CUBLAS.jl 50.00% <0.00%> (-25.44%) ⬇️
src/utilities.jl 68.91% <0.00%> (-4.06%) ⬇️
lib/cudadrv/CUDAdrv.jl 51.66% <0.00%> (-3.34%) ⬇️
lib/cudadrv/module/linker.jl 68.75% <0.00%> (-3.13%) ⬇️
lib/cudadrv/state.jl 77.84% <0.00%> (-2.28%) ⬇️
lib/cudadrv/memory.jl 78.59% <0.00%> (-1.01%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt
Member

maleadt commented May 16, 2022

switching to @spawn seems to introduce a different error related to undefined references

Undefined reference exceptions from CUDA.jl? That sounds like a bug, please file an issue.

About this issue, setting the device as the first operation in a task should be identical to the @async do block (except that it changes the device back afterwards). The behavior noted in #1506 (comment) is 'expected' because inspecting the device in a task without having set the device is undefined, as noted in the admonition that this PR is modifying. I'm not sure which behavior you expected (given Julia's current limitations)?

ContextVariables would be better, but these task-local look-ups are pretty performance sensitive, so I don't want to penalize them.
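As a hedged illustration of that point (again assuming using CUDA and at least two devices, with made-up variable names): if every task that queries the device also sets it first, the ambiguity in the nested-task examples above goes away.

using CUDA

@sync begin
    @async device!(1) do
        parent = device()      # CuDevice(1), set by the do-block
        @async begin
            device!(1)         # a child task does not reliably inherit this,
            child = device()   # so it sets the device before querying it
            @show parent child
        end
    end
end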

@MasonProtter
Contributor Author

Undefined reference exceptions from CUDA.jl? That sounds like a bug, please file an issue.

They're not coming from CUDA; they're coming from ITensorGPU.jl. I'll file an issue there.
