
Fix advice on how to run code on multiple GPUs #1506

Closed
wants to merge 1 commit

Conversation

MasonProtter
Contributor

The old version caused segfaults on my multi-GPU setup, but this one works fine.
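For reference, a minimal sketch of the do-block pattern this PR recommends (illustration only, assuming using CUDA and a machine with at least two GPUs; the per-device work is made up):

using CUDA

# one task per device; the do-block form of device! binds the task to that
# device for the duration of the block and restores the previous device after
@sync begin
    for dev in devices()
        @async device!(dev) do
            a = CUDA.rand(1024)   # allocated on `dev`
            s = sum(a)            # reduction runs on `dev`
            @info "done" dev s
        end
    end
end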

@vchuravy
Member

The old way should not cause segfaults (but we should probably use @spawn).
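(For illustration only, a hedged sketch of the @spawn-based variant being alluded to, assuming using CUDA, a Julia session started with multiple threads, and at least two GPUs; this is not the wording the PR changes:)

using CUDA

@sync begin
    for dev in devices()
        Threads.@spawn begin
            device!(dev)                    # bind this task to its device first
            x = CUDA.zeros(Float32, 1024)   # lives on `dev`
            # ... per-device work here ...
        end
    end
end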

@MasonProtter
Contributor Author

MasonProtter commented May 13, 2022

Interesting. With the old way, it reliably segfaults for me on a workload I'm testing with ITensorGPU.jl; switching to @spawn seems to introduce a different error related to undefined references, but the do-block pattern works fine.

This is probably some weird thing about ITensorGPU.jl then. The only real difference I can find between these various forms is this:

julia> @sync begin
           @async device!(0) do
               outer_dev0 = device()
               @async begin
                   sleep(0.5)
                   inner_dev0 = device()
                   @show outer_dev0 inner_dev0
               end
           end
           @async device!(1) do
               outer_dev1 = device()
               @async begin
                   sleep(1)
                   inner_dev1 = device()
                   @show outer_dev1 inner_dev1
               end
           end
       end;
outer_dev0 = CuDevice(0)
inner_dev0 = CuDevice(0)
outer_dev1 = CuDevice(1)
inner_dev1 = CuDevice(0)

versus

julia> @sync begin
           @async begin 
               device!(0)
               outer_dev0 = device()
               @async begin
                   sleep(0.5)
                   inner_dev0 = device()
                   @show outer_dev0 inner_dev0
               end
           end
           @async begin
               device!(1)
               outer_dev1 = device()
               @async begin
                   sleep(1)
                   inner_dev1 = device()
                   @show outer_dev1 inner_dev1
               end
           end
       end;
outer_dev0 = CuDevice(0)
inner_dev0 = CuDevice(1)
outer_dev1 = CuDevice(1)
inner_dev1 = CuDevice(1)

I would argue that both behaviours are worryingly incorrect, but maybe the way the incorrect behaviour manifests in the second case interacts with ITensorGPU.jl in a more dangerous way? I'm not sure.

@MasonProtter
Contributor Author

MasonProtter commented May 13, 2022

Maybe this is an argument in favour of switching to ContextVariablesX.jl while we wait for JuliaLang/julia#35833?

julia> using ContextVariablesX

julia> @contextvar dev = 0;

julia> @sync begin
           @async with_context(dev => 0) do
               outer_dev0 = dev[]
               @async begin
                   sleep(0.5)
                   inner_dev0 = dev[]
                   @show outer_dev0 inner_dev0
               end
           end
           @async with_context(dev => 1) do
               outer_dev1 = dev[]
               @async begin
                   sleep(1)
                   inner_dev1 = dev[]
                   @show outer_dev1 inner_dev1
               end
           end
       end
outer_dev0 = 0
inner_dev0 = 0
outer_dev1 = 1
inner_dev1 = 1

@codecov

codecov bot commented May 13, 2022

Codecov Report

Merging #1506 (9e0f4c7) into master (e762285) will decrease coverage by 0.67%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1506      +/-   ##
==========================================
- Coverage   73.35%   72.68%   -0.68%     
==========================================
  Files         131      131              
  Lines        9825     9825              
==========================================
- Hits         7207     7141      -66     
- Misses       2618     2684      +66     
Impacted Files Coverage Δ
lib/cudnn/CUDNN.jl 37.50% <0.00%> (-35.94%) ⬇️
lib/cublas/CUBLAS.jl 50.00% <0.00%> (-25.44%) ⬇️
src/utilities.jl 68.91% <0.00%> (-4.06%) ⬇️
lib/cudadrv/CUDAdrv.jl 51.66% <0.00%> (-3.34%) ⬇️
lib/cudadrv/module/linker.jl 68.75% <0.00%> (-3.13%) ⬇️
lib/cudadrv/state.jl 77.84% <0.00%> (-2.28%) ⬇️
lib/cudadrv/memory.jl 78.59% <0.00%> (-1.01%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt
Member

maleadt commented May 16, 2022

switching to @spawn seems to introduce a different error related to undefined references

Undefined reference exceptions from CUDA.jl? That sounds like a bug, please file an issue.

About this issue, setting the device as the first operation in a task should be identical to the @async do block (except that it changes the device back afterwards). The behavior noted in #1506 (comment) is 'expected' because inspecting the device in a task without having set the device is undefined, as noted in the admonition that this PR is modifying. I'm not sure which behavior you expected (given Julia's current limitations)?

ContextVariables would be better, but these task-local look-ups are pretty performance sensitive, so I don't want to penalize them.
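As a hedged illustration of that point (again assuming using CUDA and at least two devices, with made-up variable names): if every task that queries the device also sets it first, the ambiguity in the nested-task examples above goes away.

using CUDA

@sync begin
    @async device!(1) do
        parent = device()      # CuDevice(1), set by the do-block
        @async begin
            device!(1)         # a child task does not reliably inherit this,
            child = device()   # so it sets the device before querying it
            @show parent child
        end
    end
end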

@MasonProtter
Contributor Author

Undefined reference exceptions from CUDA.jl? That sounds like a bug, please file an issue.

They're not coming from CUDA; they're coming from ITensorGPU.jl. I'll file an issue there.
