GC integration: multiple threads fail to free each other's objects, leading to OOM #1522
Comments
That's a wrong expectation. For one, memory allocations are garbage collected, so it might take a while before they get freed; secondly, there's a caching layer in CUDA.jl's allocator (the memory pool), so freed buffers are not necessarily returned to the driver right away.
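For context, that caching behaviour can be observed directly. A minimal sketch, assuming the documented `CUDA.memory_status()` and `CUDA.reclaim()` helpers (the exact output format depends on the CUDA.jl version):

```julia
using CUDA

# Allocate a largish array, then drop the reference.
a = CuArray{Float32}(undef, 64 * 1024 * 1024)  # ~256 MiB
a = nothing

GC.gc()               # let Julia's GC run the array's finalizer
CUDA.memory_status()  # the pool may still hold the freed block

CUDA.reclaim()        # ask CUDA.jl to return cached memory to the driver
CUDA.memory_status()
```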
@maleadt but why the OOM error then? Shouldn't GPU memory be GC'ed automatically if necessary?
The issue didn't demonstrate an actual OOM, so I'm guessing that statement was a hypothetical? Either way, it shouldn't OOM: our allocator will forcibly free memory (by calling the GC) if some is needed.
@lmh91 could you change your example to demo the OOM errors you've encountered?
I tried to make the MWE as small as possible, and it seems I actually removed the part that creates the OOM error in my use case: using multiple threads. When running the loop on a single thread there is no problem; the OOM error is produced when running the loop over multiple threads via `Threads.@threads`.
Thanks, I can reproduce using this MWE. Will have a look.
MWE:

```julia
using CUDA

function main()
    Threads.@threads for i in 1:100000
        CuArray{Float32}(undef, (1024, 100))
        nothing
    end
end

isinteractive() || main()
```

This looks like us calling into the GC being broken when using threads.
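Not a fix for the GC integration problem itself, but for comparison: if the arrays are freed eagerly, the allocation path no longer depends on the GC collecting another thread's objects. A hedged variant of the MWE above, assuming `CUDA.unsafe_free!` (which makes any further use of the freed array invalid):

```julia
using CUDA

function main()
    Threads.@threads for i in 1:100000
        a = CuArray{Float32}(undef, (1024, 100))
        # ... work with `a` ...
        CUDA.unsafe_free!(a)  # return the buffer to the pool immediately
    end
end

isinteractive() || main()
```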
Thanks @maleadt!
Even smaller:

```julia
using CUDA

function main()
    Threads.@threads for i in 1:30
        CuArray{UInt8}(undef, (1024, 1024, 1024)) # 1 GiB
        nothing
    end
end

isinteractive() || main()
```

This also OOMs.
During the allocation retries, we put incrementally more effort into freeing memory, including calls to the GC. But as you can see from the trace, those calls don't free another thread's objects in time. I'm not sure how to proceed here, so I've asked @chflood to have a look.
MWE without CUDA.jl:

```julia
const LIMIT = 14

# dummy atomic allocator
const memory = Threads.Atomic{Int}(0)

function alloc()
    println("thread $(Threads.threadid()): try alloc ($(memory[])/$(LIMIT) used)")
    while true
        old_memory = memory[]
        new_memory = old_memory + 1
        if new_memory > LIMIT
            printstyled("thread $(Threads.threadid()): alloc failure\n"; color=:yellow)
            return false
        end
        if Threads.atomic_cas!(memory, old_memory, new_memory) == old_memory
            println("thread $(Threads.threadid()): alloc success")
            return true
        end
    end
end

function free()
    printstyled("thread $(Threads.threadid()): free ($(memory[])/$(LIMIT) used)\n"; color=:green)
    while true
        old_memory = memory[]
        new_memory = old_memory - 1
        @assert new_memory >= 0
        if Threads.atomic_cas!(memory, old_memory, new_memory) == old_memory
            return
        end
    end
end

# dummy array
mutable struct CuArray
    function CuArray()
        success = alloc()
        if !success
            printstyled("thread $(Threads.threadid()): GC.gc(false)\n"; color=:magenta)
            GC.gc(false)
            success = alloc()
        end
        if !success
            printstyled("thread $(Threads.threadid()): GC.gc(true)\n"; color=:magenta)
            GC.gc(true)
            success = alloc()
        end
        if !success
            printstyled("thread $(Threads.threadid()): alloc really failed\n"; color=:red)
            throw(OutOfMemoryError())
        end
        obj = new()
        finalizer(obj) do _
            free()
        end
    end
end

function main()
    Threads.@threads for i in 1:30
        CuArray()
        nothing
    end
end

isinteractive() || main()
```
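As with the CUDA MWEs, this dummy script only exercises the failure path when Julia is started with multiple threads, e.g. `julia --threads=4 mwe.jl` (the file name is an assumption); with a single thread, the GC retries free the objects from earlier iterations and the allocation eventually succeeds.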
I am also seeing a GPU OOM while training a neural network with CUDA.jl 5.1.1, Julia 1.9.4, and 2 threads (it works fine with 1 thread). The CUDA MWE also fails on my system (NVIDIA A100-SXM4-40GB). Is there any known work-around? I already tried downgrading CUDA.jl, but without success. Thank you for your time!
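One mitigation that is sometimes tried in loops like this (a sketch only, not from this thread, and not a fix for the underlying issue) is to periodically trigger collection and pool reclamation, combined where possible with the eager freeing sketched earlier. The loop body, data, and `train_step!` here are dummy stand-ins for the actual training code:

```julia
using CUDA

# Dummy stand-ins for a real data loader and training step (hypothetical names).
batches = [rand(Float32, 1024, 100) for _ in 1:200]
train_step!(x) = sum(abs2, x)   # placeholder "work" on the GPU

for (i, batch) in enumerate(batches)
    gpu_batch = CuArray(batch)
    train_step!(gpu_batch)
    if i % 100 == 0
        GC.gc(false)     # run a quick collection so pending finalizers fire
        CUDA.reclaim()   # return cached blocks from the pool to the driver
    end
end
```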
I am trying to add […] and I am getting the following error:

```
error in running finalizer: ArgumentError(msg="Attempt to release freed data.")
error in running finalizer: ArgumentError(msg="Attempt to release freed data.")
release at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/GPUArrays/dAUOE/src/host/abstractarray.jl:38
unsafe_free! at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/GPUArrays/dAUOE/src/host/abstractarray.jl:90 [inlined]
unsafe_finalize! at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/array.jl:113
unknown function (ip: 0x7f0fa044c512)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gc.c:553
jl_mutex_unlock at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/julia_locks.h:81 [inlined]
jl_generate_fptr_impl at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/jitlayers.cpp:467
jl_compile_method_internal at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2348 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2237
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2750 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
release at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/GPUArrays/dAUOE/src/host/abstractarray.jl:42
unsafe_free! at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/GPUArrays/dAUOE/src/host/abstractarray.jl:90 [inlined]
unsafe_free! at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/array.jl:112 [inlined]
#scan!#1158 at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/accumulate.jl:194
unknown function (ip: 0x7f0fa0427b1c)
unknown function (ip: 0x7f0fa041cc69)
unknown function (ip: 0x7f0fa041cc2e)
scan! at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/accumulate.jl:135 [inlined]
_accumulate! at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/accumulate.jl:203 [inlined]
#accumulate!#877 at ./accumulate.jl:340 [inlined]
accumulate! at ./accumulate.jl:337 [inlined]
_cumsum! at ./accumulate.jl:61 [inlined]
#cumsum!#869 at ./accumulate.jl:51 [inlined]
cumsum! at ./accumulate.jl:49 [inlined]
#cumsum#870 at ./accumulate.jl:113 [inlined]
cumsum at ./accumulate.jl:111 [inlined]
cumsum at ./accumulate.jl:144 [inlined]
findall at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/indexing.jl:25
to_index at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/CUDA/YIj5X/src/indexing.jl:14 [inlined]
_to_indices1 at ./indices.jl:359 [inlined]
to_indices at ./indices.jl:354 [inlined]
to_indices at ./indices.jl:345 [inlined]
view at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/GPUArrays/dAUOE/src/host/base.jl:312 [inlined]
maybeview at ./views.jl:148
unknown function (ip: 0x7f0fa041f26b)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
dotview at ./broadcast.jl:1214 [inlined]
getobs at /gpfs/home/acad/ulg-gher/abarth/projects-test-orig/Julia/share/diffusion_model.jl:452
getobs at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/MLUtils/n3C0h/src/obsview.jl:187 [inlined]
_getbatch at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/MLUtils/n3C0h/src/batchview.jl:144 [inlined]
getindex at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/MLUtils/n3C0h/src/batchview.jl:129 [inlined]
getobs at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/MLUtils/n3C0h/src/observation.jl:110 [inlined]
getobs at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
#58 at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/MLUtils/n3C0h/src/parallel.jl:66
unknown function (ip: 0x7f14bc0ccc4c)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
macro expansion at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/MLUtils/n3C0h/src/parallel.jl:124 [inlined]
##reducing_function#293#68 at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/FLoops/6PVny/src/reduce.jl:817 [inlined]
AdjoinIdentity at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/InitialValues/OWP8V/src/InitialValues.jl:306
next at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/combinators.jl:290 [inlined]
next at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/core.jl:287 [inlined]
macro expansion at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/core.jl:181 [inlined]
_foldl_array at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/processes.jl:187 [inlined]
__foldl__ at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/processes.jl:182 [inlined]
foldl_basecase at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/processes.jl:361 [inlined]
_reduce_basecase at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/threading_utils.jl:58
_reduce at /gpfs/home/acad/ulg-gher/abarth/.julia/packages/Transducers/KcCBR/src/reduce.jl:139
#177 at ./threadingconstructs.jl:416
unknown function (ip: 0x7f14bc0ccc7f)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/task.c:1092
nb_parameters 2359169
WARNING: Error while freeing DeviceBuffer(240.000 KiB at 0x0000000c27443600):
ErrorException("task switch not allowed from inside gc finalizer")
```

Manifest.toml (machine-generated, julia_version = "1.9.4"; the dependency list includes AbstractFFTs, Adapt, CUDA, cuDNN, Flux, MLUtils, NCDatasets, Transducers, and Zygote, among others).
That is a different issue; what's described here is that our GC calls are ineffective with multiple threads, leading to an OOM. You're describing an error that shouldn't occur. Please file a new issue with an MWE so that I can take a look!
The bug
GPU memory is not freed (fast enough?) when a computation that requires only a little memory is performed in parallel on multiple threads.
MWE
I came across this issue when using a multi-layer ML model via Flux.jl on a GPU inside a multi-threaded optimizer.
I was able to reproduce the issue with only CUDA.jl and Adapt.jl:
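The attached script is not reproduced above; as a rough, hypothetical sketch of the kind of program described (a few "permanent" GPU arrays standing in for the layers, a small `eval_model`, and a `Threads.@threads` loop; names and sizes are assumptions):

```julia
using CUDA, Adapt

function main()
    # "permanent" GPU memory standing in for the model layers (sizes assumed)
    W1 = CUDA.rand(Float32, 100, 100)
    W2 = CUDA.rand(Float32, 100, 100)
    eval_model(x) = W2 * tanh.(W1 * x)

    Threads.@threads for i in 1:100_000
        x = adapt(CuArray, rand(Float32, 100))  # small per-iteration temporary
        eval_model(x)
        nothing
    end
end

isinteractive() || main()
```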
My output of running the script via `julia --project=. main.jl`:

Manifest.toml
Expected behavior
The "permanent" GPU memories (the layers) and the function
eval_model
only uses a very small amount of the GPU memory.Even when performing the function in parallel on multiple threads there should not be any GPU memory issue
(with the size of the arrays in the above MWE).
However, a OOM error is produced.
Version info
Details on Julia:
Details on CUDA: