Add Metal atomics backend following the AtomixCUDA package - almost verbatim. #39

Merged: 8 commits, Nov 8, 2024

@anicusan
Contributor

anicusan commented Oct 30, 2024

Tests pass on my Mac M3. The only change relative to AtomixCUDA is the use of Int32 instead of Int. Are there any plans to add support for 64-bit atomics, at least when natively supported on the M2 and up?

@maleadt added the enhancement label Nov 6, 2024

@maleadt (Member) left a comment

Thanks!

@maleadt
Member

maleadt commented Nov 6, 2024

> Tests pass on my Mac M3.

Did you also verify this against KernelAbstractions?

> are there any plans to add support for 64-bit atomics, at least when natively supported on the M2 and up?

No concrete plans, but it shouldn't be too hard to generalize the functionality for that in Metal.jl


I've also bumped the version numbers here to v1.0 to get out of the dreaded v0.x regime, so it will take until KA.jl bumps compat for this to actually be installable.

@anicusan
Contributor Author

anicusan commented Nov 6, 2024

Thanks a lot for going over this! We're waiting on atomics in ImplicitBVH.jl to make it work across the JuliaGPU stacks.

I made a local copy of KernelAbstractions and bumped the [compat] to Atomix = "1.0", and dev-ed Atomix, AtomixMetal, and KernelAbstractions. I tried running the following code:

using KernelAbstractions
using Atomix
using Metal

@kernel cpu=false function atomic_add_ka!(v)
    i = @index(Global)
    Atomix.@atomic v[i] += eltype(v)(1)
end

v = Metal.zeros(Int32, 1000)
atomic_add_ka!(get_backend(v), 128)(v, ndrange=length(v))
@assert all(Array(v) .== 1)

Which gives me the following error:

ERROR: LoadError: Compilation to native code failed; see below for details.
If you think this is a bug, please file an issue and attach /var/folders/gk/pdh0y2f100s3z_kkb9wv50tr0000gn/T/jl_wR4DG9gR0D.metallib
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/compilation.jl:195 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/ObjectiveC/C7BVt/src/os.jl:264 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/compilation.jl:178 [inlined]
  [5] (::Metal.var"#171#172"{Bool, GPUCompiler.CompilerJob{…}, @NamedTuple{…}})()
    @ Metal ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:637
  [6] macro expansion
    @ ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:565 [inlined]
  [7] macro expansion
    @ ./lock.jl:273 [inlined]
  [8] ObjectiveC.Foundation.NSAutoreleasePool(f::Metal.var"#171#172"{Bool, GPUCompiler.CompilerJob{…}, @NamedTuple{…}})
    @ ObjectiveC.Foundation ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:557
  [9] link(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{image::Vector{UInt8}, entry::String}; return_function::Bool)
    @ Metal ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:636
 [10] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(Metal.compile), linker::typeof(Metal.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:262
 [11] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:151
 [12] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:189 [inlined]
 [13] macro expansion
    @ ./lock.jl:273 [inlined]
 [14] mtlfunction(f::typeof(gpu_atomic_add_ka!), tt::Type{Tuple{…}}; name::Nothing, kwargs::@Kwargs{})
    @ Metal ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:184
 [15] mtlfunction
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:182 [inlined]
 [16] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:85 [inlined]
 [17] (::KernelAbstractions.Kernel{…})(args::MtlVector{…}; ndrange::Int64, workgroupsize::Nothing)
    @ Metal.MetalKernels ~/.julia/packages/Metal/JtmpJ/src/MetalKernels.jl:110
 [18] top-level scope
    @ ~/Prog/Julia/Packages/Atomix.jl-fork/prototype/atomic_add_test.jl:14
 [19] include(fname::String)
    @ Main ./sysimg.jl:38
 [20] top-level scope
    @ REPL[12]:1
 [21] top-level scope
    @ ~/.julia/packages/Metal/JtmpJ/src/initialization.jl:72
in expression starting at /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/atomic_add_test.jl:14

caused by: NSError: Failed to materializeAll. (AGXMetalG15X_M1, code 3)
Stacktrace:
  [1] Metal.MTL.MTLComputePipelineState(dev::Metal.MTL.MTLDeviceInstance, fun::Metal.MTL.MTLFunctionInstance)
    @ Metal.MTL ~/.julia/packages/Metal/JtmpJ/lib/mtl/compute_pipeline.jl:60
  [2] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/compilation.jl:183 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/ObjectiveC/C7BVt/src/os.jl:264 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/compilation.jl:178 [inlined]
  [5] (::Metal.var"#171#172"{Bool, GPUCompiler.CompilerJob{…}, @NamedTuple{…}})()
    @ Metal ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:637
  [6] macro expansion
    @ ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:565 [inlined]
  [7] macro expansion
    @ ./lock.jl:273 [inlined]
  [8] ObjectiveC.Foundation.NSAutoreleasePool(f::Metal.var"#171#172"{Bool, GPUCompiler.CompilerJob{…}, @NamedTuple{…}})
    @ ObjectiveC.Foundation ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:557
  [9] link(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{image::Vector{UInt8}, entry::String}; return_function::Bool)
    @ Metal ~/.julia/packages/ObjectiveC/C7BVt/src/foundation.jl:636
 [10] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(Metal.compile), linker::typeof(Metal.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:262
 [11] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:151
 [12] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:189 [inlined]
 [13] macro expansion
    @ ./lock.jl:273 [inlined]
 [14] mtlfunction(f::typeof(gpu_atomic_add_ka!), tt::Type{Tuple{…}}; name::Nothing, kwargs::@Kwargs{})
    @ Metal ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:184
 [15] mtlfunction
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:182 [inlined]
 [16] macro expansion
    @ ~/.julia/packages/Metal/JtmpJ/src/compiler/execution.jl:85 [inlined]
 [17] (::KernelAbstractions.Kernel{…})(args::MtlVector{…}; ndrange::Int64, workgroupsize::Nothing)
    @ Metal.MetalKernels ~/.julia/packages/Metal/JtmpJ/src/MetalKernels.jl:110
 [18] top-level scope
    @ ~/Prog/Julia/Packages/Atomix.jl-fork/prototype/atomic_add_test.jl:14
 [19] include(fname::String)
    @ Main ./sysimg.jl:38
 [20] top-level scope
    @ REPL[12]:1
 [21] top-level scope
    @ ~/.julia/packages/Metal/JtmpJ/src/initialization.jl:72
Some type information was truncated. Use `show(err)` to see complete types.

When running @macroexpand on the kernel I see:

julia> @macroexpand @kernel cpu=false function atomic_add_ka!(v)
           i = @index(Global)
           Atomix.@atomic v[i] += eltype(v)(1)
       end
quote
    function gpu_atomic_add_ka!(__ctx__, v; )
        let
            #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:96 =#
            if (KernelAbstractions.__validindex)(__ctx__)
                #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:97 =#
                begin
                    #= REPL[15]:1 =#
                    #= REPL[15]:2 =#
                    i = KernelAbstractions.__index_Global_Linear(__ctx__)
                    #= REPL[15]:3 =#
                    ((Atomix.Internal.Atomix).modify!((Atomix.Internal.referenceable(v))[i], +, (eltype(v))(1), UnsafeAtomics.seq_cst))[2]
                end
            end
            #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:99 =#
            return nothing
        end
    end
    begin
        #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:54 =#
        if !($(Expr(:isdefined, :atomic_add_ka!)))
            #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:55 =#
            begin
                $(Expr(:meta, :doc))
                atomic_add_ka!(dev) = begin
                        #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:55 =#
                        atomic_add_ka!(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
                    end
            end
            #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:56 =#
            atomic_add_ka!(dev, size) = begin
                    #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:56 =#
                    atomic_add_ka!(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
                end
            #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:57 =#
            atomic_add_ka!(dev, size, range) = begin
                    #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:57 =#
                    atomic_add_ka!(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
                end
            #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:58 =#
            function atomic_add_ka!(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
                #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:58 =#
                #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:59 =#
                if (KernelAbstractions.isgpu)(dev)
                    #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:60 =#
                    return (KernelAbstractions.construct)(dev, sz, range, gpu_atomic_add_ka!)
                else
                    #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:62 =#
                    if false
                        #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:63 =#
                        return (KernelAbstractions.construct)(dev, sz, range, cpu_atomic_add_ka!)
                    else
                        #= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:65 =#
                        error("This kernel is unavailable for backend CPU")
                    end
                end
            end
        end
    end
end

Metal.versioninfo() gives me:

julia> Metal.versioninfo()
macOS 15.0.1, Darwin 24.0.0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

Julia packages: 
- Metal.jl: 1.4.2
- GPUArrays: 10.3.1
- GPUCompiler: 0.27.8
- KernelAbstractions: 0.9.29
- ObjectiveC: 3.1.0
- LLVM: 9.1.3
- LLVMDowngrader_jll: 0.3.0+2

1 device:
- Apple M3 Max (464.000 KiB allocated)

And versioninfo() gives me:

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 14 × Apple M3 Max
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m3)
Threads: 1 default, 0 interactive, 1 GC (on 10 virtual cores)

I do not know enough about the KernelAbstractions/JuliaGPU compilation pipeline to debug this further - @vchuravy or @maleadt , would you have any pointers?

@anicusan
Contributor Author

anicusan commented Nov 6, 2024

If we make this work, I will implement AtomixoneAPI too, so Atomix v1.0 can work on all KernelAbstractions backends. Is AMDGPU using the same Atomix backend for atomics? If so, I can write AtomixAMDGPU in a similar way, but I don't have an AMD GPU on hand to test it.

@maleadt
Member

maleadt commented Nov 6, 2024

Hmm, that doesn't bode well. Some bad LLVM IR is being generated here, causing the Metal back-end compiler to abort. Can you upload the metallib from the error message here?

@anicusan
Contributor Author

anicusan commented Nov 6, 2024

Here's a zip with the jl_tgfgHYNGNa.metallib generated:
jl_tgfgHYNGNa.zip

@tgymnich

tgymnich commented Nov 6, 2024

> are there any plans to add support for 64-bit atomics, at least when natively supported on the M2 and up?

added an issue to track progress on this: JuliaGPU/Metal.jl#477

@maleadt
Member

maleadt commented Nov 7, 2024

> Here's a zip with the jl_tgfgHYNGNa.metallib generated:
> jl_tgfgHYNGNa.zip

Invalid record (Producer: 'LLVM16.0.6' Reader: 'LLVM 16.0.6jl')

Are you using a non-official version of Julia? If so, it's strongly recommended to use juliaup with official builds.

@anicusan
Contributor Author

anicusan commented Nov 7, 2024

Hi, no, I am using the official Julia distribution installed via Juliaup - see my versioninfo in the previous comment. Then again, normal Metal kernels do work - it's just atomics that produce the error above.

@maleadt
Member

maleadt commented Nov 7, 2024

It looks like the IR is actually corrupt -- the producer/reader mismatch is only a red herring. Probably an issue with our IR downgrader. I'll take a closer look.

@maleadt
Member

maleadt commented Nov 7, 2024

The invalid IR comes from atomicrmw not being supported by the downgrader. I can fix that, but it won't help you, as Metal doesn't support LLVM's native atomics. The fact that an atomicrmw is being emitted by Atomix.jl indicates that this PR is incomplete. I guess it should use atomic_fetch_OP_explicit instead; see https://github.com/JuliaGPU/Metal.jl/blob/9019b56c05055db3e5dcd93ca0d08bf264c908cd/src/device/intrinsics/atomics.jl#L205-L238 for how Metal.jl implements this for Metal.@atomic.


EDIT: The unsupported IR in question:

define void @kernel(ptr %ptr) {
  %1 = atomicrmw add ptr %ptr, i32 0 monotonic, align 4
  %2 = cmpxchg ptr %ptr, i32 0, i32 1 monotonic monotonic
  ret void
}

@anicusan
Contributor Author

anicusan commented Nov 7, 2024

That's odd - why did the tests (copied verbatim from AtomixCUDA) work then? I'm asking to 1) understand it, and 2) then write a test covering this failed case.

The operations in Atomix.modify! already use atomic_fetch_OP_explicit as defined in Metal.jl - this is what I wrote:

@inline function Atomix.modify!(ref::MtlIndexableRef, op::OP, x, order) where {OP}
    x = convert(eltype(ref), x)
    ptr = Atomix.pointer(ref)
    begin
        old = if op === (+)
            Metal.atomic_fetch_add_explicit(ptr, x)
        elseif op === (-)
            Metal.atomic_fetch_sub_explicit(ptr, x)
        elseif op === (&)
            Metal.atomic_fetch_and_explicit(ptr, x)
        elseif op === (|)
            Metal.atomic_fetch_or_explicit(ptr, x)
        elseif op === xor
            Metal.atomic_fetch_xor_explicit(ptr, x)
        elseif op === min
            Metal.atomic_fetch_min_explicit(ptr, x)
        elseif op === max
            Metal.atomic_fetch_max_explicit(ptr, x)
        else
            error("not implemented")
        end
    end
    return old => op(old, x)
end

@maleadt
Member

maleadt commented Nov 7, 2024

> That's odd - why did the tests (copied verbatim from AtomixCUDA) work then?

Presumably because those tests didn't trigger atomicrmw emission? In CUDA, the LLVM atomics are partially supported, explaining why this isn't needed for CUDA.jl.

Here's where the atomicrmw comes from (you can see this using @device_code_llvm):

; │┌ @ /Users/tim/Julia/pkg/Atomix/src/core.jl:30 within `modify!`
; ││┌ @ /Users/tim/Julia/pkg/Atomix/src/references.jl:99 within `pointer` @ /Users/tim/Julia/pkg/Metal/src/device/array.jl:64
; │││┌ @ abstractarray.jl:1236 within `_memory_offset`
; ││││┌ @ int.jl:88 within `*`
       %16 = shl nuw nsw i64 %14, 2
       %17 = add nsw i64 %16, -4
; │││└└
; │││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:147 within `+`
; ││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `add_ptr`
; │││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
        %18 = getelementptr i8, i8 addrspace(1)* %.unpack, i64 %17
; ││└└└└
; ││ @ /Users/tim/Julia/pkg/Atomix/src/core.jl:33 within `modify!` @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/internal.jl:23 @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/internal.jl:23
; ││┌ @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/atomics.jl:399 within `atomic_pointermodify`
; │││┌ @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/atomics.jl:260 within `llvm_atomic_op`
; ││││┌ @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/atomics.jl:260 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
       %19 = bitcast i8 addrspace(1)* %18 to i32 addrspace(1)*
       %20 = atomicrmw add i32 addrspace(1)* %19, i32 1 seq_cst, align 4

Interestingly though, after fixing the downgrader your example does just work. I guess Apple has recently added some support for native LLVM atomics to Metal? Something to look into, but if you want to be sure I'd try to use the explicit AIR atomic intrinsics where possible for now.
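
The @device_code_llvm inspection mentioned above can be sketched as follows. This is a minimal illustration that reuses the kernel from earlier in the thread and assumes a Mac with a Metal-capable GPU; the variable name `kernel!` is only illustrative. On the package versions discussed here, the launch itself may still fail at link time even if the IR has been printed.

```julia
using KernelAbstractions
using Atomix
using Metal

@kernel cpu=false function atomic_add_ka!(v)
    i = @index(Global)
    Atomix.@atomic v[i] += eltype(v)(1)
end

v = Metal.zeros(Int32, 1000)
kernel! = atomic_add_ka!(get_backend(v), 128)

# Print the LLVM IR of the GPU code compiled while running this launch.
# In the output, look for either an `atomicrmw` instruction (LLVM's native
# atomics) or a call to an `air.atomic.*` intrinsic (Metal's explicit atomics).
Metal.@device_code_llvm kernel!(v, ndrange=length(v))
```

Which of the two forms appears in the dump shows whether a backend-specific `modify!` method was actually dispatched to.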

@anicusan
Contributor Author

anicusan commented Nov 7, 2024

First, thank you again for taking the time to investigate all this.

The AtomixMetal tests do end up using our implementation of Atomix.modify!(ref::MtlIndexableRef, ...), which forwards the calls to Metal.atomic_fetch_OP_explicit (these are the explicit AIR intrinsics you mentioned, right?).

The odd part is that KernelAbstractions kernels using atomics end up emitting LLVM IR for them, and not anything via AtomixMetal - as seen in your call stack, the IR comes from UnsafeAtomicsLLVM, completely circumventing AtomixMetal. If CUDA does support LLVM IR, then that explains why it was working in CUDA, even if it was not actually using AtomixCUDA, just UnsafeAtomicsLLVM directly.

But wasn't KernelAbstractions' use of Atomix supposed to use the right Atomix{backend} package? Or are they not needed anymore?

If it seems we now accidentally have Metal atomics in KA because of LLVM IR I'm happy, but I'm still curious about the AtomixMetal / AtomixCUDA stacks, which may still be needed for other backends in the future.

@maleadt
Member

maleadt commented Nov 7, 2024

> But wasn't KernelAbstractions' use of Atomix supposed to use the right Atomix{backend} package?

I thought so as well; cc @vchuravy.

Note that a better design would be to use LLVM atomics everywhere and do the lowering to backend-specific intrinsics (like AIR's) in GPUCompiler.jl, but that's a redesign I don't have the time for (JuliaGPU/GPUCompiler.jl#479).

@pxl-th

pxl-th commented Nov 7, 2024

> Is AMDGPU using the same Atomix backend for atomics?

AMDGPU.jl already uses Atomix for atomics and it does not need any special handling, since we rely directly on LLVM atomics for this.
The only special bit is that we specify syncscope to enable hardware FP atomics.

@christiangnrd
Contributor

using KernelAbstractions
using Atomix
using Metal

@kernel cpu=false function atomic_add_ka!(v)
    i = @index(Global)
    Atomix.@atomic v[i] += eltype(v)(1)
end

v = Metal.zeros(Int32, 1000)
atomic_add_ka!(get_backend(v), 128)(v, ndrange=length(v))
@assert all(Array(v) .== 1)

I believe this fails because you forgot to using AtomixMetal, whereas the tests pass because they do load it.

I think both AtomixCUDA and AtomixMetal should be deprecated and converted to extensions. It might even make more sense to have those extensions live in their respective repositories instead of here. Although it may be easier for CI if they live here.

@maleadt
Member

maleadt commented Nov 7, 2024

> I believe this fails because you forgot to using AtomixMetal, whereas the tests pass because they do load it.

That seems to be correct.

; │┌ @ /Users/tim/Julia/pkg/Atomix/lib/AtomixMetal/src/AtomixMetal.jl:35 within `modify!`
; ││┌ @ /Users/tim/Julia/pkg/Atomix/src/references.jl:99 within `pointer` @ /Users/tim/Julia/pkg/Metal/src/device/array.jl:64
; │││┌ @ abstractarray.jl:1236 within `_memory_offset`
; ││││┌ @ int.jl:88 within `*`
       %16 = shl nuw nsw i64 %14, 2
       %17 = add nsw i64 %16, -4
; │││└└
; │││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:147 within `+`
; ││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `add_ptr`
; │││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
        %18 = getelementptr i8, i8 addrspace(1)* %.unpack, i64 %17
; ││└└└└
; ││ @ /Users/tim/Julia/pkg/Atomix/lib/AtomixMetal/src/AtomixMetal.jl:38 within `modify!`
; ││┌ @ /Users/tim/Julia/pkg/Metal/src/device/intrinsics/atomics.jl:84 within `atomic_fetch_add_explicit`
; │││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:344 within `macro expansion`
; ││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:182 within `_typed_llvmcall`
; │││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:182 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
        %19 = bitcast i8 addrspace(1)* %18 to i32 addrspace(1)*
        %20 = call i32 @air.atomic.global.add.s.i32(i32 addrspace(1)* %19, i32 1, i32 0, i32 2, i1 true)

@christiangnrd
Contributor

I opened JuliaGPU/CUDA.jl#2549. If it works out, I think we should do the same with the Metal backend instead of a new package.

@maleadt
Member

maleadt commented Nov 8, 2024

I suggest we merge this, and then focus on converting the current subpackages to extensions (as proposed by @christiangnrd) before releasing v1.0.

@maleadt merged commit bb8edda into JuliaConcurrent:main on Nov 8, 2024
5 of 8 checks passed

@christiangnrd
Contributor

christiangnrd commented Nov 8, 2024

I'll have a PR for extensions within a couple hours.

Unless it's already being worked on. Then I won't bother finishing it.

@maleadt
Member

maleadt commented Nov 8, 2024

> I'll have a PR for extensions within a couple hours.

Awesome! I won't be able to get to it this week, so feel free 🙂

@anicusan
Contributor Author

anicusan commented Nov 8, 2024

Thanks for all the help in this conversation! Yes, adding using AtomixMetal explicitly fixes it:

using KernelAbstractions
using Atomix
using AtomixMetal
using Metal

# Have two threads concurrently increment each element
@kernel cpu=false function atomic_add_ka!(v)
    i = @index(Global)
    Atomix.@atomic v[(i - 1) ÷ 2 + 1] += eltype(v)(1)
end

v = Metal.zeros(Int32, 1000)
atomic_add_ka!(get_backend(v), 128)(v, ndrange=length(v) * 2)
@assert all(Array(v) .== 2)

This is a bit surprising - @christiangnrd 's PR will be very useful in codebases using KernelAbstractions with atomics.
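
The same two-writers-per-element check can also be sketched on the CPU, without any GPU backend. This is only an illustration and not part of the PR's test suite: it assumes Atomix is installed, the function name `atomic_add_cpu!` is hypothetical, and the writes are only truly concurrent if Julia is started with more than one thread.

```julia
using Atomix

# Two loop iterations increment each element, mirroring the GPU kernel above.
function atomic_add_cpu!(v)
    Threads.@threads for i in 1:(2 * length(v))
        Atomix.@atomic v[(i - 1) ÷ 2 + 1] += eltype(v)(1)
    end
    return v
end

v = zeros(Int32, 1000)
atomic_add_cpu!(v)
@assert all(v .== 2)
```

Because plain `Array`s go through Atomix's generic CPU path, this variant does not exercise AtomixMetal at all; it only checks the indexing and the expected final counts.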

Finally - does the oneAPI backend support UnsafeAtomicsLLVM directly, or do we need a similar AtomixoneAPI? (my ancient Intel machine decided to kick the bucket today and I can't test it)

@maleadt
Member

maleadt commented Nov 11, 2024

> Finally - does the oneAPI backend support UnsafeAtomicsLLVM directly, or do we need a similar AtomixoneAPI? (my ancient Intel machine decided to kick the bucket today and I can't test it)

oneAPI.jl and OpenCL.jl use SPIRVIntrinsics.jl, which (contrary to its name) currently relies on OpenCL-style atomics, which are explicit function calls detected by the back-end: https://github.com/JuliaGPU/OpenCL.jl/blob/master/lib/intrinsics/src/atomic.jl
So we'll need a specific Atomix extension as well.

@christiangnrd
Contributor

christiangnrd commented Nov 12, 2024

> I'll have a PR for extensions within a couple hours.
>
> Unless it's already being worked on. Then I won't bother finishing it.

OK, I didn't have the free time I expected to work on this, and I don't know when I'll be able to get to it, so someone else should probably pick this up if they need it.

@anicusan
Contributor Author

@christiangnrd I made a pull request for this: #42
Refactored the libs into extensions and added a oneAPI backend; the GPU CI passes.
