TinyKernels.jl provides a tiny abstraction for GPU (and CPU) kernels, with full support for the CUDA (Nvidia) and ROCm (AMD) backends, limited support for the Metal backend (GPU programming on macOS ARM), and multi-threaded CPU execution.
TinyKernels.jl is mostly a heavily stripped-down version of KernelAbstractions.jl that supports only the bare minimum of features. This package provides a sandbox for Julia GPU tooling and for measuring kernel performance in a GPU-agnostic way. While the API of KernelAbstractions.jl is in a "transient" state, this package will provide a thin abstraction layer on top of the CUDA.jl, AMDGPU.jl and Metal.jl packages.
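To make the idea of a thin, device-agnostic kernel layer concrete, here is a minimal sketch in plain Julia. It is not the TinyKernels.jl API: `AbstractDevice`, `ThreadedCPU`, `launch!` and `triad!` are hypothetical names chosen for illustration, and only a multi-threaded CPU "device" is shown. The point is that the kernel body is written once per index and the launcher decides how the index range is executed.

```julia
# Hypothetical illustration in plain Julia; none of these names are TinyKernels.jl API.
using Base.Threads

abstract type AbstractDevice end
struct ThreadedCPU <: AbstractDevice end   # stand-in for a CPU backend

# Run `kernel!` once per index of `ndrange`, spread over the available threads.
function launch!(::ThreadedCPU, kernel!, args...; ndrange)
    indices = CartesianIndices(ndrange)
    @threads for I in indices
        kernel!(I, args...)
    end
    return nothing
end

# Kernel body: written once, independent of the device that executes it.
function triad!(I, A, B, C, s)
    @inbounds A[I] = B[I] + s * C[I]
    return nothing
end

B = rand(Float32, 64, 64)
C = rand(Float32, 64, 64)
A = zeros(Float32, 64, 64)
launch!(ThreadedCPU(), triad!, A, B, C, 2.0f0; ndrange = size(A))
```

A GPU backend would, under the same assumption, add another `launch!` method for its own device type and leave `triad!` untouched.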
TinyKernels.jl allows GPU kernels to be launched explicitly and asynchronously on different streams or queues with a given priority. This feature facilitates overlapping computations with memory transfers in distributed configurations.
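The launch-then-wait pattern behind this overlap can be sketched on the CPU with Julia tasks. This is only an analogy, not TinyKernels.jl or GPU code: a `Task` stands in for the event returned by an asynchronous launch, `copyto!` stands in for a memory transfer, and tasks have no notion of stream priority. The names `compute_inner!`, `halo` and `sendbuf` are hypothetical.

```julia
# CPU analogy only: a Task plays the role of the event returned by an asynchronous launch.
using Base.Threads

# Stand-in for a compute kernel working on the interior of the domain.
function compute_inner!(A)
    for i in eachindex(A)
        @inbounds A[i] = 2 * A[i]
    end
    return nothing
end

A       = ones(Float32, 1024)
halo    = fill(3.0f0, 16)      # boundary data to "send"
sendbuf = similar(halo)

event = Threads.@spawn compute_inner!(A)   # returns immediately, work proceeds in the background
copyto!(sendbuf, halo)                     # stand-in for a transfer that overlaps the compute
wait(event)                                # synchronize before reading A
```

The transfer touches different memory than the kernel, which is what makes it safe to run both before the single `wait`.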
TinyKernels.jl supports automatic differentiation with Enzyme.jl by overloading the Enzyme.autodiff function to enable reverse-mode AD of GPU (and CPU) kernels.
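As a rough picture of what reverse-mode AD of a kernel produces, the sketch below writes the reverse pass of a simple kernel by hand. It does not call Enzyme.jl or TinyKernels.jl; `scale_add!` and `scale_add_reverse!` are hypothetical names, and an AD tool would generate the reverse pass rather than have it written manually. Each input array gets a "shadow" array that accumulates the sensitivity of the output with respect to that input.

```julia
# Hand-written adjoint for illustration; an AD tool would derive the reverse pass automatically.

# Forward kernel: A[i] = B[i] + s * C[i]
function scale_add!(A, B, C, s)
    for i in eachindex(A)
        @inbounds A[i] = B[i] + s * C[i]
    end
    return nothing
end

# Reverse pass: given the adjoint dA of the output, accumulate the adjoints of the inputs.
function scale_add_reverse!(dA, dB, dC, s)
    for i in eachindex(dA)
        @inbounds dB[i] += dA[i]       # dA[i]/dB[i] = 1
        @inbounds dC[i] += s * dA[i]   # dA[i]/dC[i] = s
    end
    return nothing
end

n = 8
s = 2.0f0
A = zeros(Float32, n); B = rand(Float32, n); C = rand(Float32, n)
scale_add!(A, B, C, s)

dA = ones(Float32, n)    # seed: sensitivity of the result to each A[i]
dB = zeros(Float32, n)   # shadow of B
dC = zeros(Float32, n)   # shadow of C
scale_add_reverse!(dA, dB, dC, s)
```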
Preliminary benchmarks can be found in TinyBenchmarks.jl, and a Metal playground in MetalGPU.
Stay tuned 🚀
- AMDGPU ≥ v0.4.8
- CUDA ≥ v3.13
- Metal ≥ v0.3.0
- Only Float32 is supported. For Float64, one could try using a construct from DoubleFloats.jl, which may impact performance.
- Automatic differentiation (AD) capabilities (Enzyme.jl) currently do not work on ARM GPU (Metal) and give erroneous results on ARM CPU.
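Because Julia defaults to Float64 for floating-point literals and for `rand`, it is easy to pass Float64 data by accident where only Float32 is supported. The following sketch uses only Base Julia (no GPU package) and shows how to keep data in Float32; the variable names are illustrative.

```julia
# Julia defaults to Float64, so request Float32 explicitly for arrays and literals.
nx, ny = 32, 32
A = zeros(Float32, nx, ny)
B = rand(Float32, nx, ny)
s = 1.5f0                  # Float32 literal (1.5 would be Float64)

# Convert existing Float64 data before handing it to a Float32-only backend.
C64 = rand(nx, ny)         # Float64 by default
C   = Float32.(C64)

# Mixed arithmetic promotes to Float64, which is easy to miss.
promoted = B .* 1.5        # element type becomes Float64
kept     = B .* s          # element type stays Float32
```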