Merge pull request #175 from GiackAloZ/metal
Add support for `Metal` backend
omlins authored Oct 30, 2024
2 parents c1e2e69 + 7179816 commit 0390670
Showing 36 changed files with 790 additions and 266 deletions.
5 changes: 4 additions & 1 deletion Project.toml
@@ -13,19 +13,22 @@ StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
 AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
 CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
+Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
 Polyester = "f517fe37-dbe3-4b94-8317-1923a5111588"
 
 [extensions]
 ParallelStencil_AMDGPUExt = "AMDGPU"
 ParallelStencil_CUDAExt = "CUDA"
 ParallelStencil_EnzymeExt = "Enzyme"
+ParallelStencil_MetalExt = "Metal"
 
 [compat]
 AMDGPU = "0.6, 0.7, 0.8, 0.9, 1"
 CUDA = "3.12, 4, 5"
 CellArrays = "0.3"
 Enzyme = "0.11, 0.12, 0.13"
 MacroTools = "0.5"
+Metal = "1.2"
 Polyester = "0.7"
 StaticArrays = "1"
 julia = "1.10" # Minimum version supporting Data module creation
@@ -35,4 +38,4 @@ TOML = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test", "TOML", "AMDGPU", "CUDA", "Enzyme", "Polyester"]
+test = ["Test", "TOML", "AMDGPU", "CUDA", "Metal", "Enzyme", "Polyester"]
5 changes: 3 additions & 2 deletions README.md
@@ -7,7 +7,7 @@ ParallelStencil empowers domain scientists to write architecture-agnostic high-l
 
 <a id="fig_teff">![Performance ParallelStencil Teff](docs/images/perf_ps2.png)</a>
 
-ParallelStencil relies on the native kernel programming capabilities of [CUDA.jl] and [AMDGPU.jl] and on [Base.Threads] for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with [ImplicitGlobalGrid.jl], which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs \[[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19], [4][PASC19]\]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with [ImplicitGlobalGrid.jl] for simplest possible usage by domain-scientists, rendering fast and interactive development of massively scalable high performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and [ImplicitGlobalGrid.jl] \[[1][JuliaCon20a]\]. ParallelStencil's feature to hide communication behind computation was showcased when a close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint Supercomputer \[[1][JuliaCon20a]\]:
+ParallelStencil relies on the native kernel programming capabilities of [CUDA.jl], [AMDGPU.jl], and [Metal.jl], and on [Base.Threads] for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with [ImplicitGlobalGrid.jl], which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs \[[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19], [4][PASC19]\]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with [ImplicitGlobalGrid.jl] for simplest possible usage by domain-scientists, rendering fast and interactive development of massively scalable high performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and [ImplicitGlobalGrid.jl] \[[1][JuliaCon20a]\]. ParallelStencil's feature to hide communication behind computation was showcased when a close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint Supercomputer \[[1][JuliaCon20a]\]:
 
 ![Parallel efficiency of ParallelStencil with CUDA C backend](docs/images/par_eff_c_julia2.png)
 
@@ -33,7 +33,7 @@ Beyond traditional high-performance computing, ParallelStencil supports automati
 * [References](#references)
 
 ## Parallelization and optimization with one macro call
-A simple call to `@parallel` is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined in a call to `@init_parallel_stencil` beforehand. Supported are [CUDA.jl] and [AMDGPU.jl] for running on GPU and [Base.Threads] for CPU. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of [CUDA.jl] underneath (omitted lines are represented with `#(...)`, omitted arguments with `...`):
+A simple call to `@parallel` is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined in a call to `@init_parallel_stencil` beforehand. Supported are [CUDA.jl], [AMDGPU.jl], and [Metal.jl] for running on GPU and [Base.Threads] for CPU. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of [CUDA.jl] underneath (omitted lines are represented with `#(...)`, omitted arguments with `...`):
 ```julia
 #(...)
 @init_parallel_stencil(CUDA,...)
@@ -554,6 +554,7 @@ Please open an issue to discuss your idea for a contribution beforehand. Further
 [CellArrays.jl]: https://github.com/omlins/CellArrays.jl
 [CUDA.jl]: https://github.com/JuliaGPU/CUDA.jl
 [AMDGPU.jl]: https://github.com/JuliaGPU/AMDGPU.jl
+[Metal.jl]: https://github.com/JuliaGPU/Metal.jl
 [Enzyme.jl]: https://github.com/EnzymeAD/Enzyme.jl
 [MacroTools.jl]: https://github.com/FluxML/MacroTools.jl
 [StaticArrays.jl]: https://github.com/JuliaArrays/StaticArrays.jl
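For orientation, here is a minimal sketch of what the new backend looks like from the user side, mirroring the README's CUDA example above. This is not part of the commit; the kernel, grid size, and argument values are illustrative, and `Float32` is chosen because Metal GPUs do not support `Float64`:

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Metal, Float32, 3)   # select the new Metal backend

# A 3-D heat-diffusion step in ParallelStencil's math-close notation.
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2))
    return
end

nx = ny = nz = 64
T  = @rand(nx, ny, nz)                      # allocated as Metal GPU arrays
T2 = copy(T)
Ci = @ones(nx, ny, nz)
@parallel diffusion3D_step!(T2, T, Ci, 1.0f0, 1f-4, 0.1f0, 0.1f0, 0.1f0)
```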
4 changes: 4 additions & 0 deletions ext/ParallelStencil_MetalExt.jl
@@ -0,0 +1,4 @@
+module ParallelStencil_MetalExt
+include(joinpath(@__DIR__, "..", "src", "ParallelKernel", "MetalExt", "shared.jl"))
+include(joinpath(@__DIR__, "..", "src", "ParallelKernel", "MetalExt", "allocators.jl"))
+end
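These four lines are the whole entry point of the extension: Julia's package-extension mechanism loads `ParallelStencil_MetalExt` automatically once both ParallelStencil and Metal.jl are loaded, because Project.toml (see above) maps the weak dependency `Metal` to this module. A hypothetical session illustrating this, not part of the commit:

```julia
using ParallelStencil            # Metal support stays dormant so far
using Metal                      # loading the weak dependency triggers the extension

# The extension module should now be resolvable (Julia >= 1.9 API):
Base.get_extension(ParallelStencil, :ParallelStencil_MetalExt)   # -> a Module

@init_parallel_stencil(Metal, Float32, 2)
A = @zeros(16, 16)               # with this backend, backed by a Metal GPU array
```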
56 changes: 28 additions & 28 deletions src/FiniteDifferences.jl
@@ -55,7 +55,7 @@ macro d2(A) @expandargs(A); esc(:( ($A[$ixi+1] - $A[$ixi]) - ($A[$ixi] -
 macro all(A) @expandargs(A); esc(:( $A[$ix ] )) end
 macro inn(A) @expandargs(A); esc(:( $A[$ixi ] )) end
 macro av(A) @expandargs(A); esc(:(($A[$ix] + $A[$ix+1] )*0.5 )) end
-macro harm(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix] + 1.0/$A[$ix+1])*2.0 )) end
+macro harm(A) @expandargs(A); esc(:( inv(inv($A[$ix]) + inv($A[$ix+1]))*2.0 )) end
 macro maxloc(A) @expandargs(A); esc(:( max( max($A[$ixi-1], $A[$ixi+1]), $A[$ixi] ) )) end
 macro minloc(A) @expandargs(A); esc(:( min( min($A[$ixi-1], $A[$ixi+1]), $A[$ixi] ) )) end

@@ -172,11 +172,11 @@ macro av_xa(A) @expandargs(A); esc(:(($A[$ix ,$iy ] + $A[$ix+1,$iy ] )*0
 macro av_ya(A) @expandargs(A); esc(:(($A[$ix ,$iy ] + $A[$ix ,$iy+1] )*0.5 )) end
 macro av_xi(A) @expandargs(A); esc(:(($A[$ix ,$iyi ] + $A[$ix+1,$iyi ] )*0.5 )) end
 macro av_yi(A) @expandargs(A); esc(:(($A[$ixi ,$iy ] + $A[$ixi ,$iy+1] )*0.5 )) end
-macro harm(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ] + 1.0/$A[$ix+1,$iy ] + 1.0/$A[$ix,$iy+1] + 1.0/$A[$ix+1,$iy+1])*4.0 )) end
-macro harm_xa(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ] + 1.0/$A[$ix+1,$iy ] )*2.0 )) end
-macro harm_ya(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ] + 1.0/$A[$ix ,$iy+1] )*2.0 )) end
-macro harm_xi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iyi ] + 1.0/$A[$ix+1,$iyi ] )*2.0 )) end
-macro harm_yi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ixi ,$iy ] + 1.0/$A[$ixi ,$iy+1] )*2.0 )) end
+macro harm(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ]) + inv($A[$ix+1,$iy ]) + inv($A[$ix,$iy+1]) + inv($A[$ix+1,$iy+1]))*4.0 )) end
+macro harm_xa(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ]) + inv($A[$ix+1,$iy ]))*2.0 )) end
+macro harm_ya(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ]) + inv($A[$ix ,$iy+1]))*2.0 )) end
+macro harm_xi(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iyi ]) + inv($A[$ix+1,$iyi ]))*2.0 )) end
+macro harm_yi(A) @expandargs(A); esc(:( inv(inv($A[$ixi ,$iy ]) + inv($A[$ixi ,$iy+1]))*2.0 )) end
 macro maxloc(A) @expandargs(A); esc(:( max( max( max($A[$ixi-1,$iyi ], $A[$ixi+1,$iyi ]) , $A[$ixi ,$iyi ] ),
 max($A[$ixi ,$iyi-1], $A[$ixi ,$iyi+1]) ) )) end
 macro minloc(A) @expandargs(A); esc(:( min( min( min($A[$ixi-1,$iyi ], $A[$ixi+1,$iyi ]) , $A[$ixi ,$iyi ] ),
@@ -361,28 +361,28 @@ macro av_xzi(A) @expandargs(A); esc(:(($A[$ix ,$iyi ,$iz ] + $A[$ix+1,$iyi
 $A[$ix ,$iyi ,$iz+1] + $A[$ix+1,$iyi ,$iz+1] )*0.25 )) end
 macro av_yzi(A) @expandargs(A); esc(:(($A[$ixi ,$iy ,$iz ] + $A[$ixi ,$iy+1,$iz ] +
 $A[$ixi ,$iy ,$iz+1] + $A[$ixi ,$iy+1,$iz+1] )*0.25 )) end
-macro harm(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix+1,$iy ,$iz ] +
-1.0/$A[$ix+1,$iy+1,$iz ] + 1.0/$A[$ix+1,$iy+1,$iz+1] +
-1.0/$A[$ix ,$iy+1,$iz+1] + 1.0/$A[$ix ,$iy ,$iz+1] +
-1.0/$A[$ix+1,$iy ,$iz+1] + 1.0/$A[$ix ,$iy+1,$iz ] )*8.0)) end
-macro harm_xa(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix+1,$iy ,$iz ] )*2.0 )) end
-macro harm_ya(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix ,$iy+1,$iz ] )*2.0 )) end
-macro harm_za(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix ,$iy ,$iz+1] )*2.0 )) end
-macro harm_xi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iyi ,$izi ] + 1.0/$A[$ix+1,$iyi ,$izi ] )*2.0 )) end
-macro harm_yi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ixi ,$iy ,$izi ] + 1.0/$A[$ixi ,$iy+1,$izi ] )*2.0 )) end
-macro harm_zi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ixi ,$iyi ,$iz ] + 1.0/$A[$ixi ,$iyi ,$iz+1] )*2.0 )) end
-macro harm_xya(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix+1,$iy ,$iz ] +
-1.0/$A[$ix ,$iy+1,$iz ] + 1.0/$A[$ix+1,$iy+1,$iz ] )*4.0 )) end
-macro harm_xza(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix+1,$iy ,$iz ] +
-1.0/$A[$ix ,$iy ,$iz+1] + 1.0/$A[$ix+1,$iy ,$iz+1] )*4.0 )) end
-macro harm_yza(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$iz ] + 1.0/$A[$ix ,$iy+1,$iz ] +
-1.0/$A[$ix ,$iy ,$iz+1] + 1.0/$A[$ix ,$iy+1,$iz+1] )*4.0 )) end
-macro harm_xyi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iy ,$izi ] + 1.0/$A[$ix+1,$iy ,$izi ] +
-1.0/$A[$ix ,$iy+1,$izi ] + 1.0/$A[$ix+1,$iy+1,$izi ] )*4.0 )) end
-macro harm_xzi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ix ,$iyi ,$iz ] + 1.0/$A[$ix+1,$iyi ,$iz ] +
-1.0/$A[$ix ,$iyi ,$iz+1] + 1.0/$A[$ix+1,$iyi ,$iz+1] )*4.0 )) end
-macro harm_yzi(A) @expandargs(A); esc(:(1.0/(1.0/$A[$ixi ,$iy ,$iz ] + 1.0/$A[$ixi ,$iy+1,$iz ] +
-1.0/$A[$ixi ,$iy ,$iz+1] + 1.0/$A[$ixi ,$iy+1,$iz+1] )*4.0 )) end
+macro harm(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix+1,$iy ,$iz ]) +
+inv($A[$ix+1,$iy+1,$iz ]) + inv($A[$ix+1,$iy+1,$iz+1]) +
+inv($A[$ix ,$iy+1,$iz+1]) + inv($A[$ix ,$iy ,$iz+1]) +
+inv($A[$ix+1,$iy ,$iz+1]) + inv($A[$ix ,$iy+1,$iz ]) )*8.0 )) end
+macro harm_xa(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix+1,$iy ,$iz ]) )*2.0 )) end
+macro harm_ya(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix ,$iy+1,$iz ]) )*2.0 )) end
+macro harm_za(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix ,$iy ,$iz+1]) )*2.0 )) end
+macro harm_xi(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iyi ,$izi ]) + inv($A[$ix+1,$iyi ,$izi ]) )*2.0 )) end
+macro harm_yi(A) @expandargs(A); esc(:( inv(inv($A[$ixi ,$iy ,$izi ]) + inv($A[$ixi ,$iy+1,$izi ]) )*2.0 )) end
+macro harm_zi(A) @expandargs(A); esc(:( inv(inv($A[$ixi ,$iyi ,$iz ]) + inv($A[$ixi ,$iyi ,$iz+1]) )*2.0 )) end
+macro harm_xya(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix+1,$iy ,$iz ]) +
+inv($A[$ix ,$iy+1,$iz ]) + inv($A[$ix+1,$iy+1,$iz ]) )*4.0 )) end
+macro harm_xza(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix+1,$iy ,$iz ]) +
+inv($A[$ix ,$iy ,$iz+1]) + inv($A[$ix+1,$iy ,$iz+1]) )*4.0 )) end
+macro harm_yza(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$iz ]) + inv($A[$ix ,$iy+1,$iz ]) +
+inv($A[$ix ,$iy ,$iz+1]) + inv($A[$ix ,$iy+1,$iz+1]) )*4.0 )) end
+macro harm_xyi(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iy ,$izi ]) + inv($A[$ix+1,$iy ,$izi ]) +
+inv($A[$ix ,$iy+1,$izi ]) + inv($A[$ix+1,$iy+1,$izi ]) )*4.0 )) end
+macro harm_xzi(A) @expandargs(A); esc(:( inv(inv($A[$ix ,$iyi ,$iz ]) + inv($A[$ix+1,$iyi ,$iz ]) +
+inv($A[$ix ,$iyi ,$iz+1]) + inv($A[$ix+1,$iyi ,$iz+1]) )*4.0 )) end
+macro harm_yzi(A) @expandargs(A); esc(:( inv(inv($A[$ixi ,$iy ,$iz ]) + inv($A[$ixi ,$iy+1,$iz ]) +
+inv($A[$ixi ,$iy ,$iz+1]) + inv($A[$ixi ,$iy+1,$iz+1]) )*4.0 )) end
 macro maxloc(A) @expandargs(A); esc(:( max( max( max( max($A[$ixi-1,$iyi ,$izi ], $A[$ixi+1,$iyi ,$izi ]) , $A[$ixi ,$iyi ,$izi ] ),
 max($A[$ixi ,$iyi-1,$izi ], $A[$ixi ,$iyi+1,$izi ]) ),
 max($A[$ixi ,$iyi ,$izi-1], $A[$ixi ,$iyi ,$izi+1]) ) )) end
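All 28 replaced lines in this file follow one pattern: `1.0/x` becomes `inv(x)`. A plausible motivation, given the backend added by this PR, is type stability on `Float32`-only hardware: the literal `1.0` is a `Float64` and promotes `Float32` operands, whereas `inv` preserves the operand's type (the remaining `*2.0`-style literals are presumably converted to the selected number type by ParallelStencil's kernel machinery). A plain-Julia sketch of the difference, not part of the commit:

```julia
a, b = 3.0f0, 4.0f0                 # Float32 values, as required on Metal GPUs

typeof(1.0/(1.0/a + 1.0/b))         # Float64: the literal 1.0 promotes the result
typeof(inv(inv(a) + inv(b)))        # Float32: the operands' type is preserved

# Both forms compute the same harmonic mean of a and b:
inv(inv(a) + inv(b))*2 ≈ 2a*b/(a + b)   # true (24/7 ≈ 3.4285715)
```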
[Diff for the remaining 32 changed files is not shown here.]
