
metal backend ? #179

Closed
LaurentPlagne opened this issue Nov 5, 2024 · 14 comments
@LaurentPlagne

Hi,

Considering that Apple GPUs are the only devices offering very large, fast unified RAM (up to 128 GB) at a (relatively) affordable price,
I wonder how much work would be necessary to add a Metal backend to PS.

I also wonder whether the best way to achieve that would be to use a KernelAbstractions.jl backend or to extend the current PS collection of backends.

Any thoughts?

Laurent

@albert-de-montserrat
Contributor

albert-de-montserrat commented Nov 5, 2024

Metal support has recently been merged in #175. It should be available in the next release.

@LaurentPlagne
Author

Great! I have just tested it (on dev) and it seems to run OK!
The performance is a bit lower than I expected (it seems to improve with the size of the problem).
I wonder about the use of FP64 literals in some FD macro definitions, like
macro av(A) @expandargs(A); esc(:(($A[$ixi-1,$iyi-1] + $A[$ixi,$iyi-1] + $A[$ixi-1,$iyi] + $A[$ixi,$iyi])*0.25 )) end
Could this lead to costly conversions when dealing with FP32?
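For context, the concern is that in Julia a Float64 literal promotes Float32 arithmetic to Float64, which would double the memory traffic per value. A minimal plain-Julia illustration (variable names are illustrative):

```julia
# A Float64 literal (0.25) promotes a Float32 expression to Float64;
# an explicit Float32 literal (0.25f0) keeps the computation in Float32.
a = Float32[1.0f0, 2.0f0]

s64 = (a[1] + a[2]) * 0.25      # Float32 * Float64 promotes to Float64
s32 = (a[1] + a[2]) * 0.25f0    # stays Float32

println(typeof(s64))  # Float64
println(typeof(s32))  # Float32
```

As explained below in the thread, ParallelStencil converts literal types at parse time, so this promotion does not actually occur inside PS kernels.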

@albert-de-montserrat
Contributor

You could try manually writing the kernel with @parallel_indices instead, explicitly writing (A[ixi-1,iyi-1] + A[ixi,iyi-1] + A[ixi-1,iyi] + A[ixi,iyi]) / 4. Does this improve the performance?
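To make the arithmetic of the suggestion concrete, here is a plain-Julia sketch of that explicit 4-point average (outside ParallelStencil, so it is runnable anywhere; in PS the loop body would instead live in a @parallel_indices kernel, and the function name av! is illustrative):

```julia
# Explicit 2D 4-point average with integer division by 4, which preserves
# the element type T (Float32 stays Float32): no FP64 literal involved.
function av!(Av::AbstractMatrix{T}, A::AbstractMatrix{T}) where {T}
    for iy in 1:size(A, 2)-1, ix in 1:size(A, 1)-1
        Av[ix, iy] = (A[ix, iy] + A[ix+1, iy] + A[ix, iy+1] + A[ix+1, iy+1]) / 4
    end
    return Av
end
```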

@LaurentPlagne
Author

I will try this!

@omlins
Owner

omlins commented Nov 5, 2024

Great ! I have just test it (dev) and it seems to run ok !
The performance is a bit lower than what I expected (seems to improve with the size of the problem).
I wonder about the use of FP64 literals in some FD macro definitions like
macro av(A) @expandargs(A); esc(:(($A[$ixi-1,$iyi-1] + $A[$ixi,$iyi-1] + $A[$ixi-1,$iyi] + $A[$ixi,$iyi])*0.25 )) end
Could this lead to costly conversions when dealing with FP32 ?

While this seems to be hard-coded, in reality it is not: literal types are converted to the right type before the kernel expression is given back to the user (at parse time). To keep the handling of literal types simple, as a user you only need to distinguish between floats and integers: write, for example, 1.0 for floats and 1 for integers.

@LaurentPlagne
Author

Yes, I confirm that the performance does not change with FP32 literals.
I obtain a 191 GB/s effective bandwidth for a 512^3 problem (acoustic wave with a variable velocity field).
I was a bit disappointed because the bandwidth reaches 360 GB/s for a simple array copy on this machine (M1 Max).

Thanks again for the Metal backend !

@luraess
Collaborator

luraess commented Nov 5, 2024

That is about 53% of peak for your application. What is the percentage on, e.g., an NVIDIA GPU? Maybe tuning the kernel launch parameters could help a bit on Metal? Although, 53% isn't that bad ;)
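The 53% figure follows directly from the two numbers quoted in the thread (191 GB/s measured vs. 360 GB/s for an array copy on the M1 Max):

```julia
# Fraction of achievable (copy) bandwidth reached by the application.
T_eff  = 191.0   # measured effective bandwidth, GB/s
T_peak = 360.0   # array-copy bandwidth on this M1 Max, GB/s

frac = T_eff / T_peak
println(round(Int, 100 * frac))  # 53
```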

@LaurentPlagne
Author

Yes, it is not bad for this size (512^3).
I do not remember what the percentage was on my NVIDIA board (I will check).
The thing is that the multithreaded CPU is rather good on Apple silicon, so the GPU speed-up is less impressive than going from an Intel CPU to an NVIDIA GPU ;)

I do not remember how to tune the kernel launch parameters with PS.

@luraess
Collaborator

luraess commented Nov 5, 2024

the multithreaded CPU is rather good on Apple silicon

Fair point!

how to tune the kernel launch parameters with PS.

From ?@parallel in the REPL: @parallel nblocks nthreads kernelcall can be used to launch with tuned parameters. On Metal, the heuristics seem to be the same as for CUDA, i.e., the max number of threads per block is 256 and threads.x is 32, see

const NTHREADS_X_MAX = 32

The remaining thread and block sizes are then inferred from

function compute_nthreads(maxsize; nthreads_x_max=NTHREADS_X_MAX, nthreads_max=NTHREADS_MAX, flatdim=0) # This is a heuristic, which results in (32,8,1) threads, except if maxsize[1] < 32 or maxsize[2] < 8.
    maxsize = promote_maxsize(maxsize)
    nthreads_x = min(nthreads_x_max, (flatdim==1) ? 1 : maxsize[1])
    nthreads_y = min(ceil(Int,nthreads_max/nthreads_x), (flatdim==2) ? 1 : maxsize[2])
    nthreads_z = min(ceil(Int,nthreads_max/(nthreads_x*nthreads_y)), (flatdim==3) ? 1 : maxsize[3])
    return (nthreads_x, nthreads_y, nthreads_z)
end

function compute_nblocks(maxsize, nthreads)
    maxsize = promote_maxsize(maxsize)
    if !(isa(nthreads, Union{AbstractArray,Tuple}) && length(nthreads)==3) @ArgumentError("nthreads must be an Array or Tuple of size 3 (obtained: $nthreads; its type is: $(typeof(nthreads))).") end
    return ceil.(Int, maxsize./nthreads)
end
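To see what this heuristic yields for the 512^3 problem discussed above, here is a simplified, self-contained re-implementation (it assumes NTHREADS_X_MAX = 32 and NTHREADS_MAX = 256 as stated in the thread, and omits promote_maxsize, the flatdim option, and argument checking):

```julia
# Simplified sketch of the launch-parameter heuristic quoted above.
const NTHREADS_X_MAX = 32   # threads.x (same heuristic as for CUDA)
const NTHREADS_MAX   = 256  # max threads per block

function compute_nthreads(maxsize; nthreads_x_max=NTHREADS_X_MAX, nthreads_max=NTHREADS_MAX)
    nthreads_x = min(nthreads_x_max, maxsize[1])
    nthreads_y = min(ceil(Int, nthreads_max / nthreads_x), maxsize[2])
    nthreads_z = min(ceil(Int, nthreads_max / (nthreads_x * nthreads_y)), maxsize[3])
    return (nthreads_x, nthreads_y, nthreads_z)
end

compute_nblocks(maxsize, nthreads) = ceil.(Int, maxsize ./ nthreads)

nthreads = compute_nthreads((512, 512, 512))            # (32, 8, 1)
nblocks  = compute_nblocks((512, 512, 512), nthreads)   # (16, 64, 512)
```

Overriding these via @parallel nblocks nthreads kernelcall is what allows tuning per device.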

@omlins it may be interesting to have the per-backend heuristics defined in the extensions, now that we have those?

@LaurentPlagne
Author

Thanks!
I will try to tune the parameters for smaller sizes. The perf is actually quite good, and the CPU counterpart is actually 5 times slower ;)

@omlins
Owner

omlins commented Nov 5, 2024

@LaurentPlagne: unfortunately, I don't have access to a Metal-capable GPU to test and tune myself.
@GiackAloZ: have you found any theoretical information, or gained any experience, that would be useful for defining the heuristics for the Metal backend?

@omlins
Owner

omlins commented Nov 5, 2024

and @albert-de-montserrat ?

@LaurentPlagne
Author

@omlins: I can run the tests you provide on my machine if that helps.
@GiackAloZ: I enjoyed your presentation at JuliaCon 2024 a lot. Our simulations look similar (but simpler) to what you presented. Would you be available for a quick talk any time soon?

@GiackAloZ
Contributor

GiackAloZ commented Nov 6, 2024

have you found any theoretical information or made any experience that is useful to define the heuristics for the Metal backend?

@omlins I have not dug deep enough into that as of now, but I could try to gather some info and then come up with a heuristic. I can also run the tests, of course.

I have enjoyed a lot your presentation at JuliaCon 2024. Our simulations looks similar (but simpler) to what you have presented. Would you be available for a quick talk any time soon ?

@LaurentPlagne Thanks a lot! We can surely chat; contact me via the email address you can find on my GH profile page.
