MPI distributed parallelism #590
Conversation
Yes! I have just made this easier in JuliaParallel/MPI.jl#329: you will need the master version of MPI.jl for now, but I'd appreciate it if you could try it out. The simplest option is to pass a …
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #590       +/-   ##
===========================================
+ Coverage   55.78%   68.39%   +12.61%
===========================================
  Files         171       72       -99
  Lines        4005     2338     -1667
===========================================
- Hits         2234     1599      -635
+ Misses       1771      739     -1032
===========================================
```
Continue to review full report at Codecov.
@ali-ramadhan I am trying to follow the current state of affairs of this PR but am having some difficulties. Can you maybe give me a brief summary?
@francispoulin Sorry for the neglected state of this PR. It's in a half-baked state and I keep meaning to revisit it. You can decompose the domain into x, y, z cubes/slabs/pencils/etc. Then "halo communication boundary conditions" are injected on edges where the model needs to communicate with another rank, and communication occurs as part of `fill_halo_regions!`.

Hoping to revisit soon. I don't think it should be too hard to get this PR to work for a shallow water model or compressible model (the pressure solver will be the hard part of an MPI incompressible model).
Thanks @ali-ramadhan for the summary. Very helpful, and it sounds like there's a lot there to play with. I saw a link here that looks at MPI scaling in Julia for the 2D diffusion equation. I presume it would be easy to do a similar test with Oceananigans in 2D or 3D? What I find especially interesting is that right before the conclusions they compare with Fortran and C and, if I'm reading this correctly, Julia seems to be faster in serial and on two cores. Impressive!

P.S. You seem to be very busy nowadays so please don't let me add to your already full plate.
(force-pushed from 5f8aa56 to efd410d)
Progress update

I decided to take a stab at the simplest case: triply-periodic on the CPU. Surprisingly, I was able to get a distributed 2D turbulence example working. The PR is still a work-in-progress so it's a bit messy; the purpose was to demonstrate a proof of concept. MPI.jl and PencilFFTs.jl are new dependencies but I haven't updated the Project.toml yet. So far this PR adds some new infrastructure: …
I also added some simple tests for multi-architecture rank connectivity, local grid construction, injection of halo communication BCs, and halo communication (testing x, y, and z slab decompositions). Also added tests for the distributed Poisson solver ensuring the solution is divergence-free. Next step for testing would probably be to test that the code handles ….

Some notes

Domain decomposition

Domain decomposition is supported and tested in x, y, and z. But for ….

Local topologies

The local grid topology may need to be different from the global/full grid topology. This isn't an issue for triply-periodic so I haven't thought about how to tackle it yet. How would topologies work for grids on a single rank (i.e. a piece of a larger grid)? Do we need a new …? In the case of being …, I guess if we consider communication as a boundary condition, then it makes sense to use ….

Distributed architecture types

What is the architecture of an …? If …. This might be needed for multiple dispatch.
Right now this PR requires the grid points to be split evenly among the ranks, e.g. if you have 4 ranks along the x-direction then the number of grid points in x must be divisible by 4.

@christophernhill suggested generalizing this. He's written some code for dividing nearly evenly into N subdomains when the global number of points is not exactly divisible by N: https://github.com/christophernhill/iap-2021-12.091/blob/e79dfe9dca5441e561cefd65b4c052b1a1dea5a3/step6.py#L46-58
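Since the linked code is Python, here's the usual nearly-even split as a quick Julia sketch (the function name `local_size` is made up, not from this PR):

```julia
# Nearly-even split of N grid points across R ranks (zero-based MPI ranks):
# every rank gets div(N, R) points, and the first mod(N, R) ranks get one extra.
function local_size(N, R, rank)
    base = div(N, R)     # minimum points per rank
    extra = mod(N, R)    # leftover points to hand out
    return rank < extra ? base + 1 : base
end

# e.g. 10 points over 4 ranks -> local sizes [3, 3, 2, 2]
[local_size(10, 4, r) for r in 0:3]
```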
@ali-ramadhan thanks a lot for this great work. I'm super excited to have this feature!
Sorry if I'm misunderstanding some things (I'm not very familiar with the pressure-solver implementations), but it looks to me like this isn't an issue for the pressure solver, right? (Based on the fact that you got a 2D turbulence example going.) If that's the case, then this limitation might be okay for now, as it would only affect cases where you'd want to parallelize in x or z.

Also (again, correct me if I'm missing something) from my talks with you guys about this, it seems that the primary goal of MPI-distributed parallelism is to run multi-GPU simulations, since GPUs have a relatively low memory limit. In this case, I think the way to use MPI is very different from traditional multi-CPU runs. That is, I think the important capabilities when distributing a simulation across a few GPUs are different from the important capabilities when distributing it across hundreds of CPU cores.
In principle I completely agree with this. Communicating with another sub-domain is treated exactly the same as you treat BCs, right? Which is to fill the halo region accordingly. The only difference being that the proper way to fill it now depends on other processes, instead of depending on the user definitions of the BCs. This is also true when a subdomain is, say, …
As long as you only decompose the domain in the y direction (dimension 2), then it should work right now. That's the current limitation, I guess; it should be the same for 1D, 2D, and 3D. So in the 2D turbulence example the domain is decomposed in y since that's the only decomposition currently supported.
Yeah, for sure. If you can only decompose in y and you want to run on hundreds of CPU ranks, then each rank gets a very thin domain, almost a slice. The surface-to-volume ratio of each rank is high, which means lots of communication overhead. If you're running a huge simulation on a moderate number of GPUs, then each rank gets a pretty thick slice, so the surface-to-volume ratio is smaller, leading to less communication overhead. Definitely agree that it's probably not a huge deal for now.
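To put rough numbers on the surface-to-volume argument (grid size and rank counts here are illustrative, not measurements from this thread):

```julia
# Ratio of communicated halo faces to interior volume for y-slab decomposition
# of an Nx x Ny x Nz grid: each rank exchanges two x-z faces with its neighbors.
Nx, Ny, Nz = 256, 256, 256

comm_ratio(ranks) = 2 * Nx * Nz / (Nx * (Ny ÷ ranks) * Nz)

comm_ratio(4)    # ≈ 0.031 -> thick slabs, little communication overhead
comm_ratio(128)  # = 1.0   -> 2-point-thick slabs, communication-dominated
```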
Thanks for the thoughts! I think using ….

Also, it would be good to avoid adding a new non-physical topology to do with halo communication haha.
Perhaps a silly question, but with the new MPI parallelism, will we be able to do a traditional multi-CPU run with Oceananigans? I presume so, but given that people are mostly talking about MPI supporting GPUs, I'm a bit confused.
Yes! Actually, only CPU + MPI is supported right now haha. GPU + MPI will require distributed FFTs across multiple GPUs, and unfortunately right now PencilFFTs.jl only provides distributed FFTs across multiple CPUs.

The benefit for multi-GPU parallelism was brought up since we are currently limited to decomposing the domain in the y-direction only. This is probably fine if you don't have too many ranks, so the current limitation favors multi-GPU performance over multi-CPU performance. For models that don't need distributed FFTs, e.g. …
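For context, a distributed CPU transform with PencilFFTs.jl looks roughly like this (adapted from the PencilFFTs.jl README; the grid size and the 2×2 process grid are made-up values, and the API may differ between versions). Run with e.g. `mpiexec -n 4 julia script.jl`:

```julia
using MPI
using PencilFFTs
using Random

MPI.Init()
comm = MPI.COMM_WORLD

dims = (64, 64, 64)             # global grid size (hypothetical)
transform = Transforms.RFFT()   # 3D real-to-complex transform
proc_dims = (2, 2)              # pencil decomposition over 4 MPI ranks

plan = PencilFFTPlan(dims, transform, proc_dims, comm)

u = allocate_input(plan)        # this rank's piece of the distributed input
rand!(u)

û = plan * u                    # forward distributed transform
v = plan \ û                    # inverse transform back to physical space
```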
That's great news @ali-ramadhan. I guess by looking at your code I can learn how to adapt it to …
I haven't done any scaling tests, that would be super useful! Been meaning to clean up this PR a little bit and integrate it into the main code (right now it's completely separate) so the PR can be merged and development can continue in future PRs. Should probably also rename …
Thanks @ali-ramadhan. Would it make sense for me to wait to do tests until after the merge, just in case anything changes along the way? I'm keen to play but can certainly wait as well.
I think the cleanup mostly involves moving stuff around in files, etc., so the functions themselves shouldn't change. Should be able to play around right now! Might just need to change the …
The changes you've made so far also look good to me. I see that one change swapped `grid` for `my_grid`. Not sure if that was a bug or not, but if so, glad you caught it.
Alright, I think this PR is ready to be merged. Extra change since yesterday: I think we can have a cleaner design by getting rid of ….

So now instead of dispatching on …
Yes, I think it was picking default boundary conditions based on the ….

EDIT: Also bumping to v0.53.0.
A few things discussed on Zoom with @ali-ramadhan:

- Move the "injection" of halo communication boundary conditions from `DistributedIncompressibleModel` to the `Field` constructor
- Make sure that halo region locations are named and ordered from left to right, e.g. (west, east, south, north, bottom, top); a sketch follows this list
- Use terminology "local" and "distributed" instead of "my" and "full" for operations in local memory versus distributed operations / data
- Add a docstring to `RankConnectivity` to explain what's going on
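For the second item, a minimal sketch of naming and ordering halo regions (the function name and layout are assumed for illustration, not the PR's actual implementation):

```julia
# Named halo regions for a 3D array `data` whose interior has size (Nx, Ny, Nz)
# and which carries a halo of width H on every side. Sides are ordered left to
# right in each direction: west/east, south/north, bottom/top.
function halo_views(data::AbstractArray{<:Any,3}, H::Int)
    Nx, Ny, Nz = size(data) .- 2H
    return (west   = view(data, 1:H,          :, :),
            east   = view(data, Nx+H+1:Nx+2H, :, :),
            south  = view(data, :, 1:H,          :),
            north  = view(data, :, Ny+H+1:Ny+2H, :),
            bottom = view(data, :, :, 1:H),
            top    = view(data, :, :, Nz+H+1:Nz+2H))
end

# e.g. the west halo of a 20^3 array with H = 2 has size (2, 20, 20)
size(halo_views(zeros(20, 20, 20), 2).west)
```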
Before merging it would also be good to do both a weak and a strong scaling analysis to understand where we currently stand.
@vchuravy How big do you think we should go? I guess use one node, then go up to 32 or 64 ranks or something? Or should we try doing some multi-node stuff? Also, should we use multithreading via KernelAbstractions.jl to fill up each node and run on multiple nodes? Right now we're limited to 1D/slab domain decompositions, so I don't think …
@vchuravy I like the idea of doing both weak and strong scaling analysis. Can you point out any examples that you think do this well that we could use as a guide?
Did a quick small strong scaling benchmark on Tartarus (256^3) up to 16 cores, but the results don't look super great: ~9.5x speedup on 16 cores. Better than multi-threading, though. Then again, maybe I'm not benchmarking properly. Could also be missing some MPI barriers. Should probably learn how to profile MPI code.
Awesome work @ali-ramadhan! Excited to finally have this in the code.

About the scaling: I'm not sure how you're calculating the stats exactly, but one thing to remember is that, given Julia's JIT compiling + all the MPI stuff, I'm guessing the start-up for these simulations is pretty significant and might be impacting your results, no? So I guess two ways to circumvent that are to (1) benchmark with pretty long simulations or (2) compile everything ahead of time with PackageCompiler.jl. I don't think we necessarily need to do this now, but it might be good to keep in mind for the future. Thanks again for the great work!
@ali-ramadhan I agree you've done a great job with this! One thing I might suggest we calculate is the efficiency. The way I have seen it before is the time for the serial run divided by the product of the number of cores and the total time for the MPI run. For these cases I find efficiencies of ….

The numbers are decreasing, but I suspect we can probably do better by looking to see where the bottlenecks are. For example, I wonder if we could compute the efficiency of just computing the tendencies, which should be pretty fast. Not sure how easy that is to do, though.
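For reference, the metric described above reduces to a one-liner; the timings below are placeholders, except for the ~9.5x speedup quoted earlier:

```julia
# Parallel efficiency: serial time divided by (number of ranks * parallel time),
# i.e. speedup divided by the number of ranks.
efficiency(t_serial, n_ranks, t_parallel) = t_serial / (n_ranks * t_parallel)

# e.g. the ~9.5x speedup on 16 cores reported above (arbitrary time units):
t_serial = 1.0
t_parallel = t_serial / 9.5
efficiency(t_serial, 16, t_parallel)  # ≈ 0.59
```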
That's true, but the benchmarking scripts call …. But yeah, I only ….

Never considered PackageCompiler.jl for benchmarking, but it makes sense here seeing as it's pretty easy to use!
Yes for sure. I'll open an issue to figure out how we should profile and benchmark Julia + MPI better.
Ah, the efficiency metric is great! I'll see if we can easily add it to the benchmark results table. It should be pretty easy to isolate a part of …
Them fighting words ;)
This PR takes a stab at designing a non-invasive interface to running Oceananigans on multiple CPUs and GPUs, i.e. distributed parallelism with MPI. By non-invasive I mean that no existing code will have to change. It may not be the best solution, but I really like it and I'm hoping we can discuss this design. I see no reason why it won't perform well.
Vision of this PR:

- … an `Oceananigans.Distributed` submodule.
- … `fill_halo_regions!`. There is no "master rank".
- … `HaloCommunication` boundary conditions wherever a submodel shares a halo with another rank. This is then dispatched on, so no need to modify existing code.
- … a `DistributedPressureSolver` struct that can be used to dispatch on `solve_for_pressure!`. So again, no need to modify existing code.

This way MPI does not invade the core code base, making it easier to maintain, and there will be a very clear boundary between "core Oceananigans" and "distributed parallelism functionality", which I think will serve us well in the future, as MPI seems to permeate deeply into other codes, making them hard to modify.
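To illustrate the dispatch idea, here is a minimal sketch; every name in it (`PeriodicBC`, `HaloCommunicationBC`, `fill_halo!`) is a hypothetical stand-in rather than this PR's actual code, and it assumes the keyword-argument API of recent MPI.jl versions:

```julia
using MPI

struct PeriodicBC end        # stand-in for an existing, purely local BC

struct HaloCommunicationBC
    neighbor_rank::Int       # rank owning the grid on the other side of this halo
end

# Existing local behavior stays untouched: periodic halos are filled by copying.
fill_halo!(halo, interior, ::PeriodicBC, comm) = copyto!(halo, interior)

# The new BC type dispatches to an MPI exchange with the neighboring rank.
function fill_halo!(halo, interior, bc::HaloCommunicationBC, comm)
    req = MPI.Isend(interior, comm; dest=bc.neighbor_rank, tag=0)
    MPI.Recv!(halo, comm; source=bc.neighbor_rank, tag=0)
    MPI.Wait(req)
    return halo
end
```

Because the MPI path lives entirely in the new method, code that loops over boundary conditions and calls `fill_halo!` never needs to know whether a halo is physical or shared with another rank.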
The big thing that is missing is of course the distributed pressure solver, the hard thing to implement. This is where DistributedTranspose.jl will come in handy. I also recently found PencilFFTs.jl, which looks interesting. cc @leios
For testing purposes, I'm tempted to do the pressure solve via an `MPI.Gather` onto rank 0, where it can be solved locally, then an `MPI.Scatter` to pass the pressure to all ranks. Super inefficient, but might be good to ensure that the `DistributedModel` can reproduce existing regression tests. A rough sketch of that gather/solve/scatter flow is below.
Performance issues:

- `MPI.Isend`, `MPI.Recv!`, and `MPI.SendRecv!` all expect send and receive buffers to be contiguous in memory, I believe. To get around this I allocate memory for these buffers, but this is definitely not performant. @vchuravy suggested that we may be able to send and receive into strided buffers, so will look into this. cc @simonbyrne, maybe you know more about this? A rough sketch of the current workaround follows this list.
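Roughly, the current workaround looks like this (function and argument names are made up for illustration; recent MPI.jl keyword-argument API assumed):

```julia
using MPI

# Copy a strided halo view into a contiguous buffer before sending, since the
# send buffer is assumed here to need contiguous memory. The buffer must stay
# alive until the request completes, so it is returned alongside the request.
function send_west_halo(field::Array{Float64,3}, H, dest, comm)
    halo_view = view(field, 1:H, :, :)   # strided view: not contiguous
    sendbuf = Array(halo_view)           # contiguous copy (the extra allocation)
    req = MPI.Isend(sendbuf, comm; dest=dest, tag=0)
    return req, sendbuf
end
```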
Quality of life features we may want in the future (which might affect design choices):

- … (e.g. an `MPI.Gather`) required across all ranks. I wonder if it's even worth thinking about them much. If we really need things like a `DistributedHorizontalAverage` then we can look into that.