Distributed inference via MPI #2099
Conversation
That's actually neat! I'm surprised that the change is so small. I saw the discussion in ggerganov/ggml#340. Let's try to make the following changes and see if we can make the implementation more decoupled from …
In short, you can follow the …
Done; but how/when should this custom struct be freed? I could free and NULL it immediately after using the information, but this feels wrong somehow. (It's also only 4 bytes, so we could just not worry about the micro-leak.)
Done
Yep, with your suggested approach, the core ggml.c changes are no longer necessary. Will close that PR.
I've moved code into …
Tested out the changes with a local MPI ring, and inference still seems to work. Will peel off the Draft label; please let me know if you'd like to see other changes.
On another note, this paper outlines parallelization strategies used in Google's PaLM: https://arxiv.org/abs/2211.05102 Not sure if they're applicable to LLaMA, but this would be a good starting point for thinking beyond simple layer-based pipeline parallelism...
I tried to factor out all the MPI logic into ggml-mpi. Want to test if this works, but I don't know how to make the … Edit: nvm, I just saw the full instructions that you have provided. Will give it a try now.
Great! Note that …
Thanks - it works now. Please take a look at the proposed changes. Let me know what you think.
I've looked over your branch; I agree it's a little hacky, but I was able to follow the logic. Overall it makes sense to me. It's great that this will work out of the box with many other models! My only tentative feedback would be to replace …
The linker is unhappy with OpenMPI on GitHub CI. Guessing it just needs a variable added somewhere in the CMakeLists.
mpi : trying to move more MPI stuff into ggml-mpi (ggerganov#2099)
Yup, let's resolve CI and I think we should try to utilize this approach to run a 65B LLaMA on Raspberry Pis. It would be a fun thing to try and potentially achieve the world-first inference of a 65B model on a cluster of Raspberries 😄
According to a header comment, … I will be busy the next few hours, but feel free to tweak / merge / etc. after reviewing CI. I'm looking forward to seeing 65B models running on clusters of hacked home appliances!
Model inference is currently limited by the memory on a single node. Using MPI, we can distribute models across a locally networked cluster of machines.
This PR uses a ring pipeline architecture so that the process at rank (index) 0 handles both input and output. The layers are grouped into slices, and each MPI slot (process) handles a slice. The communication during each token prediction travels around this ring; a rough sketch follows.
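For illustration only, here is a minimal sketch of that ring pass in plain MPI. This is not the PR's code: `ring_step`, `compute_layer_slice`, and the dummy activation buffer are placeholders, and the actual branch wraps the raw sends/receives in `ggml_mpi_send_tensor` / `ggml_mpi_recv_tensor`.

```c
#include <mpi.h>

// Placeholder for evaluating this rank's slice of transformer layers.
static void compute_layer_slice(float *activations, int n, int rank, int size) {
    (void)activations; (void)n; (void)rank; (void)size;
}

// One token's trip around the ring, assuming `size` ranks and `n` floats of
// activations. Only an illustration of the data flow described above.
static void ring_step(float *activations, int n, int rank, int size) {
    const int prev = (rank - 1 + size) % size;

    // Every rank except rank 0 waits for the previous slice's output.
    if (rank != 0) {
        MPI_Recv(activations, n, MPI_FLOAT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    compute_layer_slice(activations, n, rank, size);

    if (size > 1) {
        if (rank != size - 1) {
            // Forward this slice's output to the next rank in the ring.
            MPI_Send(activations, n, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        } else {
            // The last rank closes the ring by sending back to rank 0 ...
            MPI_Send(activations, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0 && size > 1) {
        // ... where rank 0 receives the final activations and goes on to
        // compute logits/embeddings, keeping all user I/O in one process.
        MPI_Recv(activations, n, MPI_FLOAT, size - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float activations[16] = {0};   // dummy buffer standing in for a tensor
    ring_step(activations, 16, rank, size);

    MPI_Finalize();
    return 0;
}
```

A toy program like this would be launched with something like `mpirun -np 8 ./ring_demo`, the same shape as the N=8 local run described next.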
Running MPI locally with N=8, you can see the 13B model distributed across 8 processes; each process takes up less than a gigabyte of system memory.
Note that this doesn't speed anything up, as the processes cannot execute concurrently, but these processes can be distributed to multiple machines to take advantage of more machine RAM. No special code was required to read a subset of weights; selective weight-loading is just a consequence of `mmap`. See the notes added to the README to try the distributed code for yourself.
Technical changes
The set of changes is somewhat minimal; the additions are:
- `LLAMA_MPI` compile-time option
- `ggml_mpi_send_tensor` and `ggml_mpi_recv_tensor` functions, possibly to be added to GGML later
- `llama_finalize_backend()` API function (calls `MPI_Finalize()`)
- `mpi_rank` and `mpi_size` fields in the `llama_context` object

To take advantage of MPI, binary CLI programs usually need no source code changes except to call `llama_finalize_backend()`. This is something of a hack: I have modified `llama_new_context_with_model` to enter an evaluation loop on non-primary processes. This loop blocks at `MPI_Barrier`, waiting for the driving (rank 0) program to call it. I'm open to other suggestions, but this strategy let me run the example programs more or less out of the box; see the sketch below.

The changes to the core token prediction algorithm involve sending or receiving tensors before and after the layer loop. Each process only handles a subset of layers. If the process does not handle the first layer, it receives the input tensor from the preceding process. To close the communication ring, the driving (first) process will receive the layer output from the last process, and use that output tensor to compute logits and embeddings. This ensures that all user I/O occurs within a single process.
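Here is a rough sketch of that control flow, reconstructed from the description above rather than taken from the branch; `worker_loop`, `eval_my_slice`, and the fixed `N_EVALS` stand-in for shutdown handling are all placeholders.

```c
// A reconstruction of the described control flow, not the PR's actual code.
#include <mpi.h>

enum { N_EVALS = 4 };                      // pretend we predict 4 tokens

// Placeholder for evaluating this rank's slice of layers for one token.
static void eval_my_slice(int rank, int step) {
    (void)rank; (void)step;
}

// Conceptually what non-primary ranks do inside context creation: block at
// the barrier until rank 0 drives another eval, run this rank's slice, repeat.
static void worker_loop(int rank) {
    for (int step = 0; step < N_EVALS; step++) {
        MPI_Barrier(MPI_COMM_WORLD);
        eval_my_slice(rank, step);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        // In the PR this loop is entered inside llama_new_context_with_model,
        // so example programs never reach their normal code path on these ranks.
        worker_loop(rank);
    } else {
        // Rank 0 is the driving process: each prediction releases the barrier,
        // and the other ranks evaluate their slices in lock step with it.
        for (int step = 0; step < N_EVALS; step++) {
            MPI_Barrier(MPI_COMM_WORLD);
            eval_my_slice(rank, step);
        }
    }

    // The one source change CLI programs need: a final call that runs
    // MPI_Finalize(), which is what llama_finalize_backend() wraps.
    MPI_Finalize();
    return 0;
}
```

The barrier does the real work here: non-primary ranks stay parked until rank 0 drives the next prediction, which is why the example programs run on rank 0 without modification.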
I was able to test the cluster code locally on an iMac connected to a (very slow) 12" MacBook over WiFi. It didn't win any speed awards, but it did generate plausible text, so I am confident in the overall algorithm correctness. However, there are likely bugs / oversights when it comes to handling MPI communication errors and shutdown.
Leaving as draft as I presume the GGML changes should be finalized and merged before the llama.cpp changes.
See previous discussion in #946