Distributed inference via MPI #2099
Conversation
That's actually neat! I'm surprised that the change is so small. I saw the discussion in ggerganov/ggml#340. Let's try to make the following changes and see if we can make the implementation more decoupled from …
In short, you can follow the …
Done; but how/when should this custom struct be freed? I could free and NULL it immediately after using the information, but this feels wrong somehow. (It's also only 4 bytes, so we could just not worry about the micro-leak.)
Done
Yep, with your suggested approach, the core ggml.c changes are no longer necessary. Will close that PR.
I've moved code into …
Tested out the changes with a local MPI ring, and inference still seems to work. Will peel off the Draft label; please let me know if you'd like to see other changes.
On another note, this paper outlines parallelization strategies used in Google's PaLM: https://arxiv.org/abs/2211.05102 Not sure if they're applicable to LLaMA, but this would be a good starting point for thinking beyond simple layer-based pipeline parallelism...
I tried to factor out all the MPI logic into ggml-mpi. Want to test if this works, but I don't know how to make the … Edit: nvm, I just saw the full instructions that you have provided. Will give it a try now.
Great! Note that …
Thanks - it works now. Please take a look at the proposed changes. Let me know what you think.
I've looked over your branch; I agree it's a little hacky, but I was able to follow the logic. Overall it makes sense to me. It's great that this will work out of the box with many other models! My only tentative feedback would be to replace …
The linker is unhappy with OpenMPI on GitHub CI. Guessing it just needs a variable added somewhere in the CMakeLists.
mpi : trying to move more MPI stuff into ggml-mpi (ggerganov#2099)
Yup, let's resolve CI and I think we should try to utilize this approach to run a 65B LLaMA on Raspberry Pis. It would be a fun thing to try and potentially achieve the world-first inference of a 65B model on a cluster of Raspberries 😄
According to a header comment, … I will be busy the next few hours, but feel free to tweak / merge / etc. after reviewing CI. I'm looking forward to seeing 65B models running on clusters of hacked home appliances!
Model inference is currently limited by the memory on a single node. Using MPI, we can distribute models across a locally networked cluster of machines.
This PR uses a ring pipeline architecture so that the process at rank (index) 0 handles both input and output. The layers are grouped into slices, and each MPI slot (process) handles a slice. The communication during each token prediction travels around this ring; a rough sketch follows.
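For illustration only, here is a minimal sketch of that ring pass in plain MPI. This is not the PR's code: `ring_step`, `compute_layer_slice`, and the dummy activation buffer are placeholders, and the actual branch wraps the raw sends/receives in `ggml_mpi_send_tensor` / `ggml_mpi_recv_tensor`.

```c
#include <mpi.h>

// Placeholder for evaluating this rank's slice of transformer layers.
static void compute_layer_slice(float *activations, int n, int rank, int size) {
    (void)activations; (void)n; (void)rank; (void)size;
}

// One token's trip around the ring, assuming `size` ranks and `n` floats of
// activations. Only an illustration of the data flow described above.
static void ring_step(float *activations, int n, int rank, int size) {
    const int prev = (rank - 1 + size) % size;

    // Every rank except rank 0 waits for the previous slice's output.
    if (rank != 0) {
        MPI_Recv(activations, n, MPI_FLOAT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    compute_layer_slice(activations, n, rank, size);

    if (size > 1) {
        if (rank != size - 1) {
            // Forward this slice's output to the next rank in the ring.
            MPI_Send(activations, n, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        } else {
            // The last rank closes the ring by sending back to rank 0 ...
            MPI_Send(activations, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0 && size > 1) {
        // ... where rank 0 receives the final activations and goes on to
        // compute logits/embeddings, keeping all user I/O in one process.
        MPI_Recv(activations, n, MPI_FLOAT, size - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float activations[16] = {0};   // dummy buffer standing in for a tensor
    ring_step(activations, 16, rank, size);

    MPI_Finalize();
    return 0;
}
```

A toy program like this would be launched with something like `mpirun -np 8 ./ring_demo`, the same shape as the N=8 local run described next.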
Running MPI locally with N=8, you can see the 13B model distributed across 8 processes; each process takes up less than a gigabyte of system memory.
Note that this doesn't speed anything up, as the processes cannot execute concurrently, but these processes can be distributed to multiple machines to take advantage of more machine RAM. No special code was required to read a subset of weights; selective weight-loading is just a consequence of `mmap`. See the notes added to the README to try the distributed code for yourself.
Technical changes
The set of changes is somewhat minimal; the additions are:
- `LLAMA_MPI` compile-time option
- `ggml_mpi_send_tensor` and `ggml_mpi_recv_tensor` functions, possibly to be added to GGML later
- `llama_finalize_backend()` API function (calls `MPI_Finalize()`)
- `mpi_rank` and `mpi_size` fields in the `llama_context` object

To take advantage of MPI, binary CLI programs usually need no source code changes except to call `llama_finalize_backend()`. This is something of a hack: I have modified `llama_new_context_with_model` to enter an evaluation loop on non-primary processes. This loop blocks at `MPI_Barrier`, waiting for the driving (rank 0) program to call it. I'm open to other suggestions, but this strategy let me run the example programs more or less out of the box; see the sketch below.

The changes to the core token prediction algorithm involve sending or receiving tensors before and after the layer loop. Each process only handles a subset of layers. If the process does not handle the first layer, it receives the input tensor from the preceding process. To close the communication ring, the driving (first) process will receive the layer output from the last process, and use that output tensor to compute logits and embeddings. This ensures that all user I/O occurs within a single process.
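Here is a rough sketch of that control flow, reconstructed from the description above rather than taken from the branch; `worker_loop`, `eval_my_slice`, and the fixed `N_EVALS` stand-in for shutdown handling are all placeholders.

```c
// A reconstruction of the described control flow, not the PR's actual code.
#include <mpi.h>

enum { N_EVALS = 4 };                      // pretend we predict 4 tokens

// Placeholder for evaluating this rank's slice of layers for one token.
static void eval_my_slice(int rank, int step) {
    (void)rank; (void)step;
}

// Conceptually what non-primary ranks do inside context creation: block at
// the barrier until rank 0 drives another eval, run this rank's slice, repeat.
static void worker_loop(int rank) {
    for (int step = 0; step < N_EVALS; step++) {
        MPI_Barrier(MPI_COMM_WORLD);
        eval_my_slice(rank, step);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        // In the PR this loop is entered inside llama_new_context_with_model,
        // so example programs never reach their normal code path on these ranks.
        worker_loop(rank);
    } else {
        // Rank 0 is the driving process: each prediction releases the barrier,
        // and the other ranks evaluate their slices in lock step with it.
        for (int step = 0; step < N_EVALS; step++) {
            MPI_Barrier(MPI_COMM_WORLD);
            eval_my_slice(rank, step);
        }
    }

    // The one source change CLI programs need: a final call that runs
    // MPI_Finalize(), which is what llama_finalize_backend() wraps.
    MPI_Finalize();
    return 0;
}
```

The barrier does the real work here: non-primary ranks stay parked until rank 0 drives the next prediction, which is why the example programs run on rank 0 without modification.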
I was able to test the cluster code locally on an iMac connected to a (very slow) 12" MacBook over WiFi. It didn't win any speed awards, but it did generate plausible text, so I am confident in the overall algorithm correctness. However, there are likely bugs / oversights when it comes to handling MPI communication errors and shutdown.
Leaving as draft as I presume the GGML changes should be finalized and merged before the llama.cpp changes.
See previous discussion in #946