mpi : attempt inference of 65B LLaMA on a cluster of Raspberry Pis #2164
Comments
I wonder if this capability could be integrated with https://github.com/pry0cc/axiom. Axiom lets you spin up multiple cloud instances in minutes and run distributed scripts/workloads on them.
I could try simulating it: 8 VMs with 8GB of RAM each.
Cool idea. I have some more powerful embedded devices (RISC-V CPUs) that could be integrated into a cluster. I'm looking forward to this experiment and am willing to deploy it on a RISC-V cluster.
Ordered the parts today; they should be here by this time tomorrow (6 x Raspberry Pi 4, 8GB variants). In the meantime, I'm setting up a local MPI cluster on VMs to test the inferencing. Pick up your Pi(s) now, a shortage is coming!
I had issues with this in the past (a couple of weeks ago). It works at first, I can read the full file no problem, but then it suddenly stops and kills the program.
NFS or CIFS?
CIFS, between 2 Ubuntu machines
@theycallmeloki Hope I didn't set the expectations too high - even if this runs, the performance is expected to be really terrible. Likely a few (tens of) seconds per token for 65B. It's mostly a fun experiment - I don't think it would have any practical use.
The moment you said Raspberry Pi I knew we were on the meme train.
Well, the same experiment could be done with a bunch of gaming GPUs (e.g. 8x 6GB or 6x 8GB), and it would make more sense in terms of performance. But running a 65B model on an RPi cluster sounds like a more historic achievement 😄
@ggerganov Nope, not at all. I was going through the discussions and realized there is some room to add value around the inference pipelines. I can also imagine that varying the size of the virtual nodes in the Pi cluster and tweaking the partitioning of the model could lead to better tokens/second. This setup costs roughly an order of magnitude less than any other off-the-shelf self-hostable setup that can run a 65B model (I'm looking at the M2 Ultra Studios), so even if it's slow I think it's likely to be worth it for the novelty alone 😄
FWIW, a certain popular but very early report of ~10s/token was a bit exaggerated - the actual number right now is closer to 1.6s/token on a Pi 4B 8GB for a q6_k-quantized 7B model, which just barely fits alongside the OS and GUI. Even the Pi is memory-bandwidth-bound in this use case, and -t 4 is actually a bit slower than -t 3. The board has somewhere around 4GB/s of memory bandwidth. Running headless might also speed things up a bit given the architecture.
The actual cheapest right now might be a (used) Ryzen 5800X/5700G, the corresponding motherboard, peripherals, and 64GB of the fastest DDR4 RAM you can find in matched 32GB modules. The latter has become quite cheap after the DDR5 rollout and can be had for the price of some three to four Pi 4 8GBs. But no, that is not nearly as interesting!
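As a rough sanity check on the 1.6s/token figure above (back-of-the-envelope, not measured): a q6_k 7B model is on the order of 5.5GB of weights, and token generation has to stream essentially all of them once per token, so at ~4GB/s of memory bandwidth the floor is already about 1.3-1.4s/token. The observed 1.6s/token is consistent with being bandwidth-bound, which is also why adding a fourth thread doesn't help.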
I believe I might have gotten the local environment up and running on the Pis (I ran a hello-world example first to confirm MPI itself was running smoothly). I moved the 65B model to each Pi using scp (256GB microSD cards, so I'm hoping I do not need to mmap over a network drive).
I'm unable to determine how to split this model into smaller chunks so that it can fit on the individual Pis. Right now I reckon it's trying to load the entire model onto each Pi, which is probably why it is failing; logs below.
Hm, my expectation is that disabling the mmap prefetch might help here - can you try this patch?
diff --git a/llama-util.h b/llama-util.h
index 43b6f05..1c0502f 100644
--- a/llama-util.h
+++ b/llama-util.h
@@ -177,10 +177,7 @@ struct llama_mmap {
int fd = fileno(file->fp);
int flags = MAP_PRIVATE;
// prefetch/readahead impairs performance on NUMA systems
- if (numa) { prefetch = 0; }
-#ifdef __linux__
- if (prefetch) { flags |= MAP_POPULATE; }
-#endif
+ prefetch = 0;
addr = mmap(NULL, file->size, PROT_READ | PROT_WRITE, flags, fd, 0);
if (addr == MAP_FAILED) {
throw std::runtime_error(format("mmap failed: %s", strerror(errno)));
I'm just poking here - probably someone who understands better how it works can chime in.
Yes, the patch was definitely helpful. Prior to this I was also trying out a 7B model in parallel, just to ensure the whole process was working, and it used to fail with a similar error message to the one above; after this patch, the 7B model seems to be running on the Pis.
However, when I load the 65B model, a similar error is thrown to the one I got before the patch was in place (I'm not sure about this, tbh).
I would like to work on splitting the model so that each node loads just the weights specific to it - please give me some rough pointers on how I can approach this. I understand there are ckpt-to-diffusers and diffusers-to-ckpt converters; I could probably patch one side to split the model prior to writing it to disk, so that running both processes would end up producing split files that can be loaded by each Pi individually. On a parallel track, do you think adding swap storage equivalent to the model size would help? I imagine keeping the entire model on the microSD card and letting it page in only the specific portions it needs for its inference logic might help in this regard.
Update: The swap seems to have done the trick. I added a 50GB swap file to each Pi and was able to run the 65B model using:
I think swap was likely not the correct approach, as I fear the benefit of being able to use MPI to load only into RAM was that inference speed could be higher. Now that the entire model can be brought up on one Pi, technically I can run 6 conversations in parallel, one on each Pi, and they'd be just as slow as right now (about one token every 10-12 minutes), so I'm leaning towards believing the split-model approach could be better and will try along those lines. (Also, I can't run k8s on this if I enable swap, since etcd stashes into swap and messy things happen, and that's something I would like to be able to do down the line.)
@theycallmeloki try setting vm.overcommit_memory to 1
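For anyone following along, that setting can be applied with standard Linux sysctl commands (nothing llama.cpp-specific):

    # allow the kernel to overcommit memory so large mmap allocations don't fail upfront
    sudo sysctl -w vm.overcommit_memory=1
    # persist across reboots
    echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf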
If everything works as expected, it won't swap. Something is still not ok and I think it is swapping due to the large KV cache. Let's try to limit it by reducing the context length by x8:
diff --git a/llama.cpp b/llama.cpp
index 2d09d6c..3ebdca4 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -129,11 +129,11 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
static const std::map<e_model, size_t> & MEM_REQ_KV_SELF()
{
static std::map<e_model, size_t> k_sizes = {
- { MODEL_3B, 682ull * MB },
- { MODEL_7B, 1026ull * MB },
- { MODEL_13B, 1608ull * MB },
- { MODEL_30B, 3124ull * MB },
- { MODEL_65B, 5120ull * MB },
+ { MODEL_3B, 682ull * MB / 8 },
+ { MODEL_7B, 1026ull * MB / 8 },
+ { MODEL_13B, 1608ull * MB / 8 },
+ { MODEL_30B, 3124ull * MB / 8 },
+ { MODEL_65B, 5120ull * MB / 8 },
};
return k_sizes;
}
Btw, how much RAM does each RPi have?
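For context on where those MEM_REQ_KV_SELF numbers come from: the f16 KV cache is roughly 2 (K and V) x n_layer x n_ctx x n_embd x 2 bytes. For the 65B model (n_layer = 80, n_embd = 8192) at n_ctx = 2048, that is 2 x 80 x 2048 x 8192 x 2 bytes ≈ 5.4GB, i.e. the 5120 MB entry in the table. Dividing by 8 effectively caps the context around 256 tokens and frees several GB per node.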
Looks like it's time to make a few virtual machines to test this out.
I have now disabled the 50GB swap file on the Pis. Each Pi has 8GB RAM (~300MB used for a headless bootup).
A concern I have is that not all the Pis are involved in the mpirun, as only 2/6 of them seem to have spikes in compute, as shown here.
Let's goooo !! 😄
Yes, this is the expected behaviour. The MPI implementation currently just supports pipeline parallelisation, so each node processes part of the pipeline (i.e. a few layers of the graph) and passes the results to the next node. This allows each node to "see" only a part of the model and thus distribute it across the cluster. This is in contrast to tensor parallelisation, where all nodes can work in parallel on each part of the graph. However, that would require all nodes to see the entire model, which in our experiment is not viable since we cannot fit the entire model in the RPi RAM. In any case, I consider the experiment already a success! Also make sure the generation makes sense, and one more thing: update to the latest master.
Reading your comment again, I might have misunderstood. The nodes should continuously take turns to process the next part of the pipeline. If only the same 2 nodes always have spikes, then there is definitely something wrong.
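To put rough numbers on the pipeline split (back-of-the-envelope, not measured): 65B LLaMA has 80 transformer layers, so an even split across 6 Pis is about 13-14 layers per node, i.e. roughly one sixth of the ~35-40GB of q4_0 weights, or ~6-7GB per node, which is what makes the model fit within 8GB of RAM per Pi at all.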
Can I freely parallelize the model further to ensure I am able to densely pack the inferencing? Is there a cutoff at which point it becomes more about constructing the DAG than actually computing the tokens? Asking as I am not too sure how MPI does things. For example: when I run a hello-world example on 6 threads, instead of all of them responding, I get replies from 2 instantly.
Just to verify it wasn't an access issue, when I increase to 24 slots, all of them respond
Show the contents of your hostfile. cc @evanmiller for advice
This is what I am using for the hostfile:
Try using the following:
I believe OpenMPI supports this syntax but MPICH does not.
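The exact file under discussion isn't preserved above, but for reference the two implementations do spell per-host slots differently (host names below are illustrative):

    # Open MPI hostfile (pass with: mpirun -hostfile hostfile ...)
    pi1 slots=1
    pi2 slots=1
    pi3 slots=1

    # MPICH machine file (pass with: mpiexec -f machinefile ...)
    pi1:1
    pi2:1
    pi3:1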
@theycallmeloki Does it work if you mmap the model from a network share? Wondering if we can avoid the SD cards
Yes, we have currently cut the context by 8x since the KV cache becomes significant for 2048 tokens (~5GB), but if you keep adding RPi nodes you can utilize the full context, as with each new node they each need to load smaller and smaller parts of the model. With the current implementation you can have up to 80 nodes for a 65B LLaMA model - 1 layer per node.
This is surprisingly fast 😄 Btw, you can try with --mlock. Also, does the generated text make sense?
I believe so, the generated text does make sense for the most part.
The vm.overcommit_memory = 1 flag was unnecessary, so I dropped it. This was with:
This was with:
I am looking into the mlock flag and the network-drive mmap next.
Update: Starting from master, I did not have to make the patch regarding prefetching/readahead. This actually had an impact on the time it takes to load up the model, as I think it's now able to load everything in parallel instead of sequentially (I am not sure if I should keep the patch or remove it). mlock is failing to allocate memory (this could be related to the above, but it is not a deal breaker, since the generation continues to work, albeit slower).
To confirm, I have added the patch reducing the context window by 8x.
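A side note on the mlock failure mentioned above: it is often just the default RLIMIT_MEMLOCK (commonly 64 KiB) rather than actual memory pressure, so checking and raising the limit may be worth a try (generic Linux steps, untested on this setup; the "pi" user name is illustrative):

    # show the current locked-memory limit
    ulimit -l
    # raise it for the current shell (needs a permissive hard limit or root)
    ulimit -l unlimited
    # or persist per-user in /etc/security/limits.conf:
    #   pi  soft  memlock  unlimited
    #   pi  hard  memlock  unlimited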
It's not clear from the explanation, but I expect that on the latest master the prefetch patch should not be needed.
However, each node only uses the KV cache for the layers that it computes, so we are over-allocating a lot of memory. This is an important thing to improve for the MPI inference.
Hi everyone, has anyone experienced the following issue when running inference using mpirun on a cluster? If so, how did you solve it? Thank you in advance!
I'm running into what I think is a case of the processes not all returning when finished. I wonder if anyone is seeing something similar or can help. I've followed along with the instructions here, including the patch mentioned by @ggerganov on 14 July, and I have three Pi 4 8GB nodes in a Slurm cluster with OpenMPI. I'm running the mpirun commands within Slurm batch jobs as such:
and my slurm-<job_number>.out file looks like this:
leading me to think that the output is all finished (unless the models I have were trained on their own output :) ), but the resources are never released to allow the next job to run. I can cancel it and the next one runs just fine, and I presume I can set a time limit on the job, but I'm wondering if I'm missing an MPI configuration somewhere. Full output:
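Until the MPI backend exits cleanly on its own, the time-limit workaround mentioned above can be set directly at the top of the batch script (the value is illustrative):

    #SBATCH --time=00:30:00   # kill the job after 30 minutes so the nodes are released for the next job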
Yeah, the MPI backend is missing a lot of functionality, one of which is terminating upon end of text.
Does MPI support multi-GPU device distribution?
@ggerganov firstly, I can't thank you and every other contributor enough for the hard work you do on this project. You mentioned before the limitation of each node only seeing part of the model at a time due to memory constraints. Just wondering: what if that were not a limitation? What if one were to, say, have a cluster of M2 Mac Studios? My thinking at the moment is along these lines: 30x of those is both cheaper and easier to get than a DGX/HGX with 8x H100s right now, and I recently saw something on Twitter that makes me think I may not be the only one trying to find at least $6000 to spare right now 👀. So I am desperately trying to think of a way I could use such a cluster to run llama.cpp at something approaching production scale after proving out our experiments, simply buying more of them until it makes sense to go GPU. Maybe MPI isn't the way; if not, any ideas? How's batch inference looking these days? Thanks again for your gift to the world that continues to change the calculus.
@justinh-rahb It should be possible, since the real bottleneck for token generation is memory throughput; having 2 machines theoretically doubles that (ignoring overhead).
It is indeed an argument for MPI, but having enough memory is not an argument against MPI :) Also, it would be hilarious if you could outbench datacenter GPUs by that much 😆
In theory, a parallel implementation would split the model tensors across rows to 32 nodes and the MPI code would synchronize the activations after each layer, which would give you ~32x the memory bandwidth of a single node. And on top of that, each node would need just 1/32 of the total memory. In practice, there will be synchronization points and network latency, and it might be tricky to update the code to support this mode of parallelization. You would also want to do this with a large model to compensate for the overhead from the previous points. Maybe the new 180B Falcon is a good candidate. But in any case, I find this experiment very interesting and hopefully will give it a try when I have some time and hardware to do so. Running Q8 Falcon 180B on a 4x Mac Studio cluster at 20 t/s sounds fun!
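Rough numbers behind that last estimate (assuming ideal scaling): Q8_0 is about 8.5 bits per parameter, so 180B parameters is on the order of 190GB of weights, and an M2 Ultra has roughly 800GB/s of memory bandwidth. A perfect 4-way row split would stream the weights at ~3200GB/s aggregate, for an upper bound of roughly 3200 / 190 ≈ 17 tokens/s - in the same ballpark as the 20 t/s figure, before synchronization and network overhead.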
You like to party, fantastic 😬
What about 10 Orange Pi 5 16GB boards at $109 each on AliExpress? This gives me 160GB RAM and 80 cores, 40 at 2GHz and 40 at 2.4GHz, requiring 150W. I can fit about 16 of these into a long 1U case I have. Would it be possible to use the RKNN SDK (supports quantization) to utilize the NPU on each, with 6 TOPS each, 60 TOPS total? Might be a stupid question, but the idea intrigues me.
Not at the moment.
Not ggml's custom quantizations, but otherwise it might work. Not sure it gives you the value back though. 😄
I would like to add to @davidwynter's point: I think the Orange Pi 5 32GB would be even better, and 32 of them would be a very interesting number - to begin with, that's 1TB of RAM! I did some research on the hardware and wrote a Kickstarter campaign on Mirror (sorry if this comes across as spam) here, if anyone would like to support me in purchasing the hardware and running the experiment.
The Orange Pi 5 can use large models.
There's a phone called the Honor X50, with a Snapdragon 6 Gen 1 and 16GB RAM, in the same price range...
This is only for AMD GPUs, but I found this in their ROCm documentation.
Was this exit implemented in master? I just ran it last weekend and ran into an error when finishing a run on a 2-node ARM64 CPU setup.
Hello guys! I've developed a project that enables running Llama 2 (7B, 13B, 70B) inference on multiple devices in parallel, resulting in improved inference speed. The project is called Distributed Llama. Of course, some of the code I copied from llama.cpp. 😅 My best results:
Currently the project is only optimized for ARM CPUs. I described how to run it in the repository. Here is the report with more details.
Now that distributed inference is supported thanks to the work of @evanmiller in #2099, it would be fun to try to utilize it for something cool. One such idea is to connect a bunch of Raspberry Pis in a local network and run the inference using MPI:
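The command block that originally accompanied this isn't preserved in this copy; as a sketch, the MPI build and launch described in the llama.cpp README at the time looked roughly like this (node count, hostfile path, prompt, and model path are illustrative):

    # build with MPI support on every node
    make CC=mpicc CXX=mpicxx LLAMA_MPI=1
    # launch one rank per Pi, reading the model from the shared mount
    mpirun -hostfile ./hostfile -n 6 ./main -m /mnt/models/65B/ggml-model-q4_0.bin -p "Once upon a time" -n 64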
Here we assume that the 65B model data is located on a network share in /mnt and that mmap works over a network share. Not sure if that is the case - if not, then it would be more difficult to perform this experiment.
Looking for people with access to the necessary hardware to perform this experiment