I am trying to use MPI, but each node uses the full amount of RAM needed for the whole model. Is this how MPI is supposed to work? I didn't think it was. Here are the details.
I am on commit 1cbf561. I modified the Makefile so I could compile it like this (see #2208).
LLAMA_MPI=1 LLAMA_METAL=1 make CC=/opt/homebrew/bin/mpicc CXX=/opt/homebrew/bin/mpicxx
I run the following.
mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -n 128 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."
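For reference, the hostfile is just the list of nodes, one per line; the names below are placeholders rather than my actual machines:

mini1.local
mini2.local
mini3.local

With -n 3 and three hosts listed, each rank should end up on a different node.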
This is the output. It works, but each node uses about 39 GB of RAM (the log reports mem required = 38610.47 MB). Each node only has 16 GB of RAM, so they swap badly.
main: build = 827 (1cbf561)
main: seed = 1689216374
main: build = 827 (1cbf561)
main: seed = 1689216374
main: build = 827 (1cbf561)
main: seed = 1689216374
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: kv self size = 1280.00 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A. Paris. [end of text]
llama_print_timings: load time = 149282.74 ms
llama_print_timings: sample time = 2.15 ms / 3 runs ( 0.72 ms per token, 1397.95 tokens per second)
llama_print_timings: prompt eval time = 20222.54 ms / 25 tokens ( 808.90 ms per token, 1.24 tokens per second)
llama_print_timings: eval time = 2537.97 ms / 2 runs ( 1268.99 ms per token, 0.79 tokens per second)
llama_print_timings: total time = 22764.59 ms
[[email protected]] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[[email protected]] control_cb (pm/pmiserv/pmiserv_cb.c:316): error writing to control socket
[[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[[email protected]] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
If I enable Metal, it errors out.
mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -n 128 -ngl 1 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."
I'm guessing it fails because it runs out of memory.
Not sure about 65B, but I tried a 33B model that mmaps 26 GB on a Mac mini with 24 GB of RAM. It swapped and ran at 46 seconds per token. Then I added a second Mac mini over MPI, and together they ran at about 450 ms per token, roughly 100x faster.
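As far as I can tell, that is why adding nodes helps even though every rank prints the full "mem required": the MPI build splits the layers across ranks, each rank mmaps the whole weight file, and only the pages for the layers a rank actually evaluates get faulted into its RAM. Below is a minimal illustrative sketch of that kind of layer partitioning; it is not llama.cpp's actual code, and run_layers is a made-up stand-in for evaluating a slice of the model.

// Illustrative sketch only -- not llama.cpp's implementation.
// Each rank could mmap the full model file, but it only touches (and thus
// pages in) the contiguous block of layers assigned to it.
#include <mpi.h>
#include <stdio.h>

// Hypothetical stand-in for evaluating layers [first, last) of the model.
static void run_layers(int rank, int first, int last) {
    printf("rank %d evaluates layers [%d, %d)\n", rank, first, last);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_layer = 80;                      // the 65B model has 80 layers
    int per_rank = (n_layer + size - 1) / size;  // ceil(n_layer / size)
    int first    = rank * per_rank;
    int last     = first + per_rank > n_layer ? n_layer : first + per_rank;

    // Address space for the whole file would be reserved on every rank,
    // but only the pages backing these layers get faulted into RAM.
    run_layers(rank, first, last);

    MPI_Finalize();
    return 0;
}

With 3 ranks and 80 layers, each rank touches 27 layers (26 for the last one), i.e. roughly a third of the weights.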