Using MPI w/ 65B model but each node uses the full RAM. #2209

Open
magnusviri opened this issue Jul 13, 2023 · 3 comments
Labels
help wanted (Extra attention is needed)

Comments

@magnusviri
Contributor

I am trying to use MPI, but each node uses the full RAM. Is this how MPI is supposed to work? I didn't think it was. Here are the details.

I am on commit 1cbf561. I modified the Makefile so I could compile it like this (see #2208).

LLAMA_MPI=1 LLAMA_METAL=1 make CC=/opt/homebrew/bin/mpicc CXX=/opt/homebrew/bin/mpicxx 

I run the following.

mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -n 128 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."
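
For reference, the hostfile just lists one machine per line. Something like this (the hostnames below are placeholders for my three Macs):

mini1.local
mini2.local
mini3.local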

This is the output. It works, but each node uses 39 GB of RAM. Each node has only 16 GB of RAM, so they swap badly.

main: build = 827 (1cbf561)
main: seed  = 1689216374
main: build = 827 (1cbf561)
main: seed  = 1689216374
main: build = 827 (1cbf561)
main: seed  = 1689216374
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: kv self size  = 1280.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A. Paris. [end of text]

llama_print_timings:        load time = 149282.74 ms
llama_print_timings:      sample time =     2.15 ms /     3 runs   (    0.72 ms per token,  1397.95 tokens per second)
llama_print_timings: prompt eval time = 20222.54 ms /    25 tokens (  808.90 ms per token,     1.24 tokens per second)
llama_print_timings:        eval time =  2537.97 ms /     2 runs   ( 1268.99 ms per token,     0.79 tokens per second)
llama_print_timings:       total time = 22764.59 ms

[[email protected]] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[[email protected]] control_cb (pm/pmiserv/pmiserv_cb.c:316): error writing to control socket
[[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[[email protected]] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

If I enable Metal, it errors out.

mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -n 128 -ngl 1 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."

This is the output.

main: build = 827 (1cbf561)
main: seed  = 1689216039
main: build = 827 (1cbf561)
main: seed  = 1689216039
main: build = 827 (1cbf561)
main: seed  = 1689216040
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/james/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x13b604a40
ggml_metal_init: loaded kernel_mul                            0x13b605630
ggml_metal_init: loaded kernel_mul_row                        0x13b605c20
ggml_metal_init: loaded kernel_scale                          0x13b606210
ggml_metal_init: loaded kernel_silu                           0x13b606800
ggml_metal_init: loaded kernel_relu                           0x13b606df0
ggml_metal_init: loaded kernel_gelu                           0x13b6073e0
ggml_metal_init: loaded kernel_soft_max                       0x13b607cf0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x13b608400
ggml_metal_init: loaded kernel_get_rows_f16                   0x13b608b40
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x12b6042f0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x12b604b70
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x12b605120
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x14b7050c0
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x14b705790
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x12b605460
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x12b605b30
ggml_metal_init: loaded kernel_rms_norm                       0x12b606440
ggml_metal_init: loaded kernel_norm                           0x12b606d50
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x12b6077e0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x12b607dd0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x12b6083d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x12b6089d0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x12b609170
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x12b609770
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x12b609d70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x106304490
ggml_metal_init: loaded kernel_rope                           0x106305300
ggml_metal_init: loaded kernel_alibi_f32                      0x106305e20
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x106306920
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x106307420
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x106308070
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   140.62 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =   8442462208
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =  16884924416
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =  25327386624
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  2821.31 MB, offs =  33769848832, (35589.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1536.00 MB, (37125.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1282.00 MB, (38407.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =  1024.00 MB, (39431.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =  1024.00 MB, (40455.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: kv self size  = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/james/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1080044b0
ggml_metal_init: loaded kernel_mul                            0x1080051c0
ggml_metal_init: loaded kernel_mul_row                        0x1080057b0
ggml_metal_init: loaded kernel_scale                          0x108104330
ggml_metal_init: loaded kernel_silu                           0x108104a40
ggml_metal_init: loaded kernel_relu                           0x108005c80
ggml_metal_init: loaded kernel_gelu                           0x108006390
ggml_metal_init: loaded kernel_soft_max                       0x108006ca0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x107704610
ggml_metal_init: loaded kernel_get_rows_f16                   0x107704e70
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x107705420
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x107705b40
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x1077060f0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x1077066a0
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x1082041a0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x108204870
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x107706b30
ggml_metal_init: loaded kernel_rms_norm                       0x107706f90
ggml_metal_init: loaded kernel_norm                           0x1077078a0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x1082051c0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x1082058d0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x108205ed0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x1082064d0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x108206c70
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x108207270
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x108207870
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x108207e70
ggml_metal_init: loaded kernel_rope                           0x108208bc0
ggml_metal_init: loaded kernel_alibi_f32                      0x1082096e0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x10820a1e0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x10820ace0
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x1080078f0
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   140.62 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =   8442462208
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =  16884924416
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =  25327386624
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  2821.31 MB, offs =  33769848832, (35589.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1536.00 MB, (37125.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1282.00 MB, (38407.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =  1024.00 MB, (39431.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =  1024.00 MB, (40455.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/james/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x132605260
ggml_metal_init: loaded kernel_mul                            0x132605e50
ggml_metal_init: loaded kernel_mul_row                        0x132606440
ggml_metal_init: loaded kernel_scale                          0x132606a30
ggml_metal_init: loaded kernel_silu                           0x132607020
ggml_metal_init: loaded kernel_relu                           0x132607610
ggml_metal_init: loaded kernel_gelu                           0x132607c00
ggml_metal_init: loaded kernel_soft_max                       0x132608510
llama_new_context_with_model: max tensor size =   140.62 MB
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1068046d0
ggml_metal_init: loaded kernel_get_rows_f16                   0x106804e10
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x1068053c0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x106805ae0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x106806090
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x106806640
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x106806bf0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x1068071a0
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x106807750
ggml_metal_init: loaded kernel_rms_norm                       0x106808060
ggml_metal_init: loaded kernel_norm                           0x106808970
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x106809400
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x1068099f0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x106809ff0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x10680a5f0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x10680ad90
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x10680b390
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x10680b990
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x10680bf90
ggml_metal_init: loaded kernel_rope                           0x10680cce0
ggml_metal_init: loaded kernel_alibi_f32                      0x10680d800
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x10680e300
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x10680ee00
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x10680fa50
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =   8442462208
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =  16884924416
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8192.00 MB, offs =  25327386624
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  2821.31 MB, offs =  33769848832, (35589.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1536.00 MB, (37125.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1282.00 MB, (38407.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =  1024.00 MB, (39431.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =  1024.00 MB, (40455.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0

I'm guessing it fails because it runs out of memory: the Metal warnings above show about 40 GB allocated against a recommendedMaxWorkingSetSize of 10922.67 MB.

@ggerganov
Owner

Can you try updating to the latest master, since 32c5411 was an mmap regression?
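
If it helps, something like this should work to update and rebuild (reusing the compile command from the report above, and assuming no conflicting local changes):

git pull
make clean
LLAMA_MPI=1 LLAMA_METAL=1 make CC=/opt/homebrew/bin/mpicc CXX=/opt/homebrew/bin/mpicxx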

Also, maybe try using the --mlock argument. Again, I'm just throwing out ideas without really having a good understanding of whether this makes sense.
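
i.e. something like this (untested, just the original run command plus --mlock):

mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin --mlock -n 128 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."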

ggerganov added the help wanted (Extra attention is needed) label on Jul 14, 2023
@izard

izard commented Jul 16, 2023

Not sure about 65B, but I tried a 33B model that mmaps 26 GB on a Mac mini with 24 GB of RAM. It swapped and ran at 46 seconds per token. Then I added a second Mac mini over MPI, and together they ran at 450 ms per token, roughly 100x faster.

@ggerganov
Owner

Yup, the approach is definitely viable. There is more progress here: #2164 (comment)

On the Mac mini, you can also enable Metal, which might lead to some extra improvements.
