
Bump llama-cpp-python to 0.2.18 #4611

Merged · 10 commits merged into dev · Nov 17, 2023

Conversation

oobabooga (Owner)

GPU offloading didn't work in 0.2.17, and now 0.2.18 crashes with "Illegal instruction (core dumped)" when I try to load a model. I'll leave this PR here until this gets figured out.

oobabooga changed the base branch from main to dev on November 16, 2023 at 01:33
mjameson

Thanks, this enables Falcon 180B functionality on my Mac Studio.

Ph0rk0z (Contributor) commented Nov 16, 2023

I am not having any issues using the MMQ kernels. Maybe I lost 0.3 t/s at empty context on a 70B model. The new Min_P sampler still needs to be added to the pure llama.cpp loader, along with all the samplers that exllama added.

oobabooga (Owner, Author)

I think that the missing exllamav2 samplers should be covered now: 58c6001

oobabooga (Owner, Author) commented Nov 16, 2023

Min_P cannot be added to the llama.cpp loader yet. While the Python bindings for the C++ sampling functions are available, the parameter is not implemented in the _create_completion function:

https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1294
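
For context, Min_P keeps only the tokens whose probability is at least min_p times that of the most likely token. A minimal NumPy sketch of the idea (illustrative only, not the llama-cpp-python API):

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Zero out tokens whose probability is below min_p times the
    probability of the most likely token, then renormalize."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Example: sample one token id from the filtered distribution.
rng = np.random.default_rng(0)
logits = rng.normal(size=32_000)
token_id = rng.choice(logits.size, p=min_p_filter(logits, min_p=0.1))
```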

I added the (apparently) new "seed" generation parameter though, which replaces the previous useless "seed" loading parameter.


I'm studying the possibility of removing the "CPU only" version of llama-cpp-python from the requirements and installing only a single version for CUDA / AMD users, with the original "llama_cpp" namespace instead of "llama_cpp_cuda".

The modified namespace seems to be causing problems, and I don't see the use case for two libraries now that the installer handles the AVX2, no-AVX2, CUDA, and CPU-only cases automatically.
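
For illustration, the dual-namespace setup implies a fallback import roughly like the sketch below; the module names llama_cpp and llama_cpp_cuda come from this discussion, while the load_model helper is hypothetical:

```python
# Prefer the CUDA build when it is installed, otherwise fall back to the
# CPU-only package, and expose a single name to the rest of the code.
try:
    import llama_cpp_cuda as llama_cpp  # CUDA/ROCm wheel (separate namespace)
except ImportError:
    import llama_cpp  # CPU-only wheel (original namespace)

def load_model(path: str, n_gpu_layers: int = 0):
    # Llama(model_path=..., n_gpu_layers=...) is the standard
    # llama-cpp-python entry point; the helper itself is illustrative.
    return llama_cpp.Llama(model_path=path, n_gpu_layers=n_gpu_layers)
```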

Requesting feedback from @jllllll on whether this makes sense.

oobabooga merged commit 923c8e2 into dev on Nov 17, 2023
oobabooga added a commit that referenced this pull request on Nov 17, 2023
jllllll (Contributor) commented Nov 17, 2023

@oobabooga
The original reason for having separate packages for both the CPU and CUDA versions was to allow for easier testing since the CUDA version can't fully switch off the CUDA code. There were some other reasons as well, but I don't remember them.

I am working on a fix for the issues that have arisen with this.
You can see some discussion as to the cause here:
jllllll/llama-cpp-python-cuBLAS-wheels#21
abetlen/llama-cpp-python#922

oobabooga (Owner, Author)

@jllllll I was inclined to keep only the CUDA wheels for simplicity. Some people were confused by the "cpu" checkbox in the llama.cpp loader, and I also haven't seen anyone using the "cpu" option recently. But if you feel like it's best to keep this option, then we can keep it.

I am at this very moment trying to build wheels using your workflows with the -DLLAMA_CUDA_FORCE_MMQ=ON flag added; without it, performance in the latest llama.cpp drops immensely for GPUs without tensor cores. See the reports here, here, and here.
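
For reference, a rough sketch of passing that flag in a local source build; CMAKE_ARGS and FORCE_CMAKE are the usual knobs for llama-cpp-python source builds, but treat the exact flag combination here as an assumption rather than the workflow change itself:

```python
# Hypothetical local rebuild of llama-cpp-python 0.2.18 with MMQ forced on.
import os
import subprocess
import sys

env = os.environ.copy()
env["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=ON"  # assumed flags
env["FORCE_CMAKE"] = "1"

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "--no-cache-dir",
     "llama-cpp-python==0.2.18"],
    env=env,
)
```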

There was also a report of higher memory usage on a Mac in the latest version; I don't know what is up with that.

Here is my commit (I was going to PR it to you later if it works): oobabooga/llama-cpp-python-cuBLAS-wheels@beb1b54

My wheels are still building, so I haven't tested them yet. Your workflows are mind-blowing, by the way.

jllllll (Contributor) commented Nov 17, 2023

I'll do a local build with that and test it on my 1080 Ti to see what the performance difference is.
Does that flag have a negative impact on newer GPU performance?

oobabooga (Owner, Author) commented Nov 17, 2023

Yes, it makes fully offloaded performance for a 13b model on a 3090 go from ~45 tokens/second to ~30 tokens/second (or something like that). This optimization was introduced in ggerganov/llama.cpp#3776, but it doesn't work for all GPUs.

I think that there are plans for detecting the GPU model automatically at runtime in llama.cpp, but for now, the switch has to be made at compile time. I think that the best we can do is target the lower end GPUs until that update happens in llama.cpp.
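
Until then the choice has to be made when picking a wheel. A hypothetical helper showing the kind of runtime check an installer could do (assumes PyTorch is available and that tensor cores correspond to CUDA compute capability 7.0 and above):

```python
import torch

def wants_force_mmq(device: int = 0) -> bool:
    """Return True when the GPU lacks tensor cores (compute capability
    below 7.0), i.e. when a FORCE_MMQ build is likely the better choice."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device)
    return major < 7

if __name__ == "__main__":
    print("FORCE_MMQ wheel" if wants_force_mmq() else "default wheel")
```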

jllllll (Contributor) commented Nov 17, 2023

Without flag: 20-23 t/s
With flag: 24-37 t/s

Hopefully they will add GPU detection soon. That is something that has been needed for a while now.
They could just as easily hard-code the MMQ code for non-tensor GPUs in C++. No detection needed.
I'll start rebuilding llama-cpp-python-cuda wheels for 0.2.18. After that, I'll go through 0.2.14-0.2.17.

oobabooga (Owner, Author)

I ran a test on my GTX 1650 and couldn't get the old performance back with my workflow wheels:

0.2.11:

llama_print_timings:        load time =  8880.86 ms
llama_print_timings:      sample time =   109.16 ms /   200 runs   (    0.55 ms per token,  1832.16 tokens per second)
llama_print_timings: prompt eval time = 78151.46 ms /  3200 tokens (   24.42 ms per token,    40.95 tokens per second)
llama_print_timings:        eval time = 133905.11 ms /   199 runs   (  672.89 ms per token,     1.49 tokens per second)
llama_print_timings:       total time = 212725.49 ms

0.2.18 without -DLLAMA_CUDA_FORCE_MMQ=ON:

llama_print_timings:        load time =   41630.44 ms
llama_print_timings:      sample time =     126.56 ms /   200 runs   (    0.63 ms per token,  1580.24 tokens per second)
llama_print_timings: prompt eval time =  289150.58 ms /  3200 tokens (   90.36 ms per token,    11.07 tokens per second)
llama_print_timings:        eval time =  147013.30 ms /   199 runs   (  738.76 ms per token,     1.35 tokens per second)
llama_print_timings:       total time =  437592.15 ms

0.2.18 with -DLLAMA_CUDA_FORCE_MMQ=ON (supposedly):

https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/wheels/llama_cpp_python-0.2.18+cu121-cp311-cp311-manylinux_2_31_x86_64.whl

llama_print_timings:        load time =   41533.16 ms
llama_print_timings:      sample time =     121.64 ms /   200 runs   (    0.61 ms per token,  1644.17 tokens per second)
llama_print_timings: prompt eval time =  287120.45 ms /  3200 tokens (   89.73 ms per token,    11.15 tokens per second)
llama_print_timings:        eval time =  138205.92 ms /   199 runs   (  694.50 ms per token,     1.44 tokens per second)
llama_print_timings:       total time =  426661.81 ms

Most likely I put -DLLAMA_CUDA_FORCE_MMQ=ON in the wrong places.

jllllll (Contributor) commented Nov 17, 2023

This is what I did: jllllll/llama-cpp-python-cuBLAS-wheels@f6d1e53
I didn't use that flag for the AMD GPU builds, as I don't know what effect it has on AMD, or whether it has any at all.

oobabooga (Owner, Author)

I think that your wheels will work; my logs say that MMQ was not being forced even though it should have been:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes

jllllll (Contributor) commented Nov 18, 2023

CUDA wheels are built. Currently building the ROCm wheels.

All of the relevant 0.2.18 wheels should be rebuilt now.

oobabooga (Owner, Author) commented Nov 18, 2023

Thank you! This fixes the broken prompt processing times. The speed is no better than it was in late September, but at least it's not a lot worse:

0.2.11:

llama_print_timings:        load time =  8880.86 ms
llama_print_timings:      sample time =   109.16 ms /   200 runs   (    0.55 ms per token,  1832.16 tokens per second)
llama_print_timings: prompt eval time = 78151.46 ms /  3200 tokens (   24.42 ms per token,    40.95 tokens per second)
llama_print_timings:        eval time = 133905.11 ms /   199 runs   (  672.89 ms per token,     1.49 tokens per second)
llama_print_timings:       total time = 212725.49 ms

0.2.18 (jllllll version):

llama_print_timings:        load time =    9984.09 ms
llama_print_timings:      sample time =     124.98 ms /   200 runs   (    0.62 ms per token,  1600.19 tokens per second)
llama_print_timings: prompt eval time =   90034.89 ms /  3200 tokens (   28.14 ms per token,    35.54 tokens per second)
llama_print_timings:        eval time =  141965.92 ms /   199 runs   (  713.40 ms per token,     1.40 tokens per second)
llama_print_timings:       total time =  233312.03 ms

0.2.18 without -DLLAMA_CUDA_FORCE_MMQ=ON:

llama_print_timings:        load time =   41630.44 ms
llama_print_timings:      sample time =     126.56 ms /   200 runs   (    0.63 ms per token,  1580.24 tokens per second)
llama_print_timings: prompt eval time =  289150.58 ms /  3200 tokens (   90.36 ms per token,    11.07 tokens per second)
llama_print_timings:        eval time =  147013.30 ms /   199 runs   (  738.76 ms per token,     1.35 tokens per second)
llama_print_timings:       total time =  437592.15 ms

I have kept the llama_cpp_cuda libraries and the cpu option in this new PR: #4637

oobabooga deleted the llamacpp-bump branch on November 18, 2023 at 21:39