Bump llama-cpp-python to 0.2.18 #4611
Conversation
Thanks, this enables Falcon 180B functionality on my Mac Studio.
I am not having any issues using the MMQ kernels. Maybe lost 0.3 t/s on an empty context with a 70B model. We still need to add the new Min_P sampler to the pure llama.cpp loader, and all the samplers that exllama added too.
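For context, Min_P (Min-P) sampling keeps only the tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalizes and samples from the survivors. A minimal NumPy sketch of that idea (not the llama.cpp implementation; the names and values are illustrative):

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Rough sketch of Min-P filtering: drop tokens whose probability is
    below min_p * p(most likely token), then renormalize the rest."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Example: sample a token id from the filtered distribution
rng = np.random.default_rng(0)
logits = rng.normal(size=32000)  # fake vocabulary logits
token_id = rng.choice(len(logits), p=min_p_filter(logits, 0.1))
```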
I think that the missing exllamav2 samplers should be covered now: 58c6001
Min_P cannot be added to the llama.cpp loader yet: while the Python bindings for the C++ sampling functions are available, the parameter is not implemented in the _create_completion function: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1294

I added the (apparently) new "seed" generation parameter though, which replaces the previous, useless "seed" loading parameter.

I'm studying the possibility of removing the "CPU only" version of llama-cpp-python from the requirements and installing only a single version for CUDA / AMD users, with the original "llama_cpp" namespace instead of "llamacpp_cuda". The modified namespace seems to be causing problems, and I don't see the use case for two libraries now that the installer handles the AVX2, no AVX2, CUDA, and CPU-only cases automatically. Requesting feedback from @jllllll on whether this makes sense.
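To illustrate why the split namespace is awkward, here is a rough sketch of the kind of runtime fallback a loader needs when the CUDA build ships under a renamed module. The module names follow the comment above and are assumptions, not the exact webui code:

```python
# Hypothetical sketch of the dual-package situation described above;
# module names follow the comment and may not match the real packages.
try:
    import llama_cpp_cuda  # CUDA build shipped under a renamed namespace
except ImportError:
    llama_cpp_cuda = None

try:
    import llama_cpp       # CPU-only build under the original namespace
except ImportError:
    llama_cpp = None

def llama_cpp_lib(prefer_cpu: bool = False):
    """Return whichever binding is available, preferring the CUDA build."""
    if not prefer_cpu and llama_cpp_cuda is not None:
        return llama_cpp_cuda
    return llama_cpp
```

Shipping a single package under the original `llama_cpp` name would make this fallback, and the confusion that comes with it, unnecessary.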
This reverts commit 923c8e2.
@oobabooga I am working on a fix for the issues that have arisen with this.
@jllllll I was inclined to keep only the CUDA wheels for simplicity. Some people were confused by the "cpu" checkbox in the llama.cpp loader, and I also haven't seen anyone using the "cpu" option recently. But if you feel it's best to keep this option, then we can keep it.

I am at this very moment trying to build wheels using your workflows with the `-DLLAMA_CUDA_FORCE_MMQ=ON` flag. There was also a report of higher memory usage on a Mac in the latest version; I don't know what is up with that.

Here is my commit (I was going to PR it to you later if it works): oobabooga/llama-cpp-python-cuBLAS-wheels@beb1b54

My wheels are still building, so I haven't tested them. Your workflows are mind-blowing, by the way.
I'll do a local build with that and test it on my 1080 Ti to see what the performance difference is.
Yes, it makes fully offloaded performance for a 13B model on a 3090 go from ~45 tokens/second to ~30 tokens/second (or something like that). This optimization was introduced in ggerganov/llama.cpp#3776, but it doesn't work for all GPUs. I think that there are plans for detecting the GPU model automatically at runtime in llama.cpp, but for now, the switch has to be made at compile time. I think that the best we can do is target the lower-end GPUs until that update happens in llama.cpp.
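For reference, a hedged sketch of how a compile-time switch like this is typically passed when building llama-cpp-python from source. `CMAKE_ARGS` and `FORCE_CMAKE` are the environment variables the upstream package documents for custom builds; the exact flags and version pin here are illustrative, not the actual workflow change:

```python
# Hedged sketch: build llama-cpp-python from source with the MMQ kernels
# forced at compile time. CMAKE_ARGS / FORCE_CMAKE are the environment
# variables the upstream build reads; the pin and flags are illustrative.
import os
import subprocess
import sys

env = dict(
    os.environ,
    CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=ON",
    FORCE_CMAKE="1",
)
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-cache-dir",
     "llama-cpp-python==0.2.18"],
    env=env,
    check=True,
)
```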
Without flag: 20-23 t/s

Hopefully they will add GPU detection soon. That is something that has been needed for a while now.
I ran a test on my GTX 1650 and couldn't get the old performance back with my workflow wheels:

0.2.11:
0.2.18 without -DLLAMA_CUDA_FORCE_MMQ=ON:
0.2.18 with -DLLAMA_CUDA_FORCE_MMQ=ON (supposedly):
Most likely I put the flag in the wrong place in my workflow.
This is what I did: jllllll/llama-cpp-python-cuBLAS-wheels@f6d1e53
I think that your wheels will work; my logs say that MMQ was not being forced even though it should have been:
All of the relevant 0.2.18 wheels should be rebuilt now.
Thank you! This fixes the broken prompt processing times. The speed is not better than what it was in late September, but at least it's not a lot worse:

0.2.11:
0.2.18 (jllllll version):
0.2.18 without -DLLAMA_CUDA_FORCE_MMQ=ON:
I have kept the `-DLLAMA_CUDA_FORCE_MMQ=ON` flag.
GPU offloading didn't work in 0.2.17, and now 0.2.18 crashes with "Illegal instruction (core dumped)" when I try to load a model. I'll leave this PR here until this gets figured out.