Mamba #694
Comments
The binaries in the latest release (0.11.1) are a little too old. The ones in the master branch were compiled after that PR was merged, so in theory they should include Mamba support. I'd be interested to hear how that goes if you try it!
Oh, I'm using 0.11.2 (from NuGet). I tried copying the binaries from master and replacing the ones in /bin with them.
Is a new build of the NuGet package from master needed in this case?
That won't work I'm afraid. The llama.cpp API is unstable, so every time the binaries are updated there are various internal changes on the C# side to work with the changed API. You always need to use the correct set of binaries with the correct version of the C# code.
Yep. I just compiled the main package and the CPU version and the results were the same: same exit code and assertion log.
I don't have any ideas at the moment. I know Mamba is a bit of an unusual architecture, just because I've seen various comments inside llama.cpp about how certain APIs need to be adjusted for Mamba, or don't quite make sense in a Mamba context. We'd definitely be interested in any investigations/PRs for Mamba support!
Yes, NuGet caches the package and will not pick up your compiled one if it has the same version tag.
That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.
About this, the model was one of the few I was able to find on Hugging Face in GGUF format that was actually Mamba (MambaHermes 3B). I tested it with the same formatting, using the same processor I made for Phi-3, and it also "kinda worked" (the responses were very short, but more coherent). I also got it working a little better with the version quantized to 6 bits instead of 4. But I noticed something a little strange: is there something in the implementation of llama.cpp that makes models run progressively slower? I thought it was because I was using transformer-based models before, but even with Mamba the time to first token increases absurdly with each message (from about 1 second to the first token, then 5, then 10, then 26, and so on). One of my tests where it performed reasonably well:
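For anyone trying to reproduce this, below is a minimal sketch of how one might measure the time to first token per message with LLamaSharp's `StatelessExecutor`, which re-submits the accumulated history on every call. This is not the reporter's actual code: the model path, prompts, and parameter values are placeholders, and the API shown assumes roughly the 0.11/0.12-era LLamaSharp surface.

```csharp
using System;
using System.Diagnostics;
using LLama;
using LLama.Common;

// NOTE: model path, prompts and parameter values are placeholders.
var parameters = new ModelParams("MambaHermes-3B.Q6_K.gguf") { ContextSize = 4096 };
using var model = LLamaWeights.LoadFromFile(parameters);

// A stateless executor re-evaluates the full prompt on every call.
var executor = new StatelessExecutor(model, parameters);

var history = "";
string[] messages = { "Hello!", "What is Mamba?", "How does it differ from a transformer?" };

foreach (var message in messages)
{
    history += $"User: {message}\nAssistant: ";

    var stopwatch = Stopwatch.StartNew();
    TimeSpan? firstToken = null;
    var reply = "";

    await foreach (var token in executor.InferAsync(
        history, new InferenceParams { MaxTokens = 128, AntiPrompts = new[] { "User:" } }))
    {
        firstToken ??= stopwatch.Elapsed; // time to first token for this turn
        reply += token;
    }

    history += reply + "\n";
    Console.WriteLine($"Time to first token: {firstToken?.TotalSeconds:F1}s");
}
```

Because the prompt passed to `InferAsync` grows with every turn, the measured time to first token is expected to grow as well, which matches the pattern described above.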
AFAIK, there's no such thing in llama.cpp. Could you please post the Hugging Face model link here so that we can try to reproduce this case?
Though LM Studio is not open-source, if I remember correctly it also uses llama.cpp as the backend. As you mentioned above, Phi-3 works well in LM Studio while Mamba becomes slower in llama.cpp. That doesn't necessarily mean it's llama.cpp's problem; it could also be the model's problem. Could you please try Mamba in LM Studio, or try Phi-3 with llama.cpp/LLamaSharp?
You'll get a progressive slowdown if you are using a stateless executor and submitting a larger and larger chat history each time. The stateful executors internally store the chat history and should take around the same time for every token. I'm not sure exactly how the situation differs for Mamba, but it should be roughly the same, AFAIK.
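As a rough illustration of the stateful approach described above, the sketch below wraps an `InteractiveExecutor` in a `ChatSession`, so the evaluated context is kept between messages and only the newly added tokens need to be processed each turn. It again assumes roughly the 0.11/0.12-era LLamaSharp API, with a placeholder model path and parameter values.

```csharp
using System;
using LLama;
using LLama.Common;

// NOTE: model path and parameter values are placeholders.
var parameters = new ModelParams("MambaHermes-3B.Q6_K.gguf") { ContextSize = 4096 };
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

// The interactive executor keeps the evaluated context between calls,
// so each new message only processes the newly added tokens.
var executor = new InteractiveExecutor(context);
var session = new ChatSession(executor);

var inferenceParams = new InferenceParams { MaxTokens = 128, AntiPrompts = new[] { "User:" } };

while (true)
{
    Console.Write("User: ");
    var input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input)) break;

    await foreach (var token in session.ChatAsync(
        new ChatHistory.Message(AuthorRole.User, input), inferenceParams))
    {
        Console.Write(token);
    }
    Console.WriteLine();
}
```

With a stateful executor like this, the time to first token should stay roughly flat across turns instead of growing with the accumulated history, as it does in the stateless loop sketched earlier.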
I'll close this one now, since Mamba is now supported. If there are still problems please don't hesitate to re-open or to create new issues :)
Is Mamba already supported in the current version of llama.cpp that this library uses?
(ggerganov/llama.cpp#5328)