Mixtral Support When? #557
Comments
Is this supported upstream in llama.cpp yet? If so, it'll be in the next release once I merge it |
I don't know much about llama.cpp, but from what I have seen, no. There is some experimental stuff going on, though. |
A fork claims it supports Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/discussions/8 I haven't had a chance to test it yet. |
It's supported on the mixtral branch of llama.cpp. Tested it with Mixtral Instruct Q4_M from TheBloke, and it works fine. |
It is better to wait for the release from LostRuins. That fork is questionable |
I personally wouldn't trust the compiled release on a random fork either, but a few heuristic positives on VirusTotal aren't a reliable indicator of whether an executable is dangerous; they will probably show up for a lot of unknown executables from GitHub. It wouldn't be too hard for anyone who wants to use this fork to review the code and compile it (it's only 2 commits ahead, 1 of which is a merge from the upstream mixtral branch of llama.cpp) |
(For information) I've tested that fork. It worked really well! But then, after about 800 tokens of roleplay, it suddenly went completely off the rails, printing absolute nonsense like:
Restarting does not help. Lowering the temperature does not help either. I tried 32k and 8k contexts. I don't know what's going on, but given the model's superior quality on short stories, this must be a bug somewhere (maybe on my side, if nobody else is seeing this). |
Anecdotally this output looks to me like what happens when RoPE is misconfigured. |
I thought RoPE gets auto-set for GGUF. I had similar output when I tried going above 4k context |
The PR to track is here: ggerganov#4406 |
I've tried manually setting the RoPE base to 1000000.0 or 10000.0 with context lengths of 32000, 32768, and ~8300, but nothing resolved the issue. |
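For anyone curious why a wrong RoPE base would wreck long-context output specifically: the base sets how fast each embedding pair rotates with position, so a mismatch with the base the model was trained on (reportedly 1000000.0 for Mixtral) only starts to hurt once positions get large. A minimal illustrative sketch, not koboldcpp code (the head dimension of 128 is an assumption):

```python
# Illustrative only: how the RoPE base changes per-position rotation angles.
def rope_inv_freqs(head_dim: int, base: float):
    # one inverse frequency per pair of embedding dimensions
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

head_dim = 128  # assumed head size, typical for models of this class
for base in (10000.0, 1000000.0):
    inv = rope_inv_freqs(head_dim, base)
    # angle of the slowest-rotating pair at position 8000
    print(f"base={base:>9.0f}  slowest pair angle at pos 8000: {8000 * inv[-1]:.4f} rad")
```

With the smaller base, positions past a few thousand tokens rotate far further than anything seen in training, which would match the "fine for ~800 tokens, then nonsense" behaviour described above; whether that is actually what the Frankenstein fork got wrong is only a guess.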
It's merged |
v1.52 is out, mixtral support is added, please try it. Note: Mixtral currently does prompt processing very slowly. You may want to try with |
Maybe I'm dumb, but disabling batch processing doesn't make it go any faster; both are slow, and if someone put a gun to my head, I'd say batches of 512 are still a little bit faster than no batch at all. To me it seems no-batch just looks faster because it updates the CLI more often. But yeah, it's really painful for context sizes >4000 |
I downloaded mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf and tried again at the same place where mixtral-8x7b-v0.1.Q5_K_M.gguf failed. It worked great!! In both the Frankenstein fork and official koboldcpp-1.52, with the exact same settings. Moreover, my story does not include special [INST] tags, so the base model ought to behave even better than the instruct one. And it does, until it breaks. P.S. BLAS batching is working normally in 1.52. |
I tried it with the model "synthia-moe-v3-mixtral-8x7b". Initial context processing is VERY slow, generation is fast, BUT: the model has a very bad memory - it doesn't remember the name of a character that came up two replies ago. I suspect some bug in context processing via context shift. Or a defect in the model, quantization, and the like... |
Can confirm - context processing is VERY slow with every model I tried; as soon as I use a smaller quant that fits entirely into VRAM, everything is super fast. Any solution to this? |
Tried again on the newest version of the program, only the model is now "synthia-moe-v3-mixtral-8x7b.Q6_K.gguf". Much better. Good model, at least no dumber than a 70b, but generation is much faster (~3 tokens per second on my system). But with a context of 4k tokens, you have to wait 10+ minutes for the first response. It's about the same with a regular 70b model, but its speed only allowed it to be used for demo purposes. It is different with this model. The issue of context preservation is now more relevant than ever :) |
Oddly enough, I can't see anyone mentioning this problem on llama.cpp's official repo. mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf gives very good results for me and feels like a great all-rounder (haven't tested Synthia yet), but this BLAS processing issue is stopping me from enjoying the model. |
Can you guys actually measure your BLAS with different strategies, sizes and models? I'll try to present mine. I think 512 tokens of context + 512 tokens of generation would be enough for benchmarking, let's see… |
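If it helps standardize the numbers, here is a rough timing sketch against koboldcpp's KoboldAI-compatible generate endpoint (the URL, port 5001, and field names are my assumptions from the standard API; koboldcpp's own console timings are probably the more precise figures to quote). Restart the server with whichever --blasbatchsize / offload setting you want to compare before each run:

```python
# Rough end-to-end timing for ~512 tokens of prompt + 512 generated tokens.
import time
import requests

ENDPOINT = "http://localhost:5001/api/v1/generate"  # default koboldcpp port (assumption)
prompt = "The quick brown fox jumps over the lazy dog. " * 55  # ~500+ tokens of filler

payload = {
    "prompt": prompt,
    "max_length": 512,           # new tokens to generate
    "max_context_length": 4096,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(ENDPOINT, json=payload, timeout=3600)
elapsed = time.time() - start

text = resp.json()["results"][0]["text"]
print(f"total wall time: {elapsed:.1f}s for 512 generated tokens "
      f"(~{512 / elapsed:.2f} tok/s end-to-end, prompt processing included)")
```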
Okay, my results with OpenBLAS. Model
Model
Model
I don't see any HUGE difference here. No batch is just slightly better than max batch for mixtral. |
For Mixtral-type models, it does not make sense to consider quants below 5_0. They are stupid. Apparently something gets corrupted too much by quantization. The 6K Mixtral model is also 30% slower than the 5_0 model... |
This is synthia-moe-v3-mixtral-8x7b.Q4_K_M.gguf processing 995 tokens and then generating 512 new tokens on an RTX 3060 12 GB / Ryzen 5 5600 / 3066 MHz RAM PC. Seems like offloading does help, but not by a lot. I also wanted to include tests with
|
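To make the offloading comparison above reproducible, a small sweep over --gpulayers could look like the sketch below (flag names are taken from koboldcpp's --help as I remember them, so double-check against your version; the model path is just an example). Benchmark each launched instance with the harness above before stopping it:

```python
# Launch koboldcpp with different GPU offload levels, one at a time.
import subprocess

MODEL = "synthia-moe-v3-mixtral-8x7b.Q4_K_M.gguf"  # example path

for layers in (0, 10, 20, 27):
    cmd = [
        "python", "koboldcpp.py",
        "--model", MODEL,
        "--usecublas",
        "--gpulayers", str(layers),
        "--blasbatchsize", "512",
        "--contextsize", "4096",
    ]
    print("launching:", " ".join(cmd))
    proc = subprocess.Popen(cmd)
    input(f"benchmark with --gpulayers {layers}, then press Enter to stop...")
    proc.terminate()
    proc.wait()
```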
As I said, for me … Though I'm not sure whether I should re-download a larger instruct one, or just wait for a new mix of those.
Maybe this is what happened with the Frankenstein fork too? Later today I will hopefully repeat my setup but with CuBLAS, since I have an RTX 3060 too! |
I am getting information from various sources that all K-quant Mixtral models are broken. I have personally tested Q2_K and Q3_K and can confirm this. However, I have also tested Q6_K and it seems to be OK, but you should keep this information in mind. Only Q_0 models should be used for now. |
My results with CuBLAS for
This is actually good! Processing time is still lower than generation time, large batch is better. Here is my server KCPPS:
And here is my client JSON:
I have also downloaded |
Try running the program and immediately give the model a 4k context (a common scenario when continuing a chat). Is everything still fine? My system (intel 12500) takes >10 minutes for the first response. And after that, it's easy - until the model screws up and needs a reroll and it starts recalculating the entire context. It's a pain. |
All my experiments consisted of restarting koboldcpp and giving it 512 tokens of context for generation of an additional 512 tokens, resulting in … Given the maximal tested BLAS batch size of 512, I don't think having 4096 (of e.g. 8192) tokens already in context would matter any differently than just
I gave my KCPPS and JSON. Try those and compare your own results. (Maybe something fishy is going on, and yours will differ even with the exact same setup - that would be interesting to debug together) |
In version 1.52.2, nothing noticeable has changed in the speed of prompt processing for Mixtral models. (Checked on two models). |
Same here; tested 4-5 MoE models, always the same - the first message takes 4-5 minutes (context processing; generation is always OK), then it works normally/fast. |
We still haven't done an independent test with common settings. |
OpenBLAS probably has its own internal thread scheduler that handles the GEMM routines. |
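If that is the suspicion, the usual knob is the OPENBLAS_NUM_THREADS environment variable; whether koboldcpp's bundled OpenBLAS build honours it is an assumption worth verifying, but it is cheap to test, e.g.:

```python
# Launch koboldcpp with an explicit OpenBLAS thread count (assumes the bundled
# OpenBLAS respects OPENBLAS_NUM_THREADS; the model path and counts are examples).
import os
import subprocess

env = dict(os.environ, OPENBLAS_NUM_THREADS="6")  # match your physical core count
subprocess.run(
    ["python", "koboldcpp.py",
     "--model", "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
     "--threads", "6", "--blasbatchsize", "512"],
    env=env,
)
```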
Seems like a recent PR in llama.cpp managed to fix mixtral slow prompt processing on CUDA. Take a look: ggerganov#4538 Edit: They are currently working on partial offload support separately (ggerganov#4553) |
Tested the CUDA PR with koboldcpp, and I got an 11x speedup with my 2x P40 setup (from 0.1 tok/sec at full 32k ctx to 1.4 tok/sec) |
Nice, I'll make sure it goes into the next ver. |
In the new version (1.53), the speed of prompt processing in Mixtral models is good. The performance of the graphics card is noticeable :) |
Unsurprisingly, the new Mixtral-8x7B, and more specifically Mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf, does not work. As other users have experienced, it gives the error create_tensor: tensor 'blk.0.ffn_gate.weight' not found. I understand that it just came out and will take some time to get up and working; I'm just trying to put it on the radar, as I haven't seen anyone talk about it here. If support for it gets added in the next update I'd be happy :D