Mixtral Support When? #557
Comments
Is this supported upstream in llama.cpp yet? If so, it'll be in the next release once I merge it |
I don't know much about llama.cpp, but from what I have seen, no. There is some experimental stuff going on, though. |
A fork claims it supports Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/discussions/8 I haven't had a chance to test it yet. |
It's supported on the mixtral branch of llama.cpp. Tested it with Mixtral Instruct Q4_M from TheBloke, and it works fine. |
It is better to wait for the release from LostRuins. That fork is questionable |
I personally wouldn't trust the compiled release on a random fork either, but a few heuristic positives on VirusTotal aren't a reliable indicator of whether an executable is dangerous; they will probably show up for a lot of unknown executables from GitHub. It wouldn't be too hard for anyone who wants to use this fork to review the code and compile it (it's only 2 commits ahead, 1 of which is a merge from the upstream mixtral branch of llama.cpp) |
(For information) I've tested that fork. It worked really well! But then, after about 800 tokens of roleplay, it suddenly went completely off the rails, printing absolute nonsense like:
Restarting does not help. Lowering the temperature does not help either. I tried 32k and 8k contexts. I don't know what's going on, but given the model's superior quality on short stories, this must be a bug somewhere (maybe on my side, if nobody else is seeing this). |
Anecdotally this output looks to me like what happens when RoPE is misconfigured. |
I thought RoPE gets auto-set for GGUF. I had similar output when I tried going above 4k context |
The PR to track is here: ggerganov#4406 |
I've tried manually setting the RoPE base to 1000000.0 or 10000.0 with context lengths of 32000, 32768, and ~8300, but nothing resolved the issue. |
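For anyone curious why a wrong RoPE base would wreck long-context output specifically: the base sets how fast each embedding pair rotates with position, so a mismatch with the base the model was trained on (reportedly 1000000.0 for Mixtral) only starts to hurt once positions get large. A minimal illustrative sketch, not koboldcpp code (the head dimension of 128 is an assumption):

```python
# Illustrative only: how the RoPE base changes per-position rotation angles.
def rope_inv_freqs(head_dim: int, base: float):
    # one inverse frequency per pair of embedding dimensions
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

head_dim = 128  # assumed head size, typical for models of this class
for base in (10000.0, 1000000.0):
    inv = rope_inv_freqs(head_dim, base)
    # angle of the slowest-rotating pair at position 8000
    print(f"base={base:>9.0f}  slowest pair angle at pos 8000: {8000 * inv[-1]:.4f} rad")
```

With the smaller base, positions past a few thousand tokens rotate far further than anything seen in training, which would match the "fine for ~800 tokens, then nonsense" behaviour described above; whether that is actually what the Frankenstein fork got wrong is only a guess.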
It's merged |
v1.52 is out, mixtral support is added, please try it. Note: Mixtral currently does prompt processing very slowly. You may want to try with |
Maybe I'm dumb, but disabling batch processing doesn't make it go any faster; both are slow, and if someone put a gun to my head, I'd say batches of 512 are still a little bit faster than no batch at all. To me it seems no-batch just looks faster because it updates the CLI more often. But yeah, it's really painful for context sizes >4000 |
I downloaded mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf and tried again at the same place where mixtral-8x7b-v0.1.Q5_K_M.gguf failed. It worked great!! In both the Frankenstein fork and official koboldcpp-1.52, with the exact same settings. Moreover, my story does not include special [INST] tags, so the base model ought to behave even better than the instruct one. And it does, until it breaks. P.S. BLAS batching is working normally in 1.52. |
I tried it with the model "synthia-moe-v3-mixtral-8x7b". Initial context processing is VERY slow, generation is fast, BUT: the model has a very bad memory - it doesn't remember the name of a character that came up two replies ago. I suspect some bug in context processing via context shift. Or a defect in the model, quantization, and the like... |
Can confirm - context processing is VERY slow with every model I tried; as soon as I use a smaller quant that fits entirely into VRAM, everything is super fast. Any solution to this? |
Tried again on the newest version of the program, only the model is now "synthia-moe-v3-mixtral-8x7b.Q6_K.gguf". Much better. Good model, at least no dumber than a 70b, but generation is much faster (~3 tokens per second on my system). But with a context of 4k tokens, you have to wait 10+ minutes for the first response. It's about the same with a regular 70b model, but its speed only allowed it to be used for demo purposes. It is different with this model. The issue of context preservation is now more relevant than ever :) |
Oddly enough, I can't see anyone mentioning this problem on llama.cpp's official repo. mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf gives very good results for me and feels like a great all-rounder (haven't tested Synthia yet), but this BLAS processing issue is stopping me from enjoying the model. |
Can you guys actually measure your BLAS with different strategies, sizes and models? I'll try to present mine. I think 512 tokens of context + 512 tokens of generation would be enough for benchmarking, let's see… |
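If it helps standardize the numbers, here is a rough timing sketch against koboldcpp's KoboldAI-compatible generate endpoint (the URL, port 5001, and field names are my assumptions from the standard API; koboldcpp's own console timings are probably the more precise figures to quote). Restart the server with whichever --blasbatchsize / offload setting you want to compare before each run:

```python
# Rough end-to-end timing for ~512 tokens of prompt + 512 generated tokens.
import time
import requests

ENDPOINT = "http://localhost:5001/api/v1/generate"  # default koboldcpp port (assumption)
prompt = "The quick brown fox jumps over the lazy dog. " * 55  # ~500+ tokens of filler

payload = {
    "prompt": prompt,
    "max_length": 512,           # new tokens to generate
    "max_context_length": 4096,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(ENDPOINT, json=payload, timeout=3600)
elapsed = time.time() - start

text = resp.json()["results"][0]["text"]
print(f"total wall time: {elapsed:.1f}s for 512 generated tokens "
      f"(~{512 / elapsed:.2f} tok/s end-to-end, prompt processing included)")
```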
Okay, my results with OpenBLAS. Model
Model
Model
I don't see any HUGE difference here. No batch is just slightly better than max batch for mixtral. |
For Mixtral-type models, it does not make sense to consider quants below 5_0. They are stupid. Apparently something gets corrupted too much by quantization. The 6K Mixtral model is also 30% slower than the 5_0 model... |
This is synthia-moe-v3-mixtral-8x7b.Q4_K_M.gguf processing 995 tokens and then generating 512 new tokens on an RTX 3060 12 GB / Ryzen 5 5600 / 3066 MHz RAM PC. Seems like offloading does help, but not by a lot. I also wanted to include tests with
|
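To make the offloading comparison above reproducible, a small sweep over --gpulayers could look like the sketch below (flag names are taken from koboldcpp's --help as I remember them, so double-check against your version; the model path is just an example). Benchmark each launched instance with the harness above before stopping it:

```python
# Launch koboldcpp with different GPU offload levels, one at a time.
import subprocess

MODEL = "synthia-moe-v3-mixtral-8x7b.Q4_K_M.gguf"  # example path

for layers in (0, 10, 20, 27):
    cmd = [
        "python", "koboldcpp.py",
        "--model", MODEL,
        "--usecublas",
        "--gpulayers", str(layers),
        "--blasbatchsize", "512",
        "--contextsize", "4096",
    ]
    print("launching:", " ".join(cmd))
    proc = subprocess.Popen(cmd)
    input(f"benchmark with --gpulayers {layers}, then press Enter to stop...")
    proc.terminate()
    proc.wait()
```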
As I said, for me … Though I'm not sure whether I should re-download a larger instruct one, or just wait for a new mix of those.
Maybe this is what happened with the Frankenstein fork too? Later today I will hopefully repeat my setup but with CuBLAS, since I have an RTX 3060 too! |
I am getting information from various sources that all K-quant Mixtral models are broken. I have personally tested Q2_K and Q3_K and can confirm this. However, I have also tested Q6_K and it seems to be OK, but you should keep this information in mind. Only Q_0 models should be used for now. |
My results with CuBLAS for
This is actually good! Processing time is still lower than generation time, large batch is better. Here is my server KCPPS:
And here is my client JSON:
I have also downloaded |
Try running the program and immediately give the model a 4k context (a common scenario when continuing a chat). Is everything still fine? My system (intel 12500) takes >10 minutes for the first response. And after that, it's easy - until the model screws up and needs a reroll and it starts recalculating the entire context. It's a pain. |
All my experiments consisted of restarting koboldcpp and giving it 512 tokens of context for generation of an additional 512 tokens, resulting in … Given the maximal tested BLAS batch size of 512, I don't think having 4096 (of e.g. 8192) tokens already in context would matter any differently than just
I gave my KCPPS and JSON. Try those and compare your own results. (Maybe something fishy is going on, and yours will differ even with the exact same setup - that would be interesting to debug together) |
In version 1.52.2, nothing noticeable has changed in the speed of prompt processing for Mixtral models. (Checked on two models). |
Same here; tested 4-5 MoE models, always the same - the first message takes 4-5 minutes (context processing; generation is always OK), then it works normally/fast. |
We still haven't done an independent test with common settings. |
OpenBLAS probably has its own internal thread scheduler that handles the GEMM routines. |
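If that is the suspicion, the usual knob is the OPENBLAS_NUM_THREADS environment variable; whether koboldcpp's bundled OpenBLAS build honours it is an assumption worth verifying, but it is cheap to test, e.g.:

```python
# Launch koboldcpp with an explicit OpenBLAS thread count (assumes the bundled
# OpenBLAS respects OPENBLAS_NUM_THREADS; the model path and counts are examples).
import os
import subprocess

env = dict(os.environ, OPENBLAS_NUM_THREADS="6")  # match your physical core count
subprocess.run(
    ["python", "koboldcpp.py",
     "--model", "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
     "--threads", "6", "--blasbatchsize", "512"],
    env=env,
)
```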
Seems like a recent PR in llama.cpp managed to fix mixtral slow prompt processing on CUDA. Take a look: ggerganov#4538 Edit: They are currently working on partial offload support separately (ggerganov#4553) |
Tested the CUDA PR with koboldcpp, and I got an 11x speedup with my 2x P40 setup (from 0.1 tok/sec at full 32k ctx to 1.4 tok/sec) |
Nice, I'll make sure it goes into the next ver. |
In the new version (1.53), the speed of prompt processing in Mixtral models is good. The performance of the graphics card is noticeable :) |
Unsurprisingly, the new Mixtral-8x7B, and more specifically Mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf, does not work. As other users have experienced, it gives the error create_tensor: tensor 'blk.0.ffn_gate.weight' not found. I understand that it just came out and will take some time to get up and working; I'm just trying to put it on the radar, as I haven't seen anyone talk about it here. If support for it gets added in the next update I'd be happy :D