Possible ideas for speeding up CPU inference with Mixtral (KV cache prioritization) #4518
Comments
We can play with these ideas, though I don't expect much. Prioritising the KV layers seems more viable, but it would also increase the host <-> device transfers, so I'm not sure if there will be a net positive. I don't expect offloading full experts to help, because it seems that each layer chooses experts with very even probabilities. Haven't done detailed stats on this, so I could be wrong. But for sure we can experiment around this.
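For reference, a minimal sketch of how evenly the router actually spreads tokens across experts could be measured, assuming the chosen (layer, expert) pairs get logged somewhere (the log format and file name here are hypothetical, e.g. from a temporary debug print in the MoE routing path):

```python
# Hypothetical analysis of logged expert choices: one "layer expert" pair per line.
# Whether routing is near-uniform decides if pinning whole experts in VRAM can pay off.
from collections import Counter, defaultdict
import math
import sys

per_layer = defaultdict(Counter)
with open(sys.argv[1]) as f:          # e.g. expert_choices.log (hypothetical)
    for line in f:
        layer, expert = map(int, line.split())
        per_layer[layer][expert] += 1

for layer in sorted(per_layer):
    counts = per_layer[layer]
    total = sum(counts.values())
    probs = [counts[e] / total for e in sorted(counts)]
    # Entropy close to log2(8) = 3 bits means near-uniform expert usage for that layer.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    print(f"layer {layer:2d}: {['%.2f' % p for p in probs]} entropy={entropy:.2f}")
```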
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Possible Implementations
KV Cache / Context prioritization
The KV layers are quite small for Mixtral at moderate context sizes (e.g. 4096 ctx) because it uses Grouped Query Attention, which may make it beneficial to prioritize offloading them rather than splitting the KV cache equally amongst all layers (a rough size estimate is sketched below).
I may be wrong about this, but considering that the split KV layers PR massively improves prompt processing speed in proportion to how many layers are offloaded to the GPU, I think it would be much more viable for MoE setups to prioritize offloading the smaller GQA KV layers first rather than distributing them evenly.
(This is from an old prompt processing test I did on a 13B model, which has a much larger KV cache...)
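As a rough sanity check on how small the GQA KV cache actually is, here is a back-of-the-envelope estimate for Mixtral-8x7B at 4096 context, assuming the commonly cited config values (32 layers, 8 KV heads of dimension 128) and fp16 K/V entries; the exact numbers in llama.cpp depend on the KV cache type used:

```python
# Back-of-the-envelope KV cache size for Mixtral-8x7B with GQA (assumed config values).
n_layers   = 32      # transformer blocks
n_kv_heads = 8       # GQA: far fewer KV heads than the 32 query heads
head_dim   = 128     # per-head dimension
n_ctx      = 4096    # context size in tokens
bytes_per  = 2       # fp16 K/V entries

per_layer_bytes = 2 * n_kv_heads * head_dim * n_ctx * bytes_per   # K and V
total_bytes     = per_layer_bytes * n_layers

print(f"per layer: {per_layer_bytes / 2**20:.0f} MiB")   # ~16 MiB
print(f"total:     {total_bytes / 2**20:.0f} MiB")       # ~512 MiB
```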
Splitting 'Grouped Layers'
In addition to this, it may be most beneficial to keep as many full experts in VRAM as possible, so that the slowdown only applies to the one or two particular experts that have their layers (or some of their layers) in system RAM.
The way I interpret how llama.cpp handles it right now is that each layer you offload via -ngl actually contains the hidden layers of all 8 Mixtral experts (my assumption is that -ngl effectively specifies 'layer groups').
Wouldn't it be wiser to offload as many full experts as possible plus the KV cache, or would you get a net loss in terms of parallelization efficiency? (A rough planning sketch follows at the end of this section.)
If most of the model's actual matmul work happens in VRAM, except for one or two odd experts, I think this could greatly benefit overall inference speed, since prompt processing is the number one issue that currently plagues memory-bound users and partial offloading in general.
7B generation speeds are fast enough on pure CPU for this to make sense to me.
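To make the "KV cache first, then whole experts" idea concrete, here is a minimal greedy planning sketch; this is not how llama.cpp currently assigns layers, and the per-expert size and VRAM budget used at the bottom are only illustrative placeholders:

```python
# Greedy offload plan (illustrative): fill VRAM with all KV tensors first,
# then as many complete experts as fit; everything else stays in system RAM.
from dataclasses import dataclass

MiB = 2**20

@dataclass
class Plan:
    gpu: list
    cpu: list

def plan_offload(vram_budget_bytes, kv_bytes_per_layer, n_layers,
                 expert_bytes, n_experts):
    gpu, cpu = [], []
    budget = vram_budget_bytes

    # 1) KV cache for every layer (small under GQA, big win for prompt processing).
    for layer in range(n_layers):
        if kv_bytes_per_layer <= budget:
            gpu.append(f"kv_layer_{layer}")
            budget -= kv_bytes_per_layer
        else:
            cpu.append(f"kv_layer_{layer}")

    # 2) Whole experts (all of an expert's FFN weights across every layer),
    #    so slowdowns only hit tokens routed to the experts left on the CPU.
    for expert in range(n_experts):
        if expert_bytes <= budget:
            gpu.append(f"expert_{expert}")
            budget -= expert_bytes
        else:
            cpu.append(f"expert_{expert}")

    return Plan(gpu, cpu)

# Illustrative numbers only: 16 MiB of KV per layer (see the estimate above),
# roughly 3 GiB per 4-bit-quantized expert, 12 GiB of usable VRAM.
plan = plan_offload(12 * 1024 * MiB, 16 * MiB, 32, 3 * 1024 * MiB, 8)
print("GPU:", plan.gpu)
print("CPU:", plan.cpu)
```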