# Mixtral offloading

This project is a fork of the original Mixtral offloading project. Our main contributions are benchmarking various caching strategies (including layer-wise independent caches that accommodate the varying distribution of expert selections across layers) and upper-bounding the performance of speculative decoding by hard-coding expert activations for a selected set of prompts.

Specifically, dvmazur/mixtral-offloading accelerates token generation with i) LRU caching of experts and ii) speculative pre-loading, i.e. predicting the active experts ahead of time. In our project, we examine these two ideas in depth and conduct a comprehensive analysis. Our investigation revealed the following:

* Performance (measured by throughput) is largely unaffected by the caching strategy: techniques such as LRU and LFU caching offer only marginal improvements over a fully random cache-eviction policy (see the sketch after this list).
* Speculative pre-loading of experts offers no further performance gains for 4-bit quantized MoE inference, as throughput is bound by CPU-GPU communication overhead.
* Reducing communication between the GPU and CPU by conducting inference on the CPU is a favourable approach for MoE inference. Consequently, developing quantized multi-precision operation kernels for CPU inference is the most promising, albeit challenging, direction for further optimization.
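
To make the caching comparison concrete, below is a minimal sketch (not the code from this repository or from dvmazur/mixtral-offloading) of an on-GPU expert cache with a pluggable eviction policy. The `load_fn` callback, which copies an expert's weights from CPU to GPU, is a hypothetical stand-in for the project's actual offloading machinery:

```python
import random
from collections import OrderedDict, Counter


class ExpertCache:
    """Sketch of an expert cache with LRU, LFU, or random eviction."""

    def __init__(self, capacity, load_fn, policy="lru"):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: moves expert weights CPU -> GPU
        self.policy = policy
        self.cache = OrderedDict()      # expert_id -> expert weights resident on GPU
        self.freq = Counter()           # access counts, used only by the LFU policy

    def get(self, expert_id):
        self.freq[expert_id] += 1
        if expert_id in self.cache:
            # Cache hit: refresh recency so LRU keeps this expert.
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: evict if full, then load the expert from CPU memory.
        if len(self.cache) >= self.capacity:
            self._evict()
        expert = self.load_fn(expert_id)
        self.cache[expert_id] = expert
        return expert

    def _evict(self):
        if self.policy == "lru":
            victim = next(iter(self.cache))                        # least recently used
        elif self.policy == "lfu":
            victim = min(self.cache, key=lambda e: self.freq[e])   # least frequently used
        else:  # "random"
            victim = random.choice(list(self.cache))
        del self.cache[victim]
```

Swapping `policy` between `"lru"`, `"lfu"`, and `"random"` while holding everything else fixed is, in essence, the experiment behind the first finding above: the eviction rule barely changes how often experts must be fetched across the CPU-GPU link, so throughput stays roughly the same.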