A key limitation of the X-LoRA architecture is the need for two forward passes of the model per generation step: one to compute the adapter scalings and one to produce the output. To trade off model performance for speed, mistral.rs allows the user to reduce the granularity of the scalings by caching them, a technique we call Non Granular Scalings.

For the first `tgt_non_granular_index` generated tokens, the scalings are computed as normal. After that, the last computed scalings are cached and reused, so the extra scaling pass is skipped for the remainder of the generation.
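To illustrate the idea, here is a plain-Python sketch of the caching scheme. It is not the mistral.rs implementation; `NonGranularScalings`, `compute_scalings`, and the step logic are hypothetical stand-ins:

```python
from typing import Callable, List, Optional

class NonGranularScalings:
    """Sketch only: run the scaling pass for the first `tgt_non_granular_index`
    steps, then reuse the last result instead of a second forward pass."""

    def __init__(self, tgt_non_granular_index: int):
        self.tgt_non_granular_index = tgt_non_granular_index
        self.cached: Optional[List[float]] = None

    def get(self, step: int, compute_scalings: Callable[[], List[float]]) -> List[float]:
        if step < self.tgt_non_granular_index or self.cached is None:
            # Early in generation: pay for the extra scaling forward pass.
            self.cached = compute_scalings()
        # Later steps reuse the cached scalings, skipping the second pass.
        return self.cached

# Hypothetical usage: after step 5, compute_scalings is no longer called.
sc = NonGranularScalings(tgt_non_granular_index=5)
for step in range(10):
    scalings = sc.get(step, compute_scalings=lambda: [0.2, 0.8])
```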
This can be enabled by passing `--tgt-non-granular-index` followed by the desired index:

```bash
./mistralrs_server --port 1234 x-lora-plain -o orderings/xlora-paper-ordering.json -x lamm-mit/x-lora --tgt-non-granular-index 5
```
In the Python API, set the `tgt_non_granular_index` attribute to a non-`None` value in the `Which` selection:
```python
from mistralrs import Runner, Which

runner = Runner(
    which=Which.XLoraGGUF(
        tok_model_id=None,  # Automatically determine from ordering file
        quantized_model_id="TheBloke/zephyr-7B-beta-GGUF",
        quantized_filename="zephyr-7b-beta.Q4_0.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
        xlora_model_id="lamm-mit/x-lora",
        order="orderings/xlora-paper-ordering.json",
        tgt_non_granular_index=5,
    )
)
...
```
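From here, requests are sent through the runner as usual. A minimal sketch, assuming the `ChatCompletionRequest` API of the mistralrs Python package (the model name and prompt are illustrative):

```python
from mistralrs import ChatCompletionRequest

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",  # illustrative model name
        messages=[{"role": "user", "content": "Tell me about X-LoRA."}],
        max_tokens=256,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```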