I only noticed this bug recently, after I started using shorter TTL values with a chatty model (QwQ). It is very easy to reproduce with a very short TTL (like 10 seconds) and a prompt that takes longer to run than the TTL.
Steps to reproduce:
Set the TTL of a sufficiently large model to 10 seconds
Ask the model to tell a story, and make sure the story takes longer than 10 seconds to generate
Expected outcome:
The model finishes generating the story, and only then does the TTL start counting, giving you 10 seconds to ask a follow-up question
Actual outcome:
llama-swap prints a "!!! Unloading model Qwen2.5-Coder-32B-Instruct-Q4_K_S, TTL of 10 reached." message midway through the generation. Thankfully it does not unload the model while it's still generating.
But it does instantly unload the model after the prompt is done, resulting in reloads of the model if you ask a followup question.
Suggested fix:
Consider the model idle when it finishes processing all requests, and start counting towards the TTL when that happens.
Consider the model busy as soon as a new request comes in. The model is considered busy until it finishes. Only then will the TTL start counting.
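The two rules above could be sketched roughly as follows. This is only an illustration of the suggested behaviour, not llama-swap's actual code: the class `IdleTTLTracker` and its methods are hypothetical names made up for this example, and the injectable `clock` parameter is an assumption added so the logic can be checked without waiting in real time.

```python
import threading
import time


class IdleTTLTracker:
    """Hypothetical helper: the TTL only counts while no request is in flight."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock              # assumption: injectable for testing
        self._lock = threading.Lock()
        self._in_flight = 0              # number of requests being processed
        self._idle_since = clock()       # idle from load until the first request

    def request_started(self):
        """Mark the model busy as soon as a new request comes in."""
        with self._lock:
            self._in_flight += 1
            self._idle_since = None      # busy: the TTL countdown is suspended

    def request_finished(self):
        """Start the TTL countdown once the last in-flight request ends."""
        with self._lock:
            if self._in_flight == 0:
                raise RuntimeError("request_finished called with no request in flight")
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle_since = self._clock()

    def should_unload(self):
        """Return True only if the model has been idle for at least the TTL."""
        with self._lock:
            if self._idle_since is None:
                return False             # still generating, never unload
            return self._clock() - self._idle_since >= self.ttl_seconds
```

With this approach, a 25-second generation against a 10-second TTL would never trigger an unload while it is running, and the follow-up window would start only when the response is complete. Counting in-flight requests, rather than keeping a single busy flag, keeps the model loaded while overlapping requests are still being served.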