-
Thank you for submitting such a detailed FR. I thoroughly understand your use case. Before delving into another post on why tabby relies on the token decoding interface ...
These enhancements should provide a reasonable tradeoff / deployment choice for chat use cases.
-
We can't use Ollama as a backend because it only offers an OpenAI-like API that returns finished text. Tabby requires an inference endpoint that returns logits, which lets us customize the decoding step specifically for code generation. That lower-level interface is why Tabby runs its own inference path rather than building on Ollama's simpler API.
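To make the distinction concrete, here is a minimal, hypothetical sketch (not Tabby's actual implementation) of the kind of step that is only possible when the backend exposes raw logits: the client can mask or reweight token probabilities before sampling, e.g. to keep the output valid in a code context. The `banned_ids` set is a made-up example.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, banned_ids: set[int], temperature: float = 0.8) -> int:
    """Pick the next token from raw logits, with custom constraints.

    This hook only exists when the inference endpoint returns logits;
    an OpenAI-style chat API hands back finished text, so there is no
    place to apply code-specific constraints like the mask below.
    """
    logits = logits.copy()
    # Example constraint: forbid tokens that would break the code context
    # (hypothetical -- e.g. tokens that open a markdown fence mid-completion).
    for tok in banned_ids:
        logits[tok] = -np.inf

    # Standard temperature sampling over the remaining distribution.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Usage with a fake 8-token vocabulary:
fake_logits = np.array([1.2, 0.3, -0.5, 2.1, 0.0, -1.0, 0.7, 1.9])
print(sample_next_token(fake_logits, banned_ids={3}))
```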
-
I know some others have requested alternative backends/APIs in other threads, but I wanted to ask about supporting Ollama and/or Litellm. The reason I'm asking is that consumer hardware is often limited. My 4090 only has 24 GB of VRAM, and Tabby takes up about 10 GB when I'm using it.
Let's say I'm working on a project using Tabby in my code editor, then I want to jump to my WebUI to ask a coding question. I'd first need to ssh into my dedicated inference server, stop tabby to clear the VRAM, ask my question in the WebUI which uses Ollama as a backend, then restart tabby afterwards. This is not a realistic workflow.
The benefit of using a single backend like Ollama (or Litellm on top of Ollama), is that Ollama can dynamically switch out models on the fly, and it can queue requests. It would be much better if local LLM/AI projects supported such backends out of the box to enable more efficient management of precious VRAM. If everyone just relied on using their own separate backends, we'd never be able to make use of multiple tools at once.
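To illustrate what I mean by a single shared backend: with Ollama the model is just a field on each request, so a completion tool and a chat UI can hit the same server, and it swaps weights and queues work as needed. A rough Python sketch against Ollama's default local endpoint (the model names are placeholders):

```python
import requests

OLLAMA = "http://localhost:11434"  # Ollama's default local endpoint

def generate(model: str, prompt: str) -> str:
    """One-shot generation request; Ollama loads or swaps the model as needed."""
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# A code-completion request and a chat-style question share the same server
# and the same 24 GB of VRAM; Ollama queues them and swaps models instead of
# requiring two resident backends. Model names below are examples only.
print(generate("codellama:7b", "def fizzbuzz(n):"))
print(generate("llama3:8b", "Explain the difference between a mutex and a semaphore."))
```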
I've only been using Tabby for a day or so, and it seems like something I'd definitely like to integrate into my workflow. However, since I also rely heavily on Ollama in my current workflow, I can't really use both simultaneously without creating extra hassle.
I guess the other alternative would be the ability to unload the model with a keybinding in the text editor (I use nvim). I've seen issue 624; however, that seems to be about shutting down the whole Docker container (which is running on the same machine). Just having the option to temporarily unload the model with an API call would be more suitable, something like the sketch below.
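For reference, this is the kind of "release VRAM now" call an editor keybinding could fire. The endpoint below is purely hypothetical: Tabby doesn't expose one today (that's the ask), and the path and port are made up.

```python
import requests

TABBY = "http://localhost:8080"  # wherever the Tabby server is listening

def unload_model() -> None:
    """Ask the server to drop model weights from VRAM (hypothetical endpoint)."""
    # NOTE: /v1/admin/unload does not exist in Tabby; it stands in for the
    # feature being requested here. A matching /v1/admin/load (or a lazy
    # reload on the next completion request) would restore the model later.
    requests.post(f"{TABBY}/v1/admin/unload", timeout=10).raise_for_status()

if __name__ == "__main__":
    unload_model()
```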