-
Thank you for submitting such a detailed FR. I thoroughly understand your use case. Before delving into another post on why tabby relies on the token decoding interface ...
These enhancements should provide a reasonable tradeoff / deployment choice for chat use cases.
-
We can't use Ollama as a backend because it only offers an OpenAI-like API that returns finished text. Tabby requires an inference endpoint that returns logits, which lets us customize the decoding step specifically for code generation. That lower-level interface is why Tabby runs its own inference path rather than building on Ollama's simpler API.
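To make the distinction concrete, here is a minimal, hypothetical sketch (not Tabby's actual implementation) of the kind of step that is only possible when the backend exposes raw logits: the client can mask or reweight token probabilities before sampling, e.g. to keep the output valid in a code context. The `banned_ids` set is a made-up example.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, banned_ids: set[int], temperature: float = 0.8) -> int:
    """Pick the next token from raw logits, with custom constraints.

    This hook only exists when the inference endpoint returns logits;
    an OpenAI-style chat API hands back finished text, so there is no
    place to apply code-specific constraints like the mask below.
    """
    logits = logits.copy()
    # Example constraint: forbid tokens that would break the code context
    # (hypothetical -- e.g. tokens that open a markdown fence mid-completion).
    for tok in banned_ids:
        logits[tok] = -np.inf

    # Standard temperature sampling over the remaining distribution.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Usage with a fake 8-token vocabulary:
fake_logits = np.array([1.2, 0.3, -0.5, 2.1, 0.0, -1.0, 0.7, 1.9])
print(sample_next_token(fake_logits, banned_ids={3}))
```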
-
I know some others have requested alternative backends/APIs in other threads, but I wanted to ask about supporting Ollama and/or Litellm. The reason I'm asking is that consumer hardware is often limited. My 4090 only has 24 GB of VRAM, and Tabby takes up about 10 GB when I'm using it.
Let's say I'm working on a project using Tabby in my code editor, then I want to jump to my WebUI to ask a coding question. I'd first need to ssh into my dedicated inference server, stop tabby to clear the VRAM, ask my question in the WebUI which uses Ollama as a backend, then restart tabby afterwards. This is not a realistic workflow.
The benefit of using a single backend like Ollama (or Litellm on top of Ollama), is that Ollama can dynamically switch out models on the fly, and it can queue requests. It would be much better if local LLM/AI projects supported such backends out of the box to enable more efficient management of precious VRAM. If everyone just relied on using their own separate backends, we'd never be able to make use of multiple tools at once.
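To illustrate what I mean by a single shared backend: with Ollama the model is just a field on each request, so a completion tool and a chat UI can hit the same server, and it swaps weights and queues work as needed. A rough Python sketch against Ollama's default local endpoint (the model names are placeholders):

```python
import requests

OLLAMA = "http://localhost:11434"  # Ollama's default local endpoint

def generate(model: str, prompt: str) -> str:
    """One-shot generation request; Ollama loads or swaps the model as needed."""
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# A code-completion request and a chat-style question share the same server
# and the same 24 GB of VRAM; Ollama queues them and swaps models instead of
# requiring two resident backends. Model names below are examples only.
print(generate("codellama:7b", "def fizzbuzz(n):"))
print(generate("llama3:8b", "Explain the difference between a mutex and a semaphore."))
```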
I've only been using Tabby for a day or so, and it seems like something I'd definitely like to integrate into my workflow. However, since I also rely heavily on Ollama in my current workflow, I can't really use both simultaneously without creating extra hassle.
I guess the other alternative would be the ability to unload the model with a keybinding in the text editor (I use nvim). I've seen issue 624; however, that seems to be about shutting down the whole Docker container (which is running on the same machine). Just having the option to temporarily unload the model with an API call would be more suitable, something like the sketch below.
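For reference, this is the kind of "release VRAM now" call an editor keybinding could fire. The endpoint below is purely hypothetical: Tabby doesn't expose one today (that's the ask), and the path and port are made up.

```python
import requests

TABBY = "http://localhost:8080"  # wherever the Tabby server is listening

def unload_model() -> None:
    """Ask the server to drop model weights from VRAM (hypothetical endpoint)."""
    # NOTE: /v1/admin/unload does not exist in Tabby; it stands in for the
    # feature being requested here. A matching /v1/admin/load (or a lazy
    # reload on the next completion request) would restore the model later.
    requests.post(f"{TABBY}/v1/admin/unload", timeout=10).raise_for_status()

if __name__ == "__main__":
    unload_model()
```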