
Explore integration options with Ollama and other backends #6

Open
mcharytoniuk opened this issue Jun 5, 2024 · 5 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@mcharytoniuk (Member)

llama.cpp exposes the /health endpoint, which makes it easy to deal with slots. What about other similar solutions?
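
A minimal sketch of what polling that endpoint can look like, assuming a llama.cpp server on 127.0.0.1:8080; the exact JSON fields (e.g. slots_idle / slots_processing) have varied across llama.cpp versions, so treat the field names here as assumptions:

```python
# Minimal sketch: poll a llama.cpp server's /health endpoint.
# The base URL and the slot-related field names are assumptions for illustration;
# newer llama.cpp builds may expose per-slot detail via /slots instead.
import requests

LLAMACPP_URL = "http://127.0.0.1:8080"  # assumed llama.cpp server address

def check_health(base_url: str) -> dict:
    # GET /health returns a small JSON status document
    resp = requests.get(f"{base_url}/health", timeout=2)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    status = check_health(LLAMACPP_URL)
    print("status:", status.get("status"))
    print("idle slots:", status.get("slots_idle"))        # may be absent on newer builds
    print("busy slots:", status.get("slots_processing"))  # may be absent on newer builds
```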

mcharytoniuk added the "good first issue" label on Jun 5, 2024
@aiseei commented Aug 19, 2024

Hi @mcharytoniuk, thanks for this interesting project! We use a combination of llama.cpp server and Ollama, both running in Docker containers, and have implemented our own Python-based proxy/load balancer. We are looking to move to something specialized like Paddler. Can we do this today with Paddler?

@mcharytoniuk (Member, Author) commented Aug 20, 2024

@aiseei Thank you for reaching out!

You can absolutely use Paddler with your llama.cpp setup in production. Personally, I am using it with llama.cpp in Auto Scaling groups.

When it comes to Ollama, not at the moment.

The issue is that Ollama potentially starts and manages multiple llama.cpp servers internally on its own, and it does not expose some of llama.cpp's internal endpoints (like /health: ollama/ollama#1378) and statuses. Currently, it does not allow hooking into the llama.cpp APIs that Paddler requires to function.

I might try to get it to work with just the OpenAI-style endpoints if there is some interest in having Ollama integration, though. However, that would have some limitations compared to balancing based on slots (slot counts let us predict how many requests a server can handle at most, which makes buffering predictable). Do you think that would be OK for your use case?
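
For illustration, here is a rough Python sketch of the difference between slot-aware balancing and blind round-robin (this is not Paddler's actual implementation; the Upstream type, its free_slots field, and the addresses are assumptions for this example):

```python
# Illustrative sketch only: prefer the upstream with the most free slots when slot
# counts are available, otherwise fall back to plain round-robin (as with backends
# that only expose OpenAI-style endpoints and report no slot information).
import itertools
from dataclasses import dataclass
from typing import Optional

@dataclass
class Upstream:
    url: str
    free_slots: Optional[int] = None  # None: backend exposes no slot information

_round_robin = itertools.count()

def choose_upstream(upstreams: list[Upstream]) -> Upstream:
    slot_aware = [u for u in upstreams if u.free_slots is not None]
    if slot_aware:
        # With slot counts we know how much spare capacity each server has left,
        # so requests can be buffered or rejected predictably.
        return max(slot_aware, key=lambda u: u.free_slots)
    # Without slot counts we can only distribute requests blindly.
    return upstreams[next(_round_robin) % len(upstreams)]

# Example: two llama.cpp servers reporting free slots, one backend that does not.
servers = [
    Upstream("http://10.0.0.1:8080", free_slots=2),
    Upstream("http://10.0.0.2:8080", free_slots=5),
    Upstream("http://10.0.0.3:11434"),
]
print(choose_upstream(servers).url)  # -> http://10.0.0.2:8080
```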

@mcharytoniuk (Member, Author)

@aiseei I think I have a few ideas on how to handle the issue. I will add support for Ollama and other OpenAI-style APIs to Paddler. See also: #18

mcharytoniuk reopened this on Aug 21, 2024
mcharytoniuk added the "enhancement" label on Aug 21, 2024
@aiseei commented Sep 5, 2024

@mcharytoniuk Hi, sorry for the late reply. Yes, supporting the OpenAI API style would work. By the way, I came across this issue today: ollama/ollama#6492. It might be relevant since you support Ollama.

@mcharytoniuk (Member, Author)

Bringing up issues and news like that helps me maintain the package; it makes it easier for me to follow what is relevant in the ecosystem. Thank you!
