Feature Requests: Use llamafile / OpenAI compatible / API? #18
Replies: 1 comment
-
I'm presently working on adding support for a new, self-developed backend: HF-Waitress. It adds support for HF-Transformers and AWQ-quantized models directly from the Hugging Face Hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto. It also removes the need to manually download LLMs yourself: you simply supply the model name and it does the rest. It works out of the box with no setup necessary, and provides concurrency and streaming responses, all within a single platform-agnostic Python script that can be ported anywhere. It will soon be the default LLM loader in LARS!

As Ollama is another implementation of llama.cpp, explicit support for it is not planned at this time, though I recognize the benefits. llama.cpp will be retained in LARS as a user-electable alternative to HF-Waitress for GGUF models, primarily due to its advantage of hybrid (CPU+GPU) inferencing. You'll be able to bring in your own GGUFs the same as today.

OpenAI support is not planned at this time, as LARS remains open-source and local-deployment centric. However, code to make OpenAI work is already in the LARS codebase, so if an official engagement necessitates it, I will work on enabling it. In the meanwhile, community contributions for these features are absolutely welcome, as always!
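To illustrate the kind of loading described above (this is not the actual HF-Waitress code, just a minimal sketch of the approach using the `transformers` library): a model is pulled from the Hub by name, quantized on the fly with BitsAndBytes, and its output is streamed token by token. The model ID is a placeholder, and a CUDA GPU plus `accelerate`/`bitsandbytes` are assumed.

```python
# Sketch only: load a Hub model by name, quantize on the fly, stream the reply.
from threading import Thread

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextIteratorStreamer,
)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder; any Hub model ID

# 4-bit on-the-fly quantization: no pre-quantized download needed
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # requires accelerate; places layers automatically
)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("What does RAG stand for?", return_tensors="pt").to(model.device)

# generate() blocks, so run it in a thread and consume the streamer as tokens arrive
thread = Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128),
)
thread.start()
for token_text in streamer:
    print(token_text, end="", flush=True)
thread.join()
```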
-
https://github.com/Mozilla-Ocho/llamafile
It's faster than vanilla llama.cpp on CPU.
Better yet, just add an OpenAI-compatible backend for the flexibility to use any API.
Great RAG project! Also, any plan for an API?
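For context on what an OpenAI-compatible backend buys you: llamafile (like llama.cpp's server) exposes an OpenAI-compatible endpoint, so a generic client can be pointed at it just by overriding the base URL. A rough sketch with the `openai` Python client follows; the port, API key and model name are llamafile's defaults / placeholders, not LARS code.

```python
# Sketch: talk to a local llamafile server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llamafile's default local endpoint
    api_key="not-needed-locally",         # the local server ignores the key
)

response = client.chat.completions.create(
    model="LLaMA_CPP",  # llamafile accepts an arbitrary model name
    messages=[{"role": "user", "content": "Summarize this document in one sentence."}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```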