Support TensorRT-LLM? #12474
Comments
🤖 Hello @npuichigo, it's good to see you again, and thank you for your interest in LangChain.

As of the current version (v0.0.325), LangChain does not have any existing support or integration with NVIDIA's TensorRT or Triton Inference Server. The framework also has no dependencies on or integrations with any NVIDIA software or libraries. You can confirm this by checking the project's dependencies in the pyproject.toml file.

However, the modular architecture of LangChain allows for the integration of a wide variety of tools, including language models. This is achieved through the use of "Chains", which are reusable components that encode a sequence of calls to components like models, document retrievers, other Chains, etc. You can find more details about this in the LangChain architecture documentation.

Given this architecture, it's technically feasible to integrate TensorRT-LLM into the LangChain framework. However, the specific technical requirements and dependencies for such an integration, as well as the specific features of TensorRT-LLM that could enhance the functionality of LangChain, would need to be thoroughly evaluated.

Your suggestion to support TensorRT-LLM is valuable, and I encourage you to provide more details about the potential benefits and use cases. This would help the LangChain community to better understand the need for this integration and to evaluate its feasibility and priority.

Looking forward to your insights. Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
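To make the bot's point about modularity concrete, here is a minimal sketch of a custom LLM wrapper that posts prompts to an arbitrary TensorRT-LLM/Triton HTTP endpoint, so it can be used anywhere LangChain expects an LLM. The endpoint URL and JSON payload shape are assumptions for illustration; only the `LLM` base class and its `_call` hook come from LangChain itself.

```python
from typing import Any, List, Optional

import requests
from langchain.llms.base import LLM


class TensorRTLLMEndpoint(LLM):
    """Hypothetical wrapper around a TensorRT-LLM server's HTTP endpoint."""

    endpoint_url: str = "http://localhost:8000/generate"  # assumed address

    @property
    def _llm_type(self) -> str:
        return "tensorrt_llm_endpoint"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> str:
        # Post the prompt to the (assumed) server and return its text output.
        payload = {"prompt": prompt, "stop": stop or []}
        response = requests.post(self.endpoint_url, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["text"]


llm = TensorRTLLMEndpoint()
print(llm("Explain what TensorRT-LLM does in one sentence."))
```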
I built a small project that provides an OpenAI-compatible API in front of TensorRT-LLM. Feel free to give it a try: https://github.com/npuichigo/openai_trtllm
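Since openai_trtllm exposes an OpenAI-compatible API, one low-effort way to try it from LangChain is to point the stock OpenAI wrapper at it. A minimal sketch, assuming the proxy listens on a local port and serves a model with the name below (both are placeholders for your deployment):

```python
from langchain.llms import OpenAI

# Point LangChain's OpenAI wrapper at the OpenAI-compatible proxy
# sitting in front of Triton + TensorRT-LLM.
llm = OpenAI(
    openai_api_base="http://localhost:9000/v1",  # assumed proxy address
    openai_api_key="not-needed",                 # placeholder; a local proxy typically ignores it
    model_name="llama-2-7b",                     # assumed model name
)

print(llm("Write a haiku about inference servers."))
```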
We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/
Thanks, @fciannella. Is https://github.com/langchain-ai/langchain/tree/master/libs/partners/nvidia-trt/langchain_nvidia_trt the source code for this package?
Yes! We are still working on a bug fix before we push more documentation and announce it officially. Feel free to provide any feedback!
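For anyone who wants to try the connector early, a rough usage sketch is below. The class name, constructor parameters, and server address are inferred from the source tree linked above and may change while the package is still WIP:

```python
from langchain_nvidia_trt.llms import TritonTensorRTLLM

# Connect to a Triton server hosting a TensorRT-LLM model.
# The gRPC address and model name are placeholders for your deployment.
llm = TritonTensorRTLLM(
    server_url="localhost:8001",
    model_name="ensemble",
)

print(llm.invoke("What is TensorRT-LLM?"))
```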
Ok, thanks. I evaluated it in my environment. Here we go:
@fciannella btw, #16749 is worth a release!
UPD: Triton doesn't stop the request (triton-inference-server/server#4818); I'll come up with PR #17764.

I use this code with a vLLM model deployed with Triton. It seems OK, but it doesn't support an explicit stop signal. I had to change force_batch=True by hand; there's no argument to set it properly. Also, I have to pass … triton_final_response is always False, so the stream never ends.
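For context on the triton_final_response problem above: a decoupled Triton model is supposed to mark the end of a response stream with a final-response flag that the gRPC client can check. A rough sketch of such a check with the tritonclient library follows; the model name, input tensor name, and response-parameter layout are assumptions based on the issue discussion, not a verified fix for the connector.

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()


def on_response(result_queue, result, error):
    # Every streamed response (or error) from the decoupled model lands here.
    result_queue.put((result, error))


client = grpcclient.InferenceServerClient(url="localhost:8001")  # assumed address
client.start_stream(callback=partial(on_response, results))

text = np.array([["Tell me a joke"]], dtype=object)
text_input = grpcclient.InferInput("text_input", list(text.shape), "BYTES")  # assumed tensor name
text_input.set_data_from_numpy(text)
client.async_stream_infer(model_name="vllm_model", inputs=[text_input])  # assumed model name

while True:
    result, error = results.get()
    if error:
        raise error
    params = result.get_response().parameters
    # The server is expected to flag the last response; if it never does
    # (as reported above), this loop would wait forever without a timeout.
    if "triton_final_response" in params and params["triton_final_response"].bool_param:
        break

client.stop_stream()
```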
The following problem resides not in this code base but in the integration technology. TL;DR: Triton's REST API seems fine, but gRPC, which this code uses, seems prone to trimming the LLM response. UPD: turns out the vLLM backend accepts …
I'd rather have created an issue for discussion, but this repo doesn't have issues enabled. First of all, such prepending seems redundant, especially for long RAG prompts. Second, it's an actual problem, since I notice that Triton gRPC crops the long response. Curiously, REST doesn't crop the payload, and the full concatenation of prompt and output arrives at the client. I put more details of the issue here: langchain-ai/langchain#12474 (comment)
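As a point of comparison for the gRPC-vs-REST observation, the HTTP path can be exercised directly against Triton's generate endpoint. A minimal sketch, assuming a vLLM-backed model whose name, address, and input/output field names are placeholders for the actual deployment:

```python
import requests

# Triton's HTTP generate endpoint for a (placeholder) vLLM-backed model.
url = "http://localhost:8000/v2/models/vllm_model/generate"
payload = {
    "text_input": "Summarize the following document: ...",  # assumed input field
    "stream": False,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])  # assumed output field
```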
Hi, are there any updates on when this integration will be released, please?
Feature request
Any plan to support NVIDIA's latest TensorRT-LLM, maybe via a Triton Inference Server backend?
Motivation
New integration
Your contribution
Test