
nvidia-trt: no stop request, stop response for decoupled model #17764

Closed

Conversation

mkhludnev (Contributor) commented Feb 19, 2024

• Description: detect a decoupled model and request an explicit stop (empty final) response.

• Issue: currently the client sends stop: True as an input, which Triton rejects; then, without triton_enable_empty_final_response=True, the stream hangs forever.

• Add tests and docs: if you're adding a new integration, please include

    1. a test for the integration, preferably unit tests that do not rely on network access, and
    2. an example notebook showing its use; it lives in the docs/docs/integrations directory.

• Lint and test: run make format, make lint, and make test from the root of the package(s) you've modified. See the contribution guidelines for more: https://python.langchain.com/docs/contributing/

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.

 - detect decoupled model and request explicit stop response (see the sketch below)
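At a high level, the change checks whether the target model is decoupled and, if so, asks Triton for an explicit empty final response so the client knows when the stream is done. A minimal sketch of that idea with tritonclient.grpc; the server URL, model name, input list, and callback are illustrative assumptions, not the PR's exact code:

```python
import tritonclient.grpc as grpcclient

# Assumed server URL and model name; both are illustrative.
client = grpcclient.InferenceServerClient(url="localhost:8001")
model_name = "my_decoupled_model"

# A decoupled model may return zero, one, or many responses per request, so
# the client cannot tell from the responses alone when the stream is done.
config = client.get_model_config(model_name).config
is_decoupled = config.model_transaction_policy.decoupled

def on_response(result, error):
    if error is not None:
        print(f"stream error: {error}")
        return
    params = result.get_response().parameters
    # Triton marks the explicit final response with this parameter.
    if "triton_final_response" in params and params["triton_final_response"].bool_param:
        print("stream finished")

client.start_stream(callback=on_response)
client.async_stream_infer(
    model_name=model_name,
    inputs=[],  # real input tensors omitted from this sketch
    request_id="1",
    enable_empty_final_response=is_decoupled,
)
client.stop_stream()
```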
@efriis efriis added the partner label Feb 19, 2024
@efriis efriis self-assigned this Feb 19, 2024

vercel bot commented Feb 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name       Status                Preview         Comments    Updated (UTC)
langchain  ⬜️ Ignored (Inspect)  Visit Preview               Feb 20, 2024 9:43pm

@@ -218,7 +218,7 @@ def _invoke_triton(self, model_name, inputs, outputs, stop_words):
         result_queue = StreamingResponseGenerator(
             self,
             request_id,
-            force_batch=False,
+            signal_stop=False,
mkhludnev (Contributor, Author) commented:

Renaming this to something meaningful; it gets inverted at line 405.
TL;DR: Triton doesn't let one cancel an in-flight request (triton-inference-server/server#4818), so stopping has to be handled on the client side.
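Since the server can't cancel an in-flight request, "stop" has to mean "stop consuming" on the client side. A hedged sketch of that pattern; the class shares the PR's StreamingResponseGenerator name, but its fields and methods are assumptions:

```python
import queue
from typing import Optional

class StreamingResponseGenerator:
    """Illustrative stand-in; the fields here are assumptions, not the PR's code."""

    def __init__(self, request_id: str, signal_stop: bool = False) -> None:
        self.request_id = request_id
        self.signal_stop = signal_stop  # flipped to True when the caller wants to stop
        self._queue: "queue.Queue[Optional[str]]" = queue.Queue()

    def put(self, token: Optional[str]) -> None:
        # The gRPC callback posts tokens here; None is the end-of-stream
        # sentinel posted when the empty final response arrives.
        self._queue.put(token)

    def __iter__(self) -> "StreamingResponseGenerator":
        return self

    def __next__(self) -> str:
        if self.signal_stop:
            # Triton can't cancel the in-flight request, so we simply stop
            # consuming; any remaining responses are ignored.
            raise StopIteration
        token = self._queue.get()
        if token is None:
            raise StopIteration
        return token
```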

@@ -238,6 +238,7 @@ def _invoke_triton(self, model_name, inputs, outputs, stop_words):
             inputs=inputs,
             outputs=outputs,
             request_id=request_id,
+            enable_empty_final_response=self.client.get_model_config(model_name).config.model_transaction_policy.decoupled,
mkhludnev (Contributor, Author) commented:

I want to improve this here. Maybe cache the model config, or reuse it for the model_ready call above. I also don't know how null-safe this gRPC code is.
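One possible shape for both ideas (caching the lookup, and not trusting the proto to be fully populated); the helper below is hypothetical, assuming a model's config is stable for the lifetime of the client:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _is_decoupled(client, model_name: str) -> bool:
    # Cache the result so each stream doesn't re-issue a get_model_config RPC.
    response = client.get_model_config(model_name)
    # Be defensive about partially populated responses rather than assuming
    # every nested field is set.
    config = getattr(response, "config", None)
    policy = getattr(config, "model_transaction_policy", None) if config else None
    return bool(policy and policy.decoupled)
```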

@mkhludnev mkhludnev marked this pull request as ready for review February 20, 2024 21:45
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. 🤖:improvement Medium size change to existing code to handle new use-cases labels Feb 20, 2024
efriis (Member) commented Mar 19, 2024

Should be against the https://github.com/langchain-ai/langchain-nvidia repo instead!

@efriis efriis closed this Mar 19, 2024