Avoid prepending output with prompt
I'd rather have created an issue for discussion, but this repo doesn't have issues enabled.
First of all, prepending the prompt to the output seems redundant, especially for long RAG prompts. Beyond that, it causes a real problem: I noticed that Triton's gRPC endpoint truncates the long response. Curiously, the REST endpoint doesn't truncate the payload, and the full concatenation of prompt and output arrives at the client.

More details on the issue are in langchain-ai/langchain#12474 (comment)
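
For context, here is a minimal client-side sketch of what this change means in practice. It assumes the standard tritonclient gRPC API and the vLLM backend's "text_input"/"text_output" tensor names; the server URL, model name, and prompt are placeholders:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to Triton's gRPC endpoint (address is a placeholder).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Send a single prompt as a BYTES tensor.
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["<long RAG prompt>"], dtype=np.object_))

result = client.infer(model_name="vllm_model", inputs=[text_input])

# Before this commit each entry was prompt + generated text; after it,
# each entry is the generated text only.
text_output = [t.decode("utf-8") for t in result.as_numpy("text_output")]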
mkhludnev authored Feb 5, 2024
1 parent 52c1c3c commit 38267e1
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/model.py
@@ -213,7 +213,7 @@ def create_response(self, vllm_output):
         """
         prompt = vllm_output.prompt
         text_outputs = [
-            (prompt + output.text).encode("utf-8") for output in vllm_output.outputs
+            output.text.encode("utf-8") for output in vllm_output.outputs
         ]
         triton_output_tensor = pb_utils.Tensor(
             "text_output", np.asarray(text_outputs, dtype=self.output_dtype)
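
Two side notes on the patched lines: prompt is still assigned but no longer read in the hunk shown, so if nothing further down in create_response uses it, the assignment could be dropped in a follow-up. And a client that relied on the old echoed behavior can reconstruct it cheaply on its side; continuing the hypothetical client sketch above:

# Re-create the pre-commit behavior client-side if the echo is still wanted.
prompt = "<long RAG prompt>"
echoed = [prompt + text for text in text_output]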
