
Streaming support? #345

Closed
ProjCRys opened this issue Nov 7, 2023 · 4 comments · Fixed by #1262
Labels
roadmap Planned features

Comments

ProjCRys commented Nov 7, 2023

This could be a roadmap item so that text output streams as the LLM generates the message or thought. A use case I can think of would be implementing TTS with a shorter response time (the TTS would speak each sentence as it is generated).

This would require refactoring a lot of MemGPT's code, since the LLM generally has to output JSON, but I think it could be solved by splitting the work across agents: one handles the thought, one handles the message (both could use streaming output), and another handles function calling (which doesn't necessarily need streamed text as output).

This would also make it easier for developers to build GUIs that show users the LLM's output live.
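
A rough sketch of the per-sentence TTS idea, assuming a plain streamed text source; stream_tokens and speak below are hypothetical placeholders, not MemGPT or LM Studio APIs:

import re

def speak(sentence):
    # stand-in for a real TTS call (hypothetical)
    print(f"[TTS] {sentence}")

def stream_to_tts(stream_tokens):
    buffer = ""
    for token in stream_tokens:  # tokens arrive as the LLM generates them
        buffer += token
        # flush every completed sentence (ending in . ! or ?) to TTS right away
        while (match := re.search(r"(.+?[.!?])\s+", buffer)):
            speak(match.group(1))
            buffer = buffer[match.end():]
    if buffer.strip():
        speak(buffer.strip())  # flush whatever remains at the end

# example with a fake token stream
stream_to_tts(word + " " for word in "Hello there. This is streamed text! Goodbye.".split())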

cpacker (Collaborator) commented Dec 2, 2023

This is definitely on the roadmap - it's a little tricky due to how we use structured outputs, but it's possible.
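
To illustrate why the structured outputs make this tricky (this is just a sketch, not MemGPT code): the streamed deltas are fragments of a JSON object rather than plain prose, so naively printing them would show braces and keys. One naive workaround is to accumulate the raw text and pull the user-facing field out of the partial JSON as it grows; the "message" field name below is illustrative only.

import re

buffer = ""
emitted = 0

def on_delta(delta):
    """Accumulate streamed JSON fragments and print any new characters of the
    "message" field as soon as they appear (naive: ignores escape handling)."""
    global buffer, emitted
    buffer += delta
    match = re.search(r'"message"\s*:\s*"((?:[^"\\]|\\.)*)', buffer)
    if match:
        partial = match.group(1)
        print(partial[emitted:], end="", flush=True)
        emitted = len(partial)

# example: the structured output arriving in arbitrary chunks
for chunk in ['{"thoughts": "...", "mess', 'age": "Hel', 'lo, wor', 'ld!"}']:
    on_delta(chunk)
print()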

cpacker added the roadmap (Planned features) label on Dec 2, 2023

renatokuipers commented Dec 15, 2023

If you take a look at (for example) LM Studio, there is a little snippet in there that does real-time text streaming.

# Chat with an intelligent assistant in your terminal
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

history = [
    {"role": "system", "content": "You are an intelligent assistant. You always provide well-reasoned answers that are both correct and helpful."},
    {"role": "user", "content": "Hello, introduce yourself to someone opening this program for the first time. Be concise."},
]

while True:
    completion = client.chat.completions.create(
        model="local-model", # this field is currently unused
        messages=history,
        temperature=0.7,
        stream=True,
    )

    new_message = {"role": "assistant", "content": ""}
    
    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            new_message["content"] += chunk.choices[0].delta.content

    history.append(new_message)
    
    # Uncomment to see chat history
    # import json
    # gray_color = "\033[90m"
    # reset_color = "\033[0m"
    # print(f"{gray_color}\n{'-'*20} History dump {'-'*20}\n")
    # print(json.dumps(history, indent=2))
    # print(f"\n{'-'*55}\n{reset_color}")

    print()
    history.append({"role": "user", "content": input("> ")})

In particular, note this part:

    new_message = {"role": "assistant", "content": ""}
    
    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            new_message["content"] += chunk.choices[0].delta.content

    history.append(new_message)

Maybe this is a good starting point for getting this implemented in MemGPT.

I was already looking into it myself, but I can't seem to figure it out on my own, I'm afraid...

gavsgav commented Dec 18, 2023

I have also played around with streaming text. Each LLM server has a slightly different approach to this, but the for loop is key to all of them. I think the best way to figure it out for each server is to experiment with a standalone script first: follow the relevant server's docs, and once it's confirmed working, test it with MemGPT.

@spjcontextual

I have a similar issue here with vLLM. For now my workaround might just be to wait for the full generation from MemGPT and then add a fake delay that iterates over the assistant_message output and streams it back to my client.
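
For reference, a minimal sketch of that fake-streaming workaround, assuming the complete assistant_message string is already available (the delay value is arbitrary):

import time

def fake_stream(assistant_message, delay=0.02):
    # replay the finished message chunk by chunk to mimic token streaming
    for char in assistant_message:
        yield char
        time.sleep(delay)

for chunk in fake_stream("This reply was generated all at once, then replayed."):
    print(chunk, end="", flush=True)  # or send each chunk to the client
print()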
