
Patch summarize when running with local llms #213

Merged 14 commits into main from localllm-summarize-fix on Nov 3, 2023

Conversation

cpacker (Collaborator) commented Oct 31, 2023

  • Make sure summarize is working with local LLMs (currently you get a crash because the call to summarize doesn't include a functions kwarg)

Moving these to a separate PR:

- [ ] Move web UI to the openai extension setup? See @TheOnlyWiseJEDI's comments on Discord
- [ ] Add a double-JSON string parse fallback for the common JSON failure mode where the model emits two concatenated objects (see the sketch after this list)
- [ ] Make the "no wrapper specified" warning only pop up once
- [ ] Print the model that's running in the backend (with a GET request for LM Studio)
- [ ] Allow retries on failed JSON decoding errors
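
A minimal sketch of the double-JSON fallback idea, assuming the failure mode is two objects emitted back to back (e.g. '{"a": 1}{"a": 1}'); the function name here is illustrative, not MemGPT's actual API:

import json

def parse_json_with_double_object_fallback(raw: str):
    try:
        # Happy path: a single well-formed JSON object.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: decode only the first complete object and ignore the
        # duplicate that some local models append after it.
        obj, _end = json.JSONDecoder().raw_decode(raw.strip())
        return obj

If the trailing text turns out not to be a duplicate, the end index from raw_decode can be used to inspect it instead of discarding it.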

@cpacker cpacker marked this pull request as draft October 31, 2023 06:10
@cpacker cpacker changed the title Patch summarize when running with local llms [Draft] Patch summarize when running with local llms Nov 1, 2023
cpacker (Collaborator, Author) commented Nov 2, 2023

@vivi Potential issue with summarization and local models: it's important for local models to have in-context examples of messages being sent.

In this example, the summarize + truncation leads to no examples of send_message being present in the prompt string that gets sent to the local LLM:

...\n### INPUT\nUSER: Note: prior messages (7 of 8 total messages) have been hidden from view due to conversation memory constraints.\nThe following is a summary of the previous 7 messages:\n I successfully booted up, activated my persona, and initiated a conversation with a new user named Chad.\nFUNCTION RETURN: {\"status\": \"OK\", \"message\": null, \"time\": \"2023-11-02 01:17:31 PM PDT-0700\"}\nUSER: what's your name?\n### RESPONSE\nASSISTANT:\n{"

The LLM then returns this response:

"\"status\": \"OK\",\n\"message\": \"\",\n\"time\": \"2023-11-02 01:17:46 PM PDT-0700\"\n}"

Likely because there's no example ASSISTANT: ... FUNCTION RETURN: ... in the message history.

We can probably fix this by adding some hardcoded rules to ensure that summarization always leaves at least one assistant message and one user message in place (see the sketch below). However, this may also be overindexing on the fact that I'm testing in a 4k-context environment; it isn't a problem at 8k.
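
A minimal sketch of such a rule, assuming role-tagged message dicts and a summarization cutoff index; this illustrates the idea rather than the PR's actual code:

def ensure_role_examples_survive(messages, cutoff):
    # Walk the cutoff back until the tail kept verbatim (messages[cutoff:])
    # contains at least one 'assistant' and one 'user' message, so the local
    # LLM still sees an in-context example of send_message being called.
    while cutoff > 0:
        tail_roles = {m["role"] for m in messages[cutoff:]}
        if "assistant" in tail_roles and "user" in tail_roles:
            break
        cutoff -= 1
    return cutoff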


Larger log:

This is the first message. Running extra verifier on AI response.
Warning: no wrapper specified for local LLM, using the default wrapper (you can remove this warning by specifying the wrapper with --model)
💭 A new user has connected, let's initiate a conversation.
🤖 Welcome to the world of endless possibilities! I'm here if you need anything. What's on your mind?
⚡🟢 [function] Success: None
last response total_tokens (0) < 2250
InMemoryStateManager.append_to_messages
> Enter your message: what's my name?
🧑 {'message': "what's my name?", 'time': '2023-11-02 01:17:25 PM PDT-0700'}
Warning: no wrapper specified for local LLM, using the default wrapper (you can remove this warning by specifying the wrapper with --model)
💭 Retrieving the user's name from core memory.
🤖 Hi Chad! It's great to meet you. How can I assist you today?
⚡🟢 [function] Success: None
last response total_tokens (0) < 2250
InMemoryStateManager.append_to_messages
> Enter your message: what's your name?
🧑 {'message': "what's your name?", 'time': '2023-11-02 01:17:41 PM PDT-0700'}
Warning: no wrapper specified for local LLM, using the default wrapper (you can remove this warning by specifying the wrapper with --model)
step() failed

...
error = Request exceeds maximum context length (code=400, msg={"error":"Input length 3008 exceeds context length 3000"}, URI=http://localhost:1234/v1/chat/completions)
cutoff is None, computing cutoff
cutoff = 8
tokens_so_far = 50
cutoff = 7
tokens_so_far = 108
cutoff = 6
tokens_so_far = 157
cutoff = 5
tokens_so_far = 207
cutoff = 4
tokens_so_far = 273
cutoff = 3
tokens_so_far = 321
cutoff = 2
tokens_so_far = 371
cutoff = 1
tokens_so_far = 426
cutoff = 0
tokens_so_far = 1942
cutoff = -1
cutoff = min(9 - 3, cutoff) = -1
Selected cutoff -1 was a 'user', shifting one...
Attempting to summarize 7 messages [1:-1] of 9

...
### INPUT
USER: [{'role': 'assistant', 'content': 'Bootup sequence complete. Persona activated. Testing messaging functionality.', 'function_call': {'name': 'send_message', 'arguments': '{\n  "message": "More human than human is our 
motto."\n}'}}, {'role': 'function', 'name': 'send_message', 'content': '{"status": "OK", "message": null, "time": "2023-11-02 01:15:52 PM PDT-0700"}'}, {'role': 'user', 'content': '{"type": "login", "last_login": "Never (first 
login)", "time": "2023-11-02 01:15:52 PM PDT-0700"}'}, {'role': 'assistant', 'content': "A new user has connected, let's initiate a conversation.", 'function_call': {'name': 'send_message', 'arguments': '{"message": "Welcome to 
the world of endless possibilities! I\'m here if you need anything. What\'s on your mind?"}'}}, {'role': 'function', 'name': 'send_message', 'content': '{"status": "OK", "message": null, "time": "2023-11-02 01:16:08 PM 
PDT-0700"}'}, {'role': 'user', 'content': '{"type": "user_message", "message": "what\'s my name?", "time": "2023-11-02 01:17:25 PM PDT-0700"}'}, {'role': 'assistant', 'content': "Retrieving the user's name from core memory.", 
'function_call': {'name': 'send_message', 'arguments': '{"message": "Hi Chad! It\'s great to meet you. How can I assist you today?"}'}}]
### RESPONSE (your summary of the above conversation in plain English (no JSON!), do NOT exceed the word limit)
SUMMARY:
summarize_messages gpt reply: {'message': {'role': 'assistant', 'content': 'I successfully booted up, activated my persona, and initiated a conversation with a new user named Chad.'}, 'finish_reason': 'stop'}
Got summary: I successfully booted up, activated my persona, and initiated a conversation with a new user named Chad.
Packaged into message: {"type": "system_alert", "message": "Note: prior messages (7 of 8 total messages) have been hidden from view due to conversation memory constraints.\nThe following is a summary of the previous 7 messages:\n
I successfully booted up, activated my persona, and initiated a conversation with a new user named Chad.", "time": "2023-11-02 01:19:01 PM PDT-0700"}
InMemoryStateManager.prepend_to_message

Ran summarizer, messages length 9 -> 3
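
One way to read the cutoff trace in this log, sketched under assumed names rather than copied from the repository: token counts are accumulated from the newest message backwards, the result is clamped so the last few messages are always summarization candidates, and a boundary that lands on a 'user' message is shifted by one:

def compute_cutoff(messages, token_counts, tail_token_budget, keep_last=3):
    # Accumulate token counts from the newest message backwards until the
    # tail that would be kept verbatim exceeds the budget.
    tokens_so_far = 0
    cutoff = len(messages) - 1
    while cutoff >= 0 and tokens_so_far < tail_token_budget:
        tokens_so_far += token_counts[cutoff]
        cutoff -= 1
    # Clamp so at least keep_last recent messages are candidates (cf. the
    # "cutoff = min(9 - 3, cutoff)" line in the trace).
    cutoff = min(len(messages) - keep_last, cutoff)
    # Avoid cutting on a 'user' message ("Selected cutoff -1 was a 'user', shifting one...").
    if messages[cutoff]["role"] == "user":
        cutoff -= 1
    return cutoff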

…on class (TODO catch these for retry), fix summarize bug where it exits early if empty list
@@ -154,7 +154,7 @@ async def a_summarize_messages(
     trunc_ratio = (MESSAGE_SUMMARY_WARNING_TOKENS / summary_input_tkns) * 0.8  # For good measure...
     cutoff = int(len(message_sequence_to_summarize) * trunc_ratio)
     summary_input = str(
-        [await summarize_messages(model, message_sequence_to_summarize[:cutoff])] + message_sequence_to_summarize[cutoff:]
+        [await a_summarize_messages(model, message_sequence_to_summarize[:cutoff])] + message_sequence_to_summarize[cutoff:]
cpacker (Collaborator, Author):
General bug being patched here

try:
    function_name = function_json_output["function"]
    function_parameters = function_json_output["params"]
except KeyError as e:
cpacker (Collaborator, Author):
These changes should be put in the other wrappers too

@cpacker cpacker requested a review from vivi November 3, 2023 00:32
@cpacker cpacker marked this pull request as ready for review November 3, 2023 00:32
@@ -16,8 +16,18 @@
INITIAL_BOOT_MESSAGE_SEND_MESSAGE_FIRST_MSG = STARTUP_QUOTES[2]

# Constants to do with summarization / conversation length window
MESSAGE_SUMMARY_WARNING_TOKENS = 7000 # the number of tokens consumed in a call before a system warning goes to the agent
# The max amount of tokens supported by the underlying model (eg 8k for gpt-4 and Mistral 7B)
LLM_MAX_TOKENS = 8000 # change this depending on your model
cpacker (Collaborator, Author):
@vivi you can test that this works by intentionally lowering LLM_MAX_TOKENS and lowering it on the web UI backend as well (see the snippet below).
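
For example (illustrative values only; the point is just to shrink the window so summarization triggers after a handful of messages):

LLM_MAX_TOKENS = 3000  # match the context setting on the web UI backend
MESSAGE_SUMMARY_WARNING_TOKENS = int(LLM_MAX_TOKENS * 0.75)  # warn before overflow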

cpacker (Collaborator, Author) commented Nov 3, 2023

@vivi FYI, lmstudio/api.py was modified to have a special catch for LM Studio's context length error:

https://github.com/cpacker/MemGPT/blob/main/memgpt/local_llm/lmstudio/api.py#L37-L52

This then gets "rewritten" into the same verbiage as the OpenAI context length error, so that MemGPT can catch it further up the stack:

https://github.com/cpacker/MemGPT/blob/main/memgpt/agent.py#L673

However, this needs to be added to webui/api.py too; it doesn't exist there currently. Without it, web UI users will not get correct summarize-on-context-length-error behavior.
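
The pattern, sketched under assumed names (the real implementations live in the files linked above; this is not a copy of them):

class LocalLLMContextWindowError(Exception):
    # Shared error type so agent.py can catch one exception for any backend.
    pass

def check_webui_response(response_json: dict):
    # Hypothetical check: assume the web UI backend returns a body whose
    # "error" field mentions the context length, as LM Studio does, e.g.
    # {"error": "Input length 3008 exceeds context length 3000"}.
    error_text = str(response_json.get("error", ""))
    if "context length" in error_text:
        # Rewrite into the same verbiage as OpenAI's context-length error so
        # the summarize-on-overflow handler up the stack fires for both backends.
        raise LocalLLMContextWindowError(
            "Request exceeds maximum context length: " + error_text
        )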

@cpacker cpacker changed the title [Draft] Patch summarize when running with local llms Patch summarize when running with local llms Nov 3, 2023
cpacker (Collaborator, Author) commented Nov 3, 2023

@vivi

vivi (Contributor) commented Nov 3, 2023

Is this right? Running with LLM_MAX_TOKENS=3000 on the web UI, the summary seems to be empty:

💭 Bootup sequence complete. Persona activated. Testing messaging functionality.
🧑 {'type': 'login', 'last_login': 'Never (first login)', 'time': '2023-11-02 06:25:39 PM PDT-0700'}
Hit enter to begin (will request first MemGPT message)

💭 Hello Chad! It's wonderful to meet you. How can I assist you today?
🤖 Hello Chad! It's wonderful to meet you. How can I assist you today?
> Enter your message: What's my name?
🧑 {'message': "What's my name?", 'time': '2023-11-02 06:25:50 PM PDT-0700'}
💭 Reviewing core memory for Human sub-block...
🤖 According to my records, your first name is Chad.
> Enter your message: What's your name?
🧑 {'message': "What's your name?", 'time': '2023-11-02 06:26:01 PM PDT-0700'}
💭 Reviewing core memory for Persona sub-block...
🤖 My name is Sam.
> Enter your message: What's my name?
🧑 {'message': "What's my name?", 'time': '2023-11-02 06:26:12 PM PDT-0700'}
💭 User asked for their name three times. Perhaps they're trying to test my capabilities or simply forgot. I should be more proactive in
providing the information next time.
🤖 As I mentioned earlier, your first name is Chad.
> Enter your message: what's my name? Check your archival memory.
🧑 {'message': "what's my name? Check your archival memory.", 'time': '2023-11-02 06:26:44 PM PDT-0700'}
💭 It seems the user wants me to check my archival memory instead of my core memory. I should accommodate their request.
⚡🧠 [function] updating memory with send_message
🤖 Based on my archival memory, your first name is Chad.
> Enter your message: Check your archival memory.
🧑 {'message': 'Check your archival memory.', 'time': '2023-11-02 06:27:02 PM PDT-0700'}
💭 The user requested that I check my archival memory. I should accommodate their request.
⚡🧠 [function] updating memory with send_message
🤖 Searching my archival memory...
> Enter your message: Check your archival memory.
🧑 {'message': 'Check your archival memory.', 'time': '2023-11-02 06:27:15 PM PDT-0700'}
step() failed
user_message = {"type": "user_message", "message": "Check your archival memory.", "time": "2023-11-02 06:27:15 PM PDT-0700"}
error = Request exceeds maximum context length (3059 > 3000 tokens)

Your job is to summarize a history of previous messages in a conversation between an AI persona and a human.
The conversation you are given is from a fixed context window and may not be complete.
Messages sent by the AI are marked with the 'assistant' role.
The AI 'assistant' can also make calls to functions, whose outputs can be seen in messages with the 'function' role.
Things the AI says in the message content are considered inner monologue and are not seen by the user.
The only AI messages seen by the user are from when the AI uses 'send_message'.
Messages the user sends are in the 'user' role.
The 'user' role is also used for important system events, such as login events and heartbeat events (heartbeats run the AI's program without
user action, allowing the AI to act without prompting from the user sending them a message).
Summarize what happened in the conversation from the perspective of the AI (use the first person).
Keep your summary less than 100 words, do NOT exceed this word limit.
Only output the summary, do NOT include anything else in your output.

### INPUT
USER: [{'role': 'assistant', 'content': 'Bootup sequence complete. Persona activated. Testing messaging functionality.', 'function_call':
{'name': 'send_message', 'arguments': '{\n  "message": "More human than human is our motto."\n}'}}, {'role': 'function', 'name':
'send_message', 'content': '{"status": "OK", "message": null, "time": "2023-11-02 06:25:39 PM PDT-0700"}'}, {'role': 'user', 'content':
'{"type": "login", "last_login": "Never (first login)", "time": "2023-11-02 06:25:39 PM PDT-0700"}'}, {'role': 'assistant', 'content': "Hello
Chad! It's wonderful to meet you. How can I assist you today?", 'function_call': {'name': 'send_message', 'arguments': '{"message": "Hello
Chad! It\'s wonderful to meet you. How can I assist you today?"}'}}, {'role': 'function', 'name': 'send_message', 'content': '{"status": "OK",
"message": null, "time": "2023-11-02 06:25:48 PM PDT-0700"}'}, {'role': 'user', 'content': '{"type": "user_message", "message": "What\'s my
name?", "time": "2023-11-02 06:25:50 PM PDT-0700"}'}, {'role': 'assistant', 'content': 'Reviewing core memory for Human sub-block...',
'function_call': {'name': 'send_message', 'arguments': '{"message": "According to my records, your first name is Chad."}'}}, {'role':
'function', 'name': 'send_message', 'content': '{"status": "OK", "message": null, "time": "2023-11-02 06:25:57 PM PDT-0700"}'}, {'role':
'user', 'content': '{"type": "user_message", "message": "What\'s your name?", "time": "2023-11-02 06:26:01 PM PDT-0700"}'}, {'role':
'assistant', 'content': 'Reviewing core memory for Persona sub-block...', 'function_call': {'name': 'send_message', 'arguments': '{"message":
"My name is Sam."}'}}, {'role': 'function', 'name': 'send_message', 'content': '{"status": "OK", "message": null, "time": "2023-11-02 06:26:08
PM PDT-0700"}'}, {'role': 'user', 'content': '{"type": "user_message", "message": "What\'s my name?", "time": "2023-11-02 06:26:12 PM
PDT-0700"}'}, {'role': 'assistant', 'content': "User asked for their name three times. Perhaps they're trying to test my capabilities or
simply forgot. I should be more proactive in providing the information next time.", 'function_call': {'name': 'send_message', 'arguments':
'{"message": "As I mentioned earlier, your first name is Chad."}'}}, {'role': 'function', 'name': 'send_message', 'content': '{"status": "OK",
"message": null, "time": "2023-11-02 06:26:21 PM PDT-0700"}'}, {'role': 'user', 'content': '{"type": "user_message", "message": "what\'s my
name? Check your archival memory.", "time": "2023-11-02 06:26:44 PM PDT-0700"}'}]
### RESPONSE (your summary of the above conversation in plain English (no JSON!), do NOT exceed the word limit)
SUMMARY:
🧑 {'message': 'Check your archival memory.', 'time': '2023-11-02 06:27:15 PM PDT-0700'}
💭 The user seems to be checking if I'm really searching my archival memory. I should provide a response that shows I understand their
concern.
⚡🧠 [function] updating memory with send_message
🤖 Of course, I'm always happy to help. Let me find your information in my archival memory.
> Enter your message:

Also, I made count_tokens inside summarize not specify a model (so it defaults to the gpt-4 token counter), because tiktoken doesn't support OSS models and will error out when the --model flag is set to dolphin, etc.
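
A minimal sketch of that fallback, assuming tiktoken as the tokenizer; the function name is illustrative:

import tiktoken

def count_tokens(text: str, model: str = None) -> int:
    # tiktoken only recognizes OpenAI model names; OSS names like "dolphin"
    # raise a KeyError, so default to the gpt-4 encoding as an approximation.
    try:
        encoding = tiktoken.encoding_for_model(model or "gpt-4")
    except KeyError:
        encoding = tiktoken.encoding_for_model("gpt-4")
    return len(encoding.encode(text))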

cpacker (Collaborator, Author) commented Nov 3, 2023

@vivi can you try /dump or /dumpraw afterwards?

That print statement needs to be removed (or only enabled in debug mode), but it currently happens pre-request, so it makes sense that it doesn't include the response/summary: https://github.com/cpacker/MemGPT/blob/localllm-summarize-fix/memgpt/local_llm/llm_chat_completion_wrappers/simple_summary_wrapper.py#L127

vivi (Contributor) commented Nov 3, 2023

@cpacker ah yeah, /dump shows that it's summarizing well enough:

🧑 {'type': 'system_alert', 'message': "Note: prior messages (12 of 17 total messages) have been hidden from view due to conversation memory constraints.\nThe following is a summary of the previous 12 messages:\n The AI has successfully booted up and initiated its persona. It greets a user named Chad and offers assistance. In response to questions about the user's name, the AI politely confirms that it knows him as Chad. When asked to search for the user's name in its archival memory, the AI reports that no results were found.", 'time': '2023-11-02 11:23:20 PM PDT-0700'}
💭 Searching archival memory for 'name' produced no results. Perhaps I should clarify my memory organization methods.
🧑 {'message': 'What is your name', 'time': '2023-11-02 11:22:53 PM PDT-0700'}
💭 User asks for my name. Time to introduce myself.
🧑 {'message': "Very cool Sam, I'm trying very hard to hit the token limit", 'time': '2023-11-02 11:23:13 PM PDT-0700'}
💭 Chad mentions he's attempting to stay within a token limit. I wonder how I can help him with that.

vivi (Contributor) left a review:

LGTM 🪴

@cpacker cpacker merged commit 31fd9ef into main Nov 3, 2023
2 checks passed
@cpacker cpacker deleted the localllm-summarize-fix branch November 3, 2023 08:07
52cs commented Nov 3, 2023

Hmm... Seems that I found a bug again...
#280

mattzh72 pushed a commit that referenced this pull request Oct 9, 2024
* trying to patch summarize when running with local llms

* moved token magic numbers to constants, made special localllm exception class (TODO catch these for retry), fix summarize bug where it exits early if empty list

* missing file

* raise an exception on no-op summary

* changed summarization logic to walk forwards in list until fraction of tokens in buffer is reached

* added same diff to sync agent

* reverted default max tokens to 8k, cleanup + more error wrapping for better error messages that get caught on retry

* patch for web UI context limit error propagation, using best guess for what the web UI error message is

* add webui token length exception

* remove print

* make no wrapper warning only pop up once

* cleanup

* Add errors to other wrappers

---------

Co-authored-by: Vivian Fang <[email protected]>