The UI does not set the truncation length on the OpenAI API server. #3153
Comments
Just to provide some additional info for anyone else finding this issue, here is an example from my own config-user.yaml:

```yaml
# I always load this with the kaiokendev_superhot-30b-8k-no-rlhf-test LoRA
# and set compress_pos_emb: 2, max_seq_len: 3072
CalderaAI_30B-Lazarus-GPTQ4bit:
  truncation_length: 3072
.*(7|13)b-.*-superhot-8k:
  truncation_length: 6144
.*(30|33)b-.*-superhot-8k:
  truncation_length: 3072
```

The major issue is that the built-in API and web UI both pass truncation_length as a request parameter, so it's not a problem for them. The OpenAI API has no such parameter, so we need to rely on the server updating the shared.settings['truncation_length'] value, which only happens when the model settings are loaded. Changing the truncation_length value in the UI has no effect on this server setting.
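For anyone wondering why the two code paths behave differently, here is a minimal, self-contained sketch of the logic described above; the function names and the shared_settings stand-in are illustrative only, not the real extension code:

```python
# Minimal sketch (not the actual text-generation-webui source) of why the
# built-in API/UI and the OpenAI-compatible API behave differently.

# Stand-in for shared.settings; in the real app this is populated from the
# model's config when the model loads, and the UI slider does not update it.
shared_settings = {"truncation_length": 2048}

def generate(prompt: str, truncation_length: int) -> str:
    """Hypothetical generation call; only the truncation handling matters here."""
    return f"(would truncate prompt to {truncation_length} tokens, then generate)"

def builtin_api_generate(prompt: str, params: dict) -> str:
    # The blocking API and the web UI send truncation_length with every
    # request, so whatever the client chose always wins.
    return generate(prompt, params.get("truncation_length",
                                       shared_settings["truncation_length"]))

def openai_api_generate(prompt: str, params: dict) -> str:
    # The OpenAI-compatible API has no such field, so it can only fall back
    # to the server-side value, which is set at model load time.
    return generate(prompt, shared_settings["truncation_length"])
```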
I'm confused. OpenAI is its own thing. Max sequence length already sets the length. It is up to the client to manage context, is it not? Previously I had issues with --api not allowing longer context and always cutting it to 2048. Now models work fine through SillyTavern and the built-in chat/notebook.
Neither context length nor truncation length is available in the OpenAI API; there is only max_tokens to control how many tokens to generate. It's a rather poor setup because, as a user of the OpenAI API, you can't even use the API to determine the context length; you need to look it up on the model's web page and just 'know' it when you select the model in your code.

The text-generation-webui has a value loaded from the model, shared.settings['truncation_length'], but this doesn't get set any higher if you use compression and a larger max_seq_len (it probably should), and because the blocking API and web UI use a client-side truncation_length, you don't notice that it isn't updated on the server.

This is exclusively an OpenAI API problem at this point, but the workaround is pretty simple for now: just update your models/config-user.yaml with the truncation length you want. If this doesn't get fixed another way I'll try another PR soon; the last fix sat for a few weeks and is not mergeable anymore, but I think there is probably a better fix now anyway.
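To make the "you just have to know it" point concrete, this is the kind of client-side workaround OpenAI API users end up writing today; the model names and context sizes below are examples, not authoritative values:

```python
# Hard-coded context sizes, looked up by hand, because the OpenAI API
# offers no way to ask the server for them.
CONTEXT_WINDOWS = {
    "CalderaAI_30B-Lazarus-GPTQ4bit": 3072,   # example value
    "guanaco-33b-superhot-8k": 8192,          # example value
}

def max_prompt_tokens(model: str, max_new_tokens: int) -> int:
    # max_tokens only caps the *generated* tokens; the prompt budget has to
    # be derived client-side from a context size the API never reports.
    return CONTEXT_WINDOWS[model] - max_new_tokens

print(max_prompt_tokens("CalderaAI_30B-Lazarus-GPTQ4bit", 200))  # 2872
```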
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
I'm reopening this; it's still an issue with only a workaround so far.
It would be great if this could be fixed. I added textgenwebui as a potential option for users of a Skyrim AI NPC mod, and the inability to get the proper truncation length through the OpenAI API is causing issues for them. Only hard-coding the value 4096 in completions.py, instead of referring back to shared.settings['truncation_length'], worked to fix the issue. Thanks for all your work on this @matatonic; hope you or someone else can figure this out and it gets merged.
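For anyone attempting the same fix, the change described above amounts to something like the following inside the extension's completions.py; this is a sketch based on the description in this thread, not a copy of the real file:

```python
# Before: the extension read the server-side setting, which is stale because
# changing the truncation length in the UI does not update it.
# truncation_length = shared.settings['truncation_length']

# After: pin it to the model's actual context window (4096 in this case).
truncation_length = 4096
```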
This workaround no longer works. I set the models/config-user.yaml to be:
But I get this message in the console:
I couldn't get the config-user.yaml solution to work, but hardcoding the token limit into the openai extension worked for me: #4152 (comment)
Yeah, it seems kind of busted now. Context length and instruction format are both broken right now, afaik.
Problem
The UI does not set the truncation length on the server.
Right now there is no other way for the OpenAI API or its clients to know the model's context size.
Expected behaviour
The truncation_length set in the UI should also be applied on the OpenAI API server, or perhaps be exposed through /v1/models? However, the original OpenAI API does not have this parameter, so I'm not sure setting it as a request parameter is the correct way forward. If not, please close this issue.
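As a purely hypothetical illustration of the /v1/models idea, the model listing could carry the value as extra metadata; the truncation_length field below is invented for illustration and does not exist in the real OpenAI API:

```python
# Hypothetical /v1/models payload if the server advertised its context size.
models_response = {
    "object": "list",
    "data": [
        {
            "id": "CalderaAI_30B-Lazarus-GPTQ4bit",
            "object": "model",
            "truncation_length": 3072,  # invented field: would let clients size prompts
        },
    ],
}
```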
Workaround
According to @matatonic in #3049 (comment), you can configure your models/config-user.yaml to include a line setting the truncation length for the model (or a matching pattern).
For example:
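(A minimal entry, using one of the models from the comments above as the name; adjust the model name or pattern and the length to your own setup.)

```yaml
# models/config-user.yaml
CalderaAI_30B-Lazarus-GPTQ4bit:
  truncation_length: 3072
```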
Testing
I loaded Lazarus-30b-SuperHOT-8k-GPTQ and Guanaco-33B-SuperHOT-8K-GPTQ with
```sh
python server.py \
    --extensions openai \
    --listen \
    --notebook \
    --model_type LLaMA \
    --loader exllama \
    --gpu-split 10,24 \
    --max_seq_len 8192 \
    --alpha_value 4 \
    --model-menu \
    "$@"
```
and tested in the web UI and via API.
Using the UI, the model was able to remember a secret password after about 3k tokens of context. Using the API, requests with over 2k tokens of context were handled fine by Lazarus, but I got deranged garbage from Guanaco; that is presumably an issue with the model, not the API.
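For reference, the API side of that test was just an ordinary OpenAI-style completion request; the port and model name below are assumptions about a default local setup, so adjust them to match yours:

```python
import requests

# Assumes the openai extension is reachable locally; change host/port as needed.
resp = requests.post(
    "http://127.0.0.1:5001/v1/completions",
    json={
        "model": "Lazarus-30b-SuperHOT-8k-GPTQ",
        "prompt": "<a prompt longer than 2k tokens goes here>",
        "max_tokens": 100,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["text"])
```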