The UI does not set the truncation length on the OpenAI API server. #3153

Closed
atisharma opened this issue Jul 15, 2023 · 9 comments

@atisharma

atisharma commented Jul 15, 2023

Problem

The UI does not set the truncation length on the server.
Right now the OpenAI API extension and its clients have no way to know the model's context size.

Expected behaviour

The truncation_length should:

  • be set correctly at loading and then be respected (currently the case)
  • be specified as an API parameter in the request
  • ideally be returned in the model information so the client can plan appropriately, perhaps at /v1/models?

However, the original OpenAI API does not have this parameter, so I'm not sure setting it as a request parameter is the correct way forward. If not, please close this issue.
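
To make the /v1/models idea concrete, here is a rough sketch of what a client could do if that endpoint ever exposed a context-length field (it does not today, in either the official OpenAI API or the extension; the field name and the port below are assumptions for illustration):

import requests

# Hypothetical: query the local openai extension's /v1/models endpoint and look
# for a context-length field. "truncation_length" as a field name and port 5001
# are assumptions, not existing behaviour.
models = requests.get("http://127.0.0.1:5001/v1/models").json()
first = models["data"][0]
context_length = first.get("truncation_length", 2048)  # fall back to a safe default
print(f"Model {first['id']} -> planning around a {context_length}-token window")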

Workaround

According to @matatonic in #3049 (comment), you can configure your models/config-user.yaml to include a line setting the truncation length for the model (or a pattern).

For example:

.*superhot-8k:
  truncation_length: 8192
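
For what it's worth, the keys in config-user.yaml are patterns matched against the model name; here is a rough illustration (not the webui's actual code) of how such a pattern would apply, assuming case-insensitive matching (which is the only way the example above matches the mixed-case model folder names):

import re

# Rough illustration: match a config-user.yaml pattern against a model name.
pattern = r".*superhot-8k"
model_name = "Guanaco-33B-SuperHOT-8K-GPTQ"

if re.match(pattern, model_name, flags=re.IGNORECASE):
    model_settings = {"truncation_length": 8192}
    print(f"{model_name}: {model_settings}")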

Testing

I loaded Lazarus-30b-SuperHOT-8k-GPTQ and Guanaco-33B-SuperHOT-8K-GPTQ with

python server.py \
    --extensions openai \
    --listen \
    --notebook \
    --model_type LLaMA \
    --loader exllama \
    --gpu-split 10,24 \
    --max_seq_len 8192 \
    --alpha_value 4 \
    --model-menu \
    "$@"

and tested in the web UI and via API.

Using the UI, the model was able to remember a secret password after about 3k tokens. Using the API, requests with over 2k of context were handled fine by Lazarus, but I got deranged garbage from Guanaco. That is presumably an issue with the model, not the API.
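
For reference, the API side of that test looks roughly like this (a sketch assuming the openai extension's default port at the time, 5001; adjust the URL for your setup):

import requests

# Build a prompt long enough to push past the default 2048-token truncation.
long_context = "The secret password is 'swordfish'. " + "Filler sentence about nothing in particular. " * 400
prompt = long_context + "\nWhat is the secret password?"

resp = requests.post(
    "http://127.0.0.1:5001/v1/completions",
    json={"prompt": prompt, "max_tokens": 64, "temperature": 0.7},
    timeout=300,
)
print(resp.json()["choices"][0]["text"])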

@matatonic
Contributor

Just to provide some additional info for anyone else finding this issue, here is an example from my own config-user.yaml:

# I always load this with the kaiokendev_superhot-30b-8k-no-rlhf-test Lora
# and set compress_pos_emb: 2, max_seq_len: 3072
CalderaAI_30B-Lazarus-GPTQ4bit:
  truncation_length: 3072
.*(7|13)b-.*-superhot-8k:
  truncation_length: 6144
.*(30|33)b-.*-superhot-8k:
  truncation_length: 3072

The major issue is that the built-in API and web UI both pass truncation_length as a request parameter, so it's not a problem for them, but the OpenAI API has no such parameter, so we need to rely on the server-side shared.settings['truncation_length'] value, which is only updated when the model settings are loaded. Changing the truncation_length value in the UI has no effect on this server setting.
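
To spell out that difference, compare the two request shapes (a sketch assuming the default ports of the time, 5000 for the blocking API and 5001 for the openai extension):

import requests

PROMPT = "Once upon a time"

# Built-in blocking API: the request itself can carry truncation_length,
# so each client controls truncation per request.
requests.post("http://127.0.0.1:5000/api/v1/generate", json={
    "prompt": PROMPT,
    "max_new_tokens": 200,
    "truncation_length": 8192,
})

# OpenAI-compatible API: the schema has no truncation field at all, so the
# server falls back on shared.settings['truncation_length'].
requests.post("http://127.0.0.1:5001/v1/completions", json={
    "prompt": PROMPT,
    "max_tokens": 200,
})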

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 15, 2023

I'm confused. OpenAI is its own thing. Max sequence length already sets the length. It is up to the client to manage context, is it not?

Previously I had issues with --api not allowing longer context and always cutting it to 2048. Now models work fine through silly tavern and the built in chat/notebook.

@matatonic
Contributor

Neither context length nor truncation length is available in the OpenAI API; there is only max_tokens to control how many tokens to generate.

It's a rather poor setup, because as a user of the OpenAI API you can't even query the API to determine the context length; you need to look it up on the model's web page and just 'know' it when you select the model in your code.
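
In practice that means the client has to budget context on its own, along these lines (an illustrative sketch; the window size is looked up out-of-band and the token count is only approximated):

CONTEXT_LENGTH = 8192   # known from the model card, not discoverable via the API
MAX_NEW_TOKENS = 512

def rough_token_count(text: str) -> int:
    # Crude approximation (~4 characters per token); a real client would use
    # the model's own tokenizer.
    return len(text) // 4

def trim_history(messages: list[str]) -> list[str]:
    """Keep the most recent messages that fit in the remaining context budget."""
    budget = CONTEXT_LENGTH - MAX_NEW_TOKENS
    kept, used = [], 0
    for msg in reversed(messages):
        used += rough_token_count(msg)
        if used > budget:
            break
        kept.append(msg)
    return list(reversed(kept))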

The text-generation-webui has a value loaded from the model, shared.settings['truncation_length'], but it doesn't get set any higher if you use compression and a larger max_seq_len (it probably should), and because the blocking API and web UI use a client-side truncation_length, you don't notice that it isn't updated on the server.

This is exclusively an OpenAI API problem at this point, but the workaround is pretty simple for now: just update your models/config-user.yaml with the truncation length you want.

If this doesn't get fixed another way, I'll try another PR soon; the last fix sat for a few weeks and is no longer mergeable, but I think there is probably a better fix now anyway.

@github-actions

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

@matatonic
Contributor

I'm reopening this; it's still an issue, with only a workaround so far.

@teddybear082

It would be great if this could be fixed. I implemented textgenwebui as a potential option for users of a Skyrim AI NPC mod, and this inability to get the proper truncation length through the openai API is causing issues for them. Only hard-coding the value 4096 in completions.py, instead of referring back to shared.settings['truncation_length'], fixed the issue. Thanks for all your work on this, matatonic; hope you or someone else can figure this out and it gets merged.
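
For context, the hard-coded workaround described above amounts to something like this inside the extension's completion handler (a hypothetical illustration, not the actual contents of completions.py):

# Instead of reading the server-side setting, which may still hold the old default:
#   truncation_length = shared.settings['truncation_length']
# pin it to the model's known window size:
truncation_length = 4096  # hard-coded workaround; has to be edited per model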

@spreck

spreck commented Oct 2, 2023

This workaround no longer works. I set the models/config-user.yaml to be:

TheBloke_WizardCoder-Python-13B-V1.0-GPTQ_gptq-4bit-32g-actorder_True$:
  loader: ExLlama_HF
  use_fast: true
  cfg_cache: false
  gpu_split: '12'
  max_seq_len: 16384
  compress_pos_emb: 2
  alpha_value: 1
  rope_freq_base: 1000000
  truncation_length: 8192

But I get this message in the console:

Warning: $This model maximum context length is 8192 tokens. However, your messages resulted in over 784 tokens and max_tokens is 8192.

@pythonjohan

> This workaround no longer works. I set the models/config-user.yaml to be: [...]
>
> But I get this message in the console:
>
> Warning: $This model maximum context length is 8192 tokens. However, your messages resulted in over 784 tokens and max_tokens is 8192.

I couldn't get the config-yaml solution to work, but hardcoding the token limit into the openai extension worked for me: #4152 (comment)

@matatonic
Contributor

> This workaround no longer works. I set the models/config-user.yaml to be: [...]
>
> I couldn't get the config-yaml solution to work, but hardcoding the token limit into the openai extension worked for me: #4152 (comment)

Yeah, it seems kind of busted now. Context length and instruction format are both broken right now, AFAIK.
