
openai api extension can use long context model: if the model context is 16k, the api can also use this 16k context #3668

Closed
elven2016 opened this issue Aug 24, 2023 · 8 comments
Labels
enhancement New feature or request stale

Comments

@elven2016

Description
When using a long-context model, the API only supports 4k max tokens; 16k max tokens is needed. Please add this feature.
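For reference, the kind of request involved looks roughly like the sketch below, using the legacy openai-python 0.27 bindings pointed at the extension's local endpoint (the port, model name, and token counts are illustrative, taken loosely from the logs later in this thread):

```python
import openai

# Point the legacy openai-python (0.27.x) client at the local
# text-generation-webui openai extension instead of api.openai.com.
openai.api_key = "sk-111111111111111111111111111111111111111111111111"  # dummy key
openai.api_base = "http://127.0.0.1:5001/v1"

# With a 16k-context model loaded, a long prompt plus a large max_tokens
# should be usable end to end rather than being capped around 4k.
response = openai.ChatCompletion.create(
    model="chatglm2-6b",                            # illustrative model name
    messages=[{"role": "user", "content": "..."}],  # long retrieval-augmented prompt
    max_tokens=8192,
)
print(response["choices"][0]["message"]["content"])
```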



elven2016 added the enhancement (New feature or request) label on Aug 24, 2023
@matatonic
Contributor

See #3153 for a workaround with openai

@elven2016
Author

> See #3153 for a workaround with openai

Thanks for your reply. I changed config.yml and set truncate_length=8192; the longer context seems to be activated, but the completion call gets an error:
[screenshot of the error]

@matatonic
Contributor

matatonic commented Aug 28, 2023 via email

Can you include the server logs for this error? It should have a full stack trace. Ideally, please enable the OPENEDAI_DEBUG=1 environment variable too.

@elven2016
Author

The log prints the following:
Traceback (most recent call last):
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/home/elven/finreport/finapp/app.py", line 182, in <module>
main()
File "/home/elven/finreport/finapp/app.py", line 120, in main
handle_userinput(user_question)
File "/home/elven/finreport/finapp/app.py", line 82, in handle_userinput
response = st.session_state.conversation({'question': user_question})
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 282, in call
raise e
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/conversational_retrieval/base.py", line 141, in _call
answer = self.combine_docs_chain.run(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 480, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 282, in call
raise e
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 105, in _call
output, extra_return_dict = self.combine_docs(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 171, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/llm.py", line 255, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 282, in call
raise e
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/llm.py", line 91, in _call
response = self.generate([inputs], run_manager=run_manager)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chains/llm.py", line 101, in generate
return self.llm.generate_prompt(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/base.py", line 414, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/base.py", line 309, in generate
raise e
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/base.py", line 299, in generate
self._generate_with_cache(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/base.py", line 446, in _generate_with_cache
return self._generate(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/openai.py", line 345, in _generate
response = self.completion_with_retry(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/openai.py", line 278, in completion_with_retry
return _completion_with_retry(**kwargs)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/tenacity/init.py", line 289, in wrapped_f
return self(f, *args, **kw)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/tenacity/init.py", line 379, in call
do = self.iter(retry_state=retry_state)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/tenacity/init.py", line 325, in iter
raise retry_exc.reraise()
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/tenacity/init.py", line 158, in reraise
raise self.last_attempt.result()
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/tenacity/init.py", line 382, in call
result = fn(*args, **kwargs)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/langchain/chat_models/openai.py", line 276, in _completion_with_retry
return self.client.create(**kwargs)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/openai/api_resources/chat_completion.py", line 25, in create
return super().create(*args, **kwargs)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/home/elven/miniconda3/envs/finreport/lib/python3.10/site-packages/openai/api_requestor.py", line 765, in _interpret_response_line
raise self.handle_error_response(
openai.error.APIError: UnboundLocalError("local variable 'tokens' referenced before assignment") {"error": {"message": "UnboundLocalError("local variable 'tokens' referenced before assignment")", "code": 500, "type": "OpenAIError", "param": ""}} 500 {'error': {'message': 'UnboundLocalError("local variable 'tokens' referenced before assignment")', 'code': 500, 'type': 'OpenAIError', 'param': ''}} {'Connection': 'close', 'Content-Length': '150', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'Origin, Accept, X-Requested-With, Content-Type, Access-Control-Request-Method, Access-Control-Request-Headers, Authorization', 'Access-Control-Allow-Methods': 'GET,HEAD,OPTIONS,POST,PUT', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/json', 'Date': 'Mon, 28 Aug 2023 08:14:40 GMT', 'Server': 'BaseHTTP/0.6 Python/3.10.10'}

@matatonic
Contributor

matatonic commented Aug 30, 2023 via email

@elven2016
Author

elven2016 commented Aug 31, 2023

Ignore the message output; from the error printout it seems the issue is that the context is too long and CUDA runs out of memory, even though the GPU appears to have enough free memory.

127.0.0.1 - - [31/Aug/2023 15:03:02] "POST /v1/chat/completions HTTP/1.1" 500 -
b'{"error": {"message": "UnboundLocalError(\"local variable 'tokens' referenced before assignment\")", "code": 500, "type": "OpenAIError", "param": ""}}'
POST /v1/chat/completions HTTP/1.1
Host: 127.0.0.1:5001
User-Agent: OpenAI/v1 PythonBindings/0.27.9
Content-Length: 27513
Accept: */*
Accept-Encoding: gzip, deflate
Authorization: Bearer sk-111111111111111111111111111111111111111111111111
Content-Type: application/json
X-Openai-Client-User-Agent: {"bindings_version": "0.27.9", "httplib": "requests", "lang": "python", "lang_version": "3.10.10", "platform": "Linux-5.19.0-1010-nvidia-lowlatency-x86_64-with-glibc2.35", "publisher": "openai", "uname": "Linux 5.19.0-1010-nvidia-lowlatency #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023 x86_64 x86_64"}

{'messages': [{'role': 'system', 'content': "Use the following pieces of context to answer the users question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n XXXXXX(igonre the long content)'}], 'model': 'chatglm2-6b', 'max_tokens': None, 'stream': False, 'n': 1, 'temperature': 0.0}
Loaded instruction role format: Vicuna-v1.1
Warning: $This model maximum context length is 16384 tokens. However, your messages resulted in over 7420 tokens and max_tokens is 16384.
{'prompt': "Use the following pieces of context to answer the users question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n XXXXXX(igonre the long content)
。\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: 开山集团在地热能源方面做了哪些工作?\nASSISTANT:", 'req_params': {'max_new_tokens': 8964, 'auto_max_new_tokens': False, 'max_tokens_second': 0, 'temperature': 0.01, 'top_p': 1.0, 'top_k': 20, 'repetition_penalty': 1.18, 'repetition_penalty_range': 0, 'encoder_repetition_penalty': 1.0, 'suffix': None, 'stream': False, 'echo': False, 'seed': -1, 'truncation_length': 16384, 'add_bos_token': True, 'do_sample': True, 'typical_p': 1.0, 'epsilon_cutoff': 0.0, 'eta_cutoff': 0.0, 'tfs': 1.0, 'top_a': 0.0, 'min_length': 0, 'no_repeat_ngram_size': 0, 'num_beams': 1, 'penalty_alpha': 0.0, 'length_penalty': 1.0, 'early_stopping': False, 'mirostat_mode': 0, 'mirostat_tau': 5.0, 'mirostat_eta': 0.1, 'guidance_scale': 1, 'negative_prompt': '', 'ban_eos_token': False, 'skip_special_tokens': True, 'custom_stopping_strings': ''}}
Traceback (most recent call last):
File "/ssd_data01/text-generation-webui/modules/callbacks.py", line 56, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/ssd_data01/text-generation-webui/modules/text_generation.py", line 321, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/generation/utils.py", line 1642, in generate
return self.sample(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/generation/utils.py", line 2724, in sample
outputs = self(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 335, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.61 GiB (GPU 0; 23.65 GiB total capacity; 5.44 GiB already allocated; 8.93 GiB free; 6.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 0.25 seconds (0.00 tokens/s, 0 tokens, context 15109, seed 867992534)
OpenAIError UnboundLocalError("local variable 'tokens' referenced before assignment")
Traceback (most recent call last):
File "/ssd_data01/text-generation-webui/extensions/openai/script.py", line 101, in wrapper
func(self)
File "/ssd_data01/text-generation-webui/extensions/openai/script.py", line 172, in do_POST
response = OAIcompletions.chat_completions(body, is_legacy=is_legacy)
File "/ssd_data01/text-generation-webui/extensions/openai/completions.py", line 295, in chat_completions
completion_token_count = len(encode(answer)[0])
File "/ssd_data01/text-generation-webui/modules/text_generation.py", line 113, in encode
input_ids = shared.tokenizer.encode(str(prompt), return_tensors='pt', add_special_tokens=add_special_tokens)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2373, in encode
encoded_inputs = self.encode_plus(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2781, in encode_plus
return self._encode_plus(
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 656, in _encode_plus
first_ids = get_input_ids(text)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 623, in get_input_ids
tokens = self.tokenize(text, **kwargs)
File "/home/elven/miniconda3/envs/tgweb/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 208, in tokenize
if tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
UnboundLocalError: local variable 'tokens' referenced before assignment

127.0.0.1 - - [31/Aug/2023 11:48:15] "POST /v1/chat/completions HTTP/1.1" 500 -
b'{"error": {"message": "UnboundLocalError(\"local variable 'tokens' referenced before assignment\")", "code": 500, "type": "OpenAIError", "param": ""}}

Another thing is that the GPU still seems to have enough free resources when the API returns the error:
[screenshot of GPU memory usage]
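For what it's worth, the 13.61 GiB allocation in the CUDA OOM message above is about what vanilla (non-FlashAttention) attention needs to materialise a full seq × seq score matrix at a context of 15109 tokens. A rough back-of-the-envelope check, assuming a LLaMA-style model with 32 attention heads running in fp16 (the head count and dtype are assumptions, not taken from this issue):

```python
# Rough size of the attention-score buffer that the
# torch.matmul(query_states, key_states.transpose(2, 3)) call in
# modeling_llama.py has to allocate for one layer's forward pass.
heads = 32            # assumption: typical for a 7B LLaMA-style model
seq = 15109           # "context 15109" from the server log above
bytes_per_elem = 2    # fp16

attn_scores_bytes = heads * seq * seq * bytes_per_elem
print(f"{attn_scores_bytes / 2**30:.2f} GiB")  # ~13.61 GiB, matching the OOM message
```

So even with roughly 9 GiB reported free, a single attention layer at that context length cannot fit, which would explain why the GPU looks mostly idle right up until the request fails.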

@matatonic
Contributor

matatonic commented Aug 31, 2023 via email

github-actions bot added the stale label on Oct 12, 2023
@github-actions

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
