Error Occurs After Asking Consecutive Questions in LLM-Chatbot #2421

Open
tim102187S opened this issue Sep 27, 2024 · 7 comments · Fixed by #2428
Labels
bug Something isn't working

Comments

@tim102187S

I am using OpenVINO 2024.4.0 and have downloaded the llama-3-8b-instruct model for use. When I run multiple consecutive queries, an error occurs (usually on the third query). I have checked my device’s memory usage, and it never reaches 100%.

Here is the error report I received:

Selected model llama-3-8b-instruct
Checkbox(value=True, description='Prepare INT4 model')
Checkbox(value=False, description='Prepare INT8 model')
Checkbox(value=False, description='Prepare FP16 model')
Size of model with INT4 compressed weights is 5085.79 MB
Loading model from /home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama-3-8b-instruct/INT4_compressed_weights
Compiling the model to CPU ...
Running on local URL: http://127.0.0.1:7861

To create a public link, set share=True in launch().
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Traceback (most recent call last):
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/blocks.py", line 1935, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/blocks.py", line 1532, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 671, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 664, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2405, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 914, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 647, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 809, in gen_wrapper
response = next(iterator)
^^^^^^^^^^^^^^
File "/home/adv/Downloads/EAS_GenAI_Intel14th/docker_build/llm_chatbot/run_chatbot.py", line 532, in bot
for new_text in streamer:
File "/home/adv/openvino-llm/lib/python3.12/site-packages/transformers/generation/streamers.py", line 223, in next
value = self.text_queue.get(timeout=self.timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/queue.py", line 179, in get
raise Empty
_queue.Empty
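
(For reference: the attention-mask warnings in the log above are emitted by transformers when generate() is called with bare input ids. A minimal sketch of passing the mask explicitly, where tok and ov_model are assumed stand-ins for the notebook's tokenizer and model objects, not the notebook's exact code:

inputs = tok("What is OpenVINO?", return_tensors="pt")
# Supplying attention_mask and an explicit pad_token_id avoids the
# "attention mask and the pad token id were not set" warnings.
output_ids = ov_model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pad_token_id=tok.eos_token_id,
    max_new_tokens=64,
)
print(tok.decode(output_ids[0], skip_special_tokens=True))

These warnings are benign here; the actual failure is the _queue.Empty raised by the streamer.)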

@brmarkus

brmarkus commented Sep 27, 2024

Are you talking about "https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot" (where llama-3-8b-instruct is mentioned) or "https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-question-answering" (where e.g. tiny-llama-1b-chat is mentioned)?

Can you provide more details about your system, please (SoC, amount of memory, OS, version of Python, etc.)?

Can you provide example prompts, please?

@tim102187S
Author

tim102187S commented Sep 27, 2024

Thank you for your response.

I am using the model (llama-3-8b-instruct) and code from this project: https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot/llm-chatbot.ipynb

Here are my system details:

OS: Ubuntu 24.04
Memory: 32GB
CPU: Intel(R) Core(TM) Ultra 7 165U
Python version: 3.12.3

The prompts I am using also come from the examples in the project, such as:

"hello there! How are you doing?"
"What is OpenVINO?"
"Who are you?"
"Can you explain to me briefly what is Python programming language?"
etc.

Please let me know if you need any further information.

@brmarkus

Have you seen errors or warnings in the steps for conversion and compression?

Do you see the same when using the INT8 or FP16 variant instead of the INT4 variant?

Do you start the Jupyter notebook from within a virtual environment (with a "guaranteed" set of component versions), or "global-local" (using the components installed globally on your local machine)?

Do you use a specific version or branch of the OpenVINO-Notebooks repo, or the "latest head revision"?

When running under MS-Win11 with the latest version I can query multiple prompts without problems using the INT4 model... (but my laptop has 64GB of RAM and a Core-Ultra-7-155H)

@tim102187S
Author

tim102187S commented Sep 27, 2024

Thank you for your suggestions.

I did not see any errors or warnings during the model conversion and compression steps.

We have not yet tried using the non-INT4 variants, as the focus of our research project is primarily on INT4 models.

We are running the Jupyter notebook in a Python virtual environment and following the steps outlined in the llm-chatbot.ipynb notebook.

This research project requires the use of the Ubuntu 24.04 system, so we are hoping to resolve the issue within this setup. (During the execution of the chatbot, the memory usage is approximately 7GB, so the errors are not due to insufficient memory.)

@brmarkus

For conversion and compression I would anyway expect the operating system to start swapping memory to HDD/SSD if the system memory is not big enough...

Let's see if someone else can reproduce it under a similar environment... sorry, I can't reproduce the problems you describe.
Have you modified the code or the model?

Can you reproduce it with another model?

@tim102187S
Author

Thank you for your follow-up.

I have also tried using the llama-2-7b-chatbot model with INT4, INT8, and FP16, and I encountered the same issue in all cases.

Additionally, I would like to clarify that I have not made any modifications to the code or the model.

@aleksandr-mokrov
Contributor

@tim102187S, it looks like it's due to the 30-second timeout. Could you try increasing the value, or removing it entirely, in this row and check:
streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True)
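
A minimal sketch of the suggested change (the 120-second value is illustrative; tok is the notebook's tokenizer variable):

from transformers import TextIteratorStreamer

# Option 1: raise the timeout so slow CPU generations don't exhaust it
streamer = TextIteratorStreamer(tok, timeout=120.0, skip_prompt=True, skip_special_tokens=True)

# Option 2: omit timeout entirely (it defaults to None); the streamer's
# internal queue.get() then blocks until the next chunk of text arrives
# instead of raising queue.Empty
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

Note that with no timeout a stalled generation would block the Gradio worker indefinitely rather than fail with _queue.Empty, so a generous finite timeout may be the safer choice.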

@andrei-kochin andrei-kochin reopened this Oct 2, 2024
@avitial avitial added the bug Something isn't working label Oct 3, 2024