Error during inference with Mixtral 7bx8 GPTQ #2271
Comments
Got the same error with the original model.
Got the same error with a fine-tuned Mixtral 7bx8.
I tried to load a GPTQ version of Mixtral 8x7b and got an error, though a different one from the one posted here. I got:
I tried changing the dtype in the config.json to torch.float16 to fix it, but instead got the same error as in #2251. Maybe these two errors are actually the same, and related to vLLM not supporting torch.bfloat16? @casper-hansen
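(For reference, the edit being described targets the `torch_dtype` field of the model's Hugging Face `config.json`; the value there is a plain string, so `"float16"` rather than `torch.float16`. A minimal sketch of that edit, with a hypothetical local checkpoint path:)

```python
import json

# Hypothetical path to the downloaded GPTQ checkpoint's config
config_path = "Mixtral-8x7B-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Switch the checkpoint's advertised dtype from bfloat16 to float16
config["torch_dtype"] = "float16"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```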
You need to use float16 or half for quantization.
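(Rather than editing config.json, the dtype can also be overridden at load time: vLLM's `LLM` constructor accepts `dtype` and `quantization` arguments. A minimal sketch, using an example GPTQ checkpoint name; substitute your own model path:)

```python
from vllm import LLM, SamplingParams

# Example GPTQ checkpoint; replace with your own model path
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",  # GPTQ kernels expect float16/half, not bfloat16
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```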
I switched it to torch.float16 in the config.json, and my error changed to the one in #2251.
Did you try upgrading to the latest vLLM?
I'll try doing that now.
Yep! It seems like the latest vLLM has fixed this bug. Both GPTQ and AWQ are working for me now. Thanks for the help :)
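(For anyone landing here later: upgrading vLLM in place, e.g. `pip install --upgrade vllm`, and reloading the model with `dtype="float16"` as sketched above is what resolved the errors in this thread. The original traceback from the issue follows.)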
```
Traceback (most recent call last):
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 338, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 199, in step_async
    return self._process_model_outputs(output, scheduler_outputs) + ignored
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 562, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 554, in _process_sequence_group_outputs
    self.scheduler.free_seq(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/scheduler.py", line 312, in free_seq
    self.block_manager.free(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 277, in free
    self._free_block_table(block_table)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 268, in _free_block_table
    self.gpu_allocator.free(block)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 48, in free
    raise ValueError(f"Double free! {block} is already freed.")
ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=2611, ref_count=0) is already freed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 63, in add_cors_header
    response = await call_next(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
    raise app_exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 137, in generate
    async for request_output in results_generator:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 445, in generate
    raise e
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 439, in generate
    async for request_output in stream:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```