Error during inference with Mixtral 7bx8 GPTQ #2271
Comments
Got the same error with the original model.
Got the same error with a fine-tuned Mixtral 7bx8.
I tried to load a GPTQ version of Mixtral 8x7b and got an error, though a different one from the one posted here. I got:
I tried changing the dtype in the config.json to torch.float16 to fix it, but instead got the same error as in #2251. Maybe these two errors are actually the same, and related to vLLM not supporting torch.bfloat16? @casper-hansen
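(For reference, the edit being described targets the `torch_dtype` field of the model's Hugging Face `config.json`; the value there is a plain string, so `"float16"` rather than `torch.float16`. A minimal sketch of that edit, with a hypothetical local checkpoint path:)

```python
import json

# Hypothetical path to the downloaded GPTQ checkpoint's config
config_path = "Mixtral-8x7B-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Switch the checkpoint's advertised dtype from bfloat16 to float16
config["torch_dtype"] = "float16"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```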
You need to use float16 or half for quantization.
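(Rather than editing config.json, the dtype can also be overridden at load time: vLLM's `LLM` constructor accepts `dtype` and `quantization` arguments. A minimal sketch, using an example GPTQ checkpoint name; substitute your own model path:)

```python
from vllm import LLM, SamplingParams

# Example GPTQ checkpoint; replace with your own model path
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",  # GPTQ kernels expect float16/half, not bfloat16
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```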
I switched it to torch.float16 in the config.json, and my error changed to the one in #2251.
Did you try upgrading to the latest vLLM?
I'll try doing that now.
Yep! It seems like the latest vLLM has fixed this bug. Both GPTQ and AWQ are working for me now. Thanks for the help :)
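(For anyone landing here later: upgrading vLLM in place, e.g. `pip install --upgrade vllm`, and reloading the model with `dtype="float16"` as sketched above is what resolved the errors in this thread. The original traceback from the issue follows.)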
```
Traceback (most recent call last):
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 338, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 199, in step_async
    return self._process_model_outputs(output, scheduler_outputs) + ignored
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 562, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 554, in _process_sequence_group_outputs
    self.scheduler.free_seq(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/scheduler.py", line 312, in free_seq
    self.block_manager.free(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 277, in free
    self._free_block_table(block_table)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 268, in _free_block_table
    self.gpu_allocator.free(block)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 48, in free
    raise ValueError(f"Double free! {block} is already freed.")
ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=2611, ref_count=0) is already freed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 63, in add_cors_header
    response = await call_next(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
    raise app_exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 137, in generate
    async for request_output in results_generator:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 445, in generate
    raise e
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 439, in generate
    async for request_output in stream:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```