CUDA 12.1 vllm==0.2.3 Double Free #1930
Comments
Hi @tjtanaa, thanks for reporting the bug! Which model are you using? Is it Mistral?
Yes, I am using OpenHermes-2.5, which is based on Mistral.
It is happening for me as well: CUDA 12.1, vLLM 0.2.6 with Mixtral 8x7B, for long prompts.
@WoosukKwon any tips on this?
+1
+1, same issue here using CUDA/12.1.1, Python/3.10.4-GCCcore-11.3.0, vllm==0.2.3. It happened after 5-10 inferences with a LoRA fine-tuned Mistral 7B model. EDIT: In our case the fine-tuned model was trained with 1024 input tokens; when this was exceeded, it caused the double free error.
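A minimal sketch of the workaround implied by the comment above, assuming the crash is triggered by prompts longer than the input length the fine-tuned model was trained with. The model name, the 1024-token limit, and the prompt are illustrative placeholders, not values confirmed in this issue:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "teknium/OpenHermes-2.5-Mistral-7B"  # example checkpoint; substitute your own
MAX_INPUT_TOKENS = 1024  # the input length the fine-tuned model was reportedly trained with

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)

def truncate_prompt(prompt: str) -> str:
    """Clip the prompt so it never exceeds MAX_INPUT_TOKENS."""
    ids = tokenizer(prompt, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
long_prompt = "..."  # a prompt longer than 1024 tokens would go here
outputs = llm.generate([truncate_prompt(long_prompt)], params)
print(outputs[0].outputs[0].text)
```

This only sidesteps the crash by keeping requests within the trained input length; it does not address the underlying double free in the engine.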
I tried this with FastChat, which uses vLLM as the backend. Both inputs raise the same double free error.