[Bug]: Loading mistral-7B-instruct-v03 KeyError: 'layers.0.attention.wk.weight' #4989

Closed
timbmg opened this issue May 22, 2024 · 25 comments
Labels
bug Something isn't working

Comments


timbmg commented May 22, 2024

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Rocky Linux 8.8 (Green Obsidian) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.28

Python version: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-513.9.1.el8_9.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9534 64-Core Processor
Stepping: 1
CPU MHz: 2450.000
CPU max MHz: 3718.0659
CPU min MHz: 1500.0000
BogoMIPS: 4900.22
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 5-6,133-134 0 N/A
NIC0 SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

🐛 Describe the bug

I am trying to load the new Mistral 7B Instruct v0.3 model, but it fails with KeyError: 'layers.0.attention.wk.weight'. Curiously, it seems to use the Llama model loader (see the stack trace); I am not sure whether that is intended.

KeyError                                  Traceback (most recent call last)
Cell In[13], line 43
     40 else:
     41     raise ValueError(model)
---> 43 llm = LLM(
     44     model=model_path, 
     45     dtype="float16",
     46     max_model_len=max_model_len,
     47     gpu_memory_utilization=gpu_memory_utilization,
     48     **kwargs
     49 )

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/entrypoints/llm.py:123, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    102     kwargs["disable_log_stats"] = True
    103 engine_args = EngineArgs(
    104     model=model,
    105     tokenizer=tokenizer,
   (...)
    121     **kwargs,
    122 )
--> 123 self.llm_engine = LLMEngine.from_engine_args(
    124     engine_args, usage_context=UsageContext.LLM_CLASS)
    125 self.request_counter = Counter()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:292, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    289     executor_class = GPUExecutor
    291 # Create the LLM engine.
--> 292 engine = cls(
    293     **engine_config.to_dict(),
    294     executor_class=executor_class,
    295     log_stats=not engine_args.disable_log_stats,
    296     usage_context=usage_context,
    297 )
    298 return engine

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:160, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
    156 self.seq_counter = Counter()
    157 self.generation_config_fields = _load_generation_config_dict(
    158     model_config)
--> 160 self.model_executor = executor_class(
    161     model_config=model_config,
    162     cache_config=cache_config,
    163     parallel_config=parallel_config,
    164     scheduler_config=scheduler_config,
    165     device_config=device_config,
    166     lora_config=lora_config,
    167     vision_language_config=vision_language_config,
    168     speculative_config=speculative_config,
    169     load_config=load_config,
    170 )
    172 self._initialize_kv_caches()
    174 # If usage stat is enabled, collect relevant info.

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
     38 self.vision_language_config = vision_language_config
     39 self.speculative_config = speculative_config
---> 41 self._init_executor()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:23, in GPUExecutor._init_executor(self)
     17 """Initialize the worker and load the model.
     18 
     19 If speculative decoding is enabled, we instead create the speculative
     20 worker.
     21 """
     22 if self.speculative_config is None:
---> 23     self._init_non_spec_worker()
     24 else:
     25     self._init_spec_worker()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:69, in GPUExecutor._init_non_spec_worker(self)
     67 self.driver_worker = self._create_worker()
     68 self.driver_worker.init_device()
---> 69 self.driver_worker.load_model()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/worker.py:118, in Worker.load_model(self)
    117 def load_model(self):
--> 118     self.model_runner.load_model()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/model_runner.py:164, in ModelRunner.load_model(self)
    162 def load_model(self) -> None:
    163     with CudaMemoryProfiler() as m:
--> 164         self.model = get_model(
    165             model_config=self.model_config,
    166             device_config=self.device_config,
    167             load_config=self.load_config,
    168             lora_config=self.lora_config,
    169             vision_language_config=self.vision_language_config,
    170             parallel_config=self.parallel_config,
    171             scheduler_config=self.scheduler_config,
    172         )
    174     self.model_memory_usage = m.consumed_memory
    175     logger.info("Loading model weights took %.4f GB",
    176                 self.model_memory_usage / float(2**30))

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
     13 def get_model(
     14         *, model_config: ModelConfig, load_config: LoadConfig,
     15         device_config: DeviceConfig, parallel_config: ParallelConfig,
     16         scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
     17         vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
     18     loader = get_model_loader(load_config)
---> 19     return loader.load_model(model_config=model_config,
     20                              device_config=device_config,
     21                              lora_config=lora_config,
     22                              vision_language_config=vision_language_config,
     23                              parallel_config=parallel_config,
     24                              scheduler_config=scheduler_config)

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py:224, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
    221 with torch.device(device_config.device):
    222     model = _initialize_model(model_config, self.load_config,
    223                               lora_config, vision_language_config)
--> 224 model.load_weights(
    225     self._get_weights_iterator(model_config.model,
    226                                model_config.revision,
    227                                fall_back_to_pt=getattr(
    228                                    model,
    229                                    "fall_back_to_pt_during_load",
    230                                    True)), )
    231 for _, module in model.named_modules():
    232     quant_method = getattr(module, "quant_method", None)

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/models/llama.py:415, in LlamaForCausalLM.load_weights(self, weights)
    413 if name.endswith(".bias") and name not in params_dict:
    414     continue
--> 415 param = params_dict[name]
    416 weight_loader = getattr(param, "weight_loader",
    417                         default_weight_loader)
    418 weight_loader(param, loaded_weight)

KeyError: 'layers.0.attention.wk.weight'
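
A quick way to see which weight-naming convention a checkpoint uses is to list the tensor keys in one of its safetensors files. Below is a minimal, hypothetical check; it assumes the safetensors package is installed and a shard has been downloaded locally (the filename is illustrative):

```python
# Hedged sketch: print a few tensor names from a local safetensors shard.
# HF-format shards contain names like 'model.layers.0.self_attn.k_proj.weight',
# while the Mistral-native consolidated checkpoint contains names like
# 'layers.0.attention.wk.weight', which vLLM's Llama loader does not recognize.
from safetensors import safe_open

with safe_open("consolidated.safetensors", framework="pt") as f:  # illustrative path
    print(sorted(f.keys())[:5])
```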
timbmg added the bug label May 22, 2024

sn-rf commented May 22, 2024

I am also facing the same issue

NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.2
A100 - 80GB

Package Version
torch 2.3.0
torchmetrics 1.4.0.post0
torchvision 0.18.0
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.550.52
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.40
nvidia-nvtx-cu12 12.1.105
numpy 1.26.4
vllm 0.4.2
vllm-nccl-cu12 2.18.1.0.4.0
sentence-transformers 2.7.0
transformers 4.41.0

@binarycrayon

subscribed, thanks for the bug report

@ckgresla

+1

@s-natsubori

+1 same issue

@Yueeeeeeee

+1 same issue

@robertgshaw2-neuralmagic
Collaborator

Fixed by #5005

mgoin closed this as completed May 24, 2024
@yananchen1989

@robertgshaw2-neuralmagic
Still facing this issue with vLLM version 0.5.4:

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    dtype="float16",
    max_model_len=4000,
    tensor_parallel_size=1,
    gpu_memory_utilization=1,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,
)

@yananchen1989

WARNING 08-20 23:25:13 config.py:1454] Casting torch.bfloat16 to torch.float16.
WARNING 08-20 23:25:13 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-20 23:25:13 config.py:1342] bitsandbytes quantization is not tested with LoRA yet.
INFO 08-20 23:25:13 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████| 141k/141k [00:00<00:00, 12.8MB/s]
INFO 08-20 23:25:15 model_runner.py:720] Starting to load model mistralai/Mistral-7B-Instruct-v0.3...
INFO 08-20 23:25:15 loader.py:871] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 08-20 23:25:15 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:05, 1.76s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:03, 1.80s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/moa/test_add.py", line 16, in
[rank0]: llm = LLM(model= args.llm_name, dtype='float16', max_model_len=4000,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 158, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 445, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in init
[rank0]: self.model_executor = executor_class(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in init
[rank0]: self._init_executor()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 946, in load_model
[rank0]: self._load_weights(model_config, model)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 885, in _load_weights
[rank0]: model.load_weights(qweight_iterator)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 513, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'layers.0.attention.wk.weight'
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:03, 1.92s/it]

@C3po-D2rd2

Same here!

@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-21 12:34:00 model_runner.py:720] Starting to load model /home/barbatus/finetuning/models...
INFO 08-21 12:34:00 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-21 12:34:00 selector.py:54] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File ".local/bin/vllm", line 8, in
[rank0]: sys.exit(main())
[rank0]: File ".local/lib/python3.10/site-packages/vllm/scripts.py", line 149, in main
[rank0]: args.dispatch_function(args)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/scripts.py", line 30, in serve
[rank0]: asyncio.run(run_server(args))
[rank0]: File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]: return loop.run_until_complete(main)
[rank0]: File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]: return future.result()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 342, in run_server
[rank0]: async with build_async_engine_client(args) as async_engine_client:
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
[rank0]: return await anext(self.gen)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 102, in build_async_engine_client
[rank0]: async_engine_client = AsyncLLMEngine.from_engine_args(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
[rank0]: engine = cls(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in init
[rank0]: self.model_executor = executor_class(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in init
[rank0]: self._init_executor()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File ".local/lib/python3.10/site-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File ".local/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
[rank0]: model.load_weights(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/model_executor/models/llama_embedding.py", line 84, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'layers.0.attention.wk.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]

@C3po-D2rd2

@mgoin I think the issue is not solved in the current vLLM version (0.5.4).

Collaborator

mgoin commented Aug 21, 2024

The issues you are reporting are likely due to other arguments like the bitsandbytes quantization

I just ran the model on 0.5.4 and on main with default arguments and it loaded fine:

vllm serve mistralai/Mistral-7B-Instruct-v0.3

INFO 08-21 16:11:14 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 16:11:14 api_server.py:340] args: Namespace(model_tag='mistralai/Mistral-7B-Instruct-v0.3', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='mistralai/Mistral-7B-Instruct-v0.3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fb01e0ec820>)
WARNING 08-21 16:11:15 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 16:11:15 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 16:11:16 model_runner.py:720] Starting to load model mistralai/Mistral-7B-Instruct-v0.3...
INFO 08-21 16:11:16 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.67it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.28it/s]

INFO 08-21 16:11:19 model_runner.py:732] Loading model weights took 13.5083 GB
INFO 08-21 16:11:23 gpu_executor.py:102] # GPU blocks: 27438, # CPU blocks: 2048
INFO 08-21 16:11:24 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-21 16:11:24 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-21 16:11:35 model_runner.py:1225] Graph capturing finished in 11 secs.
WARNING 08-21 16:11:35 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-21 16:11:35 launcher.py:14] Available routes are:
INFO 08-21 16:11:35 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /health, Methods: GET
INFO 08-21 16:11:35 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-21 16:11:35 launcher.py:22] Route: /version, Methods: GET
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [2887700]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

@yananchen1989

Yes, without the arguments --quantization bitsandbytes --load_format bitsandbytes --enforce_eager, mistralai/Mistral-7B-Instruct-v0.3 works fine.

This issue only applies to Mistral; Llama 3.1 is not affected.

Collaborator

robertgshaw2-neuralmagic commented Aug 21, 2024

@yananchen1989

Could you investigate and submit a fix? This is likely due to this Mistral model having multiple copies of the checkpoint with slightly different state dicts, which seems to interact poorly with bitsandbytes.
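
One hedged way to see the two copies side by side is to list the repo files (this assumes the huggingface_hub package is installed and the gated repo is accessible):

```python
# Hedged sketch: list the safetensors files shipped in the model repo.
from huggingface_hub import list_repo_files

files = list_repo_files("mistralai/Mistral-7B-Instruct-v0.3")
print([f for f in files if f.endswith(".safetensors")])
# The repo is expected to contain both the HF-format shards
# (model-0000*-of-*.safetensors) and a Mistral-native consolidated.safetensors,
# whose 'layers.0.attention.wk.weight' naming is what triggers the KeyError.
```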

@C3po-D2rd2

I am not sure I understand, because I never set those arguments anywhere; I am launching it the same way as you. Maybe my issue comes from the config.json: I used the params.json delivered with the model and had to add this:
"model_type": "mistral",
"architectures": ["MistralModel"]

@robertgshaw2-neuralmagic
Collaborator

I am not sure to understand because I never set those argument anywhere. I am launching it the same way than you. May be my issue come from the config.json, I used the params.json delivered with the model and had to add that: "model_type": "mistral", "architectures": ["MistralModel"]

Post your config.json?

@C3po-D2rd2

{
    "dim": 4096,
    "n_layers": 32,
    "head_dim": 128,
    "hidden_dim": 14336,
    "n_heads": 32,
    "n_kv_heads": 8,
    "norm_eps": 1e-05,
    "vocab_size": 32768,
    "rope_theta": 1000000.0,
    "model_type": "mistral",
    "architectures": ["MistralModel"]
}

Collaborator

robertgshaw2-neuralmagic commented Aug 21, 2024

This is not a valid HF transformers config.

I am not sure what the params.json in the checkpoint is, but vLLM supports only the official transformers config.json.
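
If it helps, the official config can be pulled straight from the Hugging Face repo. A minimal sketch, assuming the huggingface_hub package is installed and you have access to the repo:

```python
# Hedged sketch: download the official transformers config.json for the model.
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    filename="config.json",
)
print(cfg_path)  # place this file next to the weights instead of params.json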

@C3po-D2rd2

How can I find an example?

@robertgshaw2-neuralmagic
Collaborator

@C3po-D2rd2

Thanks a lot!
I am using it now and I still have the same issue. Here is the full trace:

$ vllm serve ~/finetuning/models/7B
INFO 08-21 16:45:21 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 16:45:21 api_server.py:340] args: Namespace(model_tag='/home/barbatus/finetuning/models/7B', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/barbatus/finetuning/models/7B', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fe875ef9510>)
WARNING 08-21 16:45:21 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 16:45:21 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/barbatus/finetuning/models/7B', speculative_config=None, tokenizer='/home/barbatus/finetuning/models/7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/barbatus/finetuning/models/7B, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 08-21 16:45:21 model_runner.py:720] Starting to load model /home/barbatus/finetuning/models/7B...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in init
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in init
self.engine = self._init_engine(*args, **kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in init
self.model_executor = executor_class(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in init
self._init_executor()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
model.load_weights(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 513, in load_weights
param = params_dict[name]
KeyError: 'layers.0.attention.wk.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]

@C3po-D2rd2

I had to rename the tokenizer from tokenizer.model.v3 to tokenizer.model, otherwise it is not found.

@robertgshaw2-neuralmagic
Collaborator

The original checkpoint had two copies:

  • one with merged qkv and merged gate_up projections
  • one with unmerged weights

vLLM supports loading the unmerged one; it seems like your checkpoint has the merged weights.

I'm not sure how you saved or made this checkpoint, but it seems like you're not saving it in the Hugging Face format via save_pretrained().
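
For reference, a minimal sketch of re-exporting a model in the HF layout vLLM expects (unfused q/k/v and gate/up projections). It assumes the weights are loadable with transformers; the output directory name is arbitrary:

```python
# Hedged sketch: re-save a Mistral model in the Hugging Face format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # or a local HF-format directory
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

out_dir = "mistral-7b-instruct-v0.3-hf"  # illustrative output path
model.save_pretrained(out_dir)           # writes config.json plus sharded safetensors
tokenizer.save_pretrained(out_dir)
```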


C3po-D2rd2 commented Aug 21, 2024

I am using the tarball from Mistral, found here: https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar
I used it to run some fine-tuning, but it does not modify the base model (the LoRA checkpoint is in another directory), so it might be an issue from Mistral directly?

@robertgshaw2-neuralmagic
Collaborator

I am using the tarball from mistral, I found here: https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar
I used it to run some fine-tuning but it does not modify the base model (the lora checkpoint is in an other directory), so it might be an issue from mistral directly?

Not sure; I don't know anything about how you did the finetuning. Either way, you need to save the model in the HF format with unfused linear layers to use it with vLLM.
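
If the finetuning produced a LoRA adapter, one hedged option (assuming the adapter follows the PEFT layout, which may not be true depending on the finetuning tool) is to merge it into the HF base model, save the result, and point vLLM at that directory:

```python
# Hedged sketch: merge a PEFT-style LoRA adapter into the HF base model.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
merged = PeftModel.from_pretrained(base, "path/to/lora-checkpoint")  # hypothetical adapter path
merged = merged.merge_and_unload()
merged.save_pretrained("mistral-7b-instruct-v0.3-merged")  # illustrative output path
```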

@C3po-D2rd2

Thank you for your help. I finally downloaded everything from Hugging Face and was able to make it work. I will try to finetune from there and see.
