[Bug]: Loading mistral-7B-instruct-v03 KeyError: 'layers.0.attention.wk.weight' #4989

timbmg · 2024-05-22T18:52:58Z

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Rocky Linux 8.8 (Green Obsidian) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.28

Python version: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-513.9.1.el8_9.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9534 64-Core Processor
Stepping: 1
CPU MHz: 2450.000
CPU max MHz: 3718.0659
CPU min MHz: 1500.0000
BogoMIPS: 4900.22
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 5-6,133-134 0 N/A
NIC0 SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

🐛 Describe the bug

I am trying to load the new Mistral-7B-Instruct-v0.3 model, but loading fails with KeyError: 'layers.0.attention.wk.weight'. Curiously, it seems to use the Llama model loader (see stack trace); I am not sure whether that is intended.

KeyError                                  Traceback (most recent call last)
Cell In[13], line 43
     40 else:
     41     raise ValueError(model)
---> 43 llm = LLM(
     44     model=model_path, 
     45     dtype="float16",
     46     max_model_len=max_model_len,
     47     gpu_memory_utilization=gpu_memory_utilization,
     48     **kwargs
     49 )

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/entrypoints/llm.py:123, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    102     kwargs["disable_log_stats"] = True
    103 engine_args = EngineArgs(
    104     model=model,
    105     tokenizer=tokenizer,
   (...)
    121     **kwargs,
    122 )
--> 123 self.llm_engine = LLMEngine.from_engine_args(
    124     engine_args, usage_context=UsageContext.LLM_CLASS)
    125 self.request_counter = Counter()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:292, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    289     executor_class = GPUExecutor
    291 # Create the LLM engine.
--> 292 engine = cls(
    293     **engine_config.to_dict(),
    294     executor_class=executor_class,
    295     log_stats=not engine_args.disable_log_stats,
    296     usage_context=usage_context,
    297 )
    298 return engine

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/engine/llm_engine.py:160, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
    156 self.seq_counter = Counter()
    157 self.generation_config_fields = _load_generation_config_dict(
    158     model_config)
--> 160 self.model_executor = executor_class(
    161     model_config=model_config,
    162     cache_config=cache_config,
    163     parallel_config=parallel_config,
    164     scheduler_config=scheduler_config,
    165     device_config=device_config,
    166     lora_config=lora_config,
    167     vision_language_config=vision_language_config,
    168     speculative_config=speculative_config,
    169     load_config=load_config,
    170 )
    172 self._initialize_kv_caches()
    174 # If usage stat is enabled, collect relevant info.

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
     38 self.vision_language_config = vision_language_config
     39 self.speculative_config = speculative_config
---> 41 self._init_executor()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:23, in GPUExecutor._init_executor(self)
     17 """Initialize the worker and load the model.
     18 
     19 If speculative decoding is enabled, we instead create the speculative
     20 worker.
     21 """
     22 if self.speculative_config is None:
---> 23     self._init_non_spec_worker()
     24 else:
     25     self._init_spec_worker()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/executor/gpu_executor.py:69, in GPUExecutor._init_non_spec_worker(self)
     67 self.driver_worker = self._create_worker()
     68 self.driver_worker.init_device()
---> 69 self.driver_worker.load_model()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/worker.py:118, in Worker.load_model(self)
    117 def load_model(self):
--> 118     self.model_runner.load_model()

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/worker/model_runner.py:164, in ModelRunner.load_model(self)
    162 def load_model(self) -> None:
    163     with CudaMemoryProfiler() as m:
--> 164         self.model = get_model(
    165             model_config=self.model_config,
    166             device_config=self.device_config,
    167             load_config=self.load_config,
    168             lora_config=self.lora_config,
    169             vision_language_config=self.vision_language_config,
    170             parallel_config=self.parallel_config,
    171             scheduler_config=self.scheduler_config,
    172         )
    174     self.model_memory_usage = m.consumed_memory
    175     logger.info("Loading model weights took %.4f GB",
    176                 self.model_memory_usage / float(2**30))

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
     13 def get_model(
     14         *, model_config: ModelConfig, load_config: LoadConfig,
     15         device_config: DeviceConfig, parallel_config: ParallelConfig,
     16         scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
     17         vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
     18     loader = get_model_loader(load_config)
---> 19     return loader.load_model(model_config=model_config,
     20                              device_config=device_config,
     21                              lora_config=lora_config,
     22                              vision_language_config=vision_language_config,
     23                              parallel_config=parallel_config,
     24                              scheduler_config=scheduler_config)

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py:224, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
    221 with torch.device(device_config.device):
    222     model = _initialize_model(model_config, self.load_config,
    223                               lora_config, vision_language_config)
--> 224 model.load_weights(
    225     self._get_weights_iterator(model_config.model,
    226                                model_config.revision,
    227                                fall_back_to_pt=getattr(
    228                                    model,
    229                                    "fall_back_to_pt_during_load",
    230                                    True)), )
    231 for _, module in model.named_modules():
    232     quant_method = getattr(module, "quant_method", None)

File ~/miniconda/envs/project-experiments-py39-vllm3/lib/python3.9/site-packages/vllm/model_executor/models/llama.py:415, in LlamaForCausalLM.load_weights(self, weights)
    413 if name.endswith(".bias") and name not in params_dict:
    414     continue
--> 415 param = params_dict[name]
    416 weight_loader = getattr(param, "weight_loader",
    417                         default_weight_loader)
    418 weight_loader(param, loaded_weight)

KeyError: 'layers.0.attention.wk.weight'
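
The key in the error follows Mistral's native checkpoint naming (`layers.N.attention.wk.weight`), while the Llama-style loader in the trace looks parameters up under different names, so a quick way to narrow this down is to see which tensor names each weight file actually contains. The following is a minimal diagnostic sketch, assuming the weights are stored as `.safetensors` files; it reads only each file's JSON header. The helper names and the `layers.` prefix heuristic are illustrative assumptions, not part of vLLM.

```python
import json
import struct
from pathlib import Path


def safetensors_keys(path):
    """Return tensor names from a .safetensors file by reading only its header."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by that many bytes of JSON metadata.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]


def looks_mistral_native(keys):
    """Heuristic (assumption): native keys start with 'layers.'."""
    return any(k.startswith("layers.") for k in keys)


def report(model_dir):
    """Map each .safetensors file name in model_dir to True if it looks native."""
    return {
        p.name: looks_mistral_native(safetensors_keys(p))
        for p in sorted(Path(model_dir).glob("*.safetensors"))
    }
```

Calling `report("/path/to/model")` on a local model directory (the path is a placeholder) returns one entry per weight file; any file flagged `True` would be a candidate source of the unexpected key.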


sn-rf · 2024-05-22T19:27:53Z

I am also facing the same issue

NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.2
A100 - 80GB

Package	Version
torch	2.3.0
torchmetrics	1.4.0.post0
torchvision	0.18.0
nvidia-cublas-cu12	12.1.3.1
nvidia-cuda-cupti-cu12	12.1.105
nvidia-cuda-nvrtc-cu12	12.1.105
nvidia-cuda-runtime-cu12	12.1.105
nvidia-cudnn-cu12	8.9.2.26
nvidia-cufft-cu12	11.0.2.54
nvidia-curand-cu12	10.3.2.106
nvidia-cusolver-cu12	11.4.5.107
nvidia-cusparse-cu12	12.1.0.106
nvidia-ml-py	12.550.52
nvidia-nccl-cu12	2.20.5
nvidia-nvjitlink-cu12	12.5.40
nvidia-nvtx-cu12	12.1.105
numpy	1.26.4
vllm	0.4.2
vllm-nccl-cu12	2.18.1.0.4.0
sentence-transformers	2.7.0
transformers	4.41.0

binarycrayon · 2024-05-23T16:17:59Z

subscribed, thanks for the bug report

ckgresla · 2024-05-23T18:50:02Z

+1

s-natsubori · 2024-05-24T01:51:02Z

+1 same issue

Yueeeeeeee · 2024-05-24T05:20:47Z

+1 same issue

robertgshaw2-neuralmagic · 2024-05-24T11:26:15Z

Fixed by #5005

yananchen1989 · 2024-08-18T15:45:29Z

@robertgshaw2-neuralmagic
I still face this issue with vLLM version 0.5.4:

from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", dtype="float16", max_model_len=4000,
          tensor_parallel_size=1, gpu_memory_utilization=1, enforce_eager=True,
          quantization="bitsandbytes", load_format="bitsandbytes")

yananchen1989 · 2024-08-21T03:28:29Z

WARNING 08-20 23:25:13 config.py:1454] Casting torch.bfloat16 to torch.float16.
WARNING 08-20 23:25:13 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-20 23:25:13 config.py:1342] bitsandbytes quantization is not tested with LoRA yet.
INFO 08-20 23:25:13 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████| 141k/141k [00:00<00:00, 12.8MB/s]
INFO 08-20 23:25:15 model_runner.py:720] Starting to load model mistralai/Mistral-7B-Instruct-v0.3...
INFO 08-20 23:25:15 loader.py:871] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 08-20 23:25:15 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:05, 1.76s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:03, 1.80s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/moa/test_add.py", line 16, in <module>
[rank0]: llm = LLM(model= args.llm_name, dtype='float16', max_model_len=4000,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 158, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 445, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]: self._init_executor()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 946, in load_model
[rank0]: self._load_weights(model_config, model)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 885, in _load_weights
[rank0]: model.load_weights(qweight_iterator)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 513, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'layers.0.attention.wk.weight'
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:03, 1.92s/it]

C3po-D2rd2 · 2024-08-21T12:47:37Z

Same here!

@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-21 12:34:00 model_runner.py:720] Starting to load model /home/barbatus/finetuning/models...
INFO 08-21 12:34:00 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-21 12:34:00 selector.py:54] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File ".local/bin/vllm", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: File ".local/lib/python3.10/site-packages/vllm/scripts.py", line 149, in main
[rank0]: args.dispatch_function(args)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/scripts.py", line 30, in serve
[rank0]: asyncio.run(run_server(args))
[rank0]: File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]: return loop.run_until_complete(main)
[rank0]: File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]: return future.result()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 342, in run_server
[rank0]: async with build_async_engine_client(args) as async_engine_client:
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 102, in build_async_engine_client
[rank0]: async_engine_client = AsyncLLMEngine.from_engine_args(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
[rank0]: engine = cls(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File ".local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]: self._init_executor()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File ".local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File ".local/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File ".local/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
[rank0]: model.load_weights(
[rank0]: File ".local/lib/python3.10/site-packages/vllm/model_executor/models/llama_embedding.py", line 84, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'layers.0.attention.wk.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]

C3po-D2rd2 · 2024-08-21T14:34:02Z

@mgoin I think the issue is not solved in the current vllm version (0.5.4)

mgoin · 2024-08-21T16:12:08Z

The issues you are reporting are likely due to other arguments like the bitsandbytes quantization

I just ran the model on 0.5.4 and on main with default arguments and it loaded fine:

vllm serve mistralai/Mistral-7B-Instruct-v0.3

INFO 08-21 16:11:14 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 16:11:14 api_server.py:340] args: Namespace(model_tag='mistralai/Mistral-7B-Instruct-v0.3', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='mistralai/Mistral-7B-Instruct-v0.3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, 
num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fb01e0ec820>)
WARNING 08-21 16:11:15 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 16:11:15 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 16:11:16 model_runner.py:720] Starting to load model mistralai/Mistral-7B-Instruct-v0.3...
INFO 08-21 16:11:16 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.67it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.28it/s]

INFO 08-21 16:11:19 model_runner.py:732] Loading model weights took 13.5083 GB
INFO 08-21 16:11:23 gpu_executor.py:102] # GPU blocks: 27438, # CPU blocks: 2048
INFO 08-21 16:11:24 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-21 16:11:24 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-21 16:11:35 model_runner.py:1225] Graph capturing finished in 11 secs.
WARNING 08-21 16:11:35 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-21 16:11:35 launcher.py:14] Available routes are:
INFO 08-21 16:11:35 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-21 16:11:35 launcher.py:22] Route: /health, Methods: GET
INFO 08-21 16:11:35 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-21 16:11:35 launcher.py:22] Route: /version, Methods: GET
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-21 16:11:35 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [2887700]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

yananchen1989 · 2024-08-21T16:19:22Z

Yes, without these arguments (--quantization bitsandbytes --load_format bitsandbytes --enforce_eager), mistralai/Mistral-7B-Instruct-v0.3 works fine.

This issue only applies to Mistral; Llama-3.1 is not affected.

robertgshaw2-neuralmagic · 2024-08-21T16:22:28Z

@yananchen1989

Could you investigate and submit a fix? This is likely because this Mistral model ships multiple copies of the checkpoint with slightly different state dicts, which seems to interact poorly with bnb (bitsandbytes).
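
To picture the suspected conflict: the repository appears to ship the weights twice, once as Hugging Face-format shards and once as a single Mistral-native `consolidated.safetensors`, which would be consistent with the logs above showing four shards under bitsandbytes and three with default arguments. The sketch below is a hedged illustration of excluding the native copy by filename before loading; `select_hf_shards` and the `consolidated*` pattern are assumptions made for illustration, not the code of any actual fix.

```python
from fnmatch import fnmatch


def select_hf_shards(filenames, ignore_patterns=("consolidated*",)):
    """Keep *.safetensors files whose names match none of ignore_patterns."""
    return [
        name
        for name in filenames
        if fnmatch(name, "*.safetensors")
        and not any(fnmatch(name, pat) for pat in ignore_patterns)
    ]


# File listing assumed for illustration.
files = [
    "consolidated.safetensors",
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "params.json",
]
hf_shards = select_hf_shards(files)
print(len(hf_shards))  # 3
```

Filtering by name keeps the check cheap, at the cost of relying on a naming convention rather than on the tensor names inside each file.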

C3po-D2rd2 · 2024-08-21T16:23:34Z

I am not sure I understand, because I never set those arguments anywhere. I am launching it the same way as you. Maybe my issue comes from the config.json: I used the params.json delivered with the model and had to add this:
"model_type": "mistral",
"architectures": ["MistralModel"]

robertgshaw2-neuralmagic · 2024-08-21T16:25:03Z

I am not sure I understand, because I never set those arguments anywhere. I am launching it the same way as you. Maybe my issue comes from the config.json: I used the params.json delivered with the model and had to add this: "model_type": "mistral", "architectures": ["MistralModel"]

Post your config.json?

C3po-D2rd2 · 2024-08-21T16:27:45Z

{
    "dim": 4096,
    "n_layers": 32,
    "head_dim": 128,
    "hidden_dim": 14336,
    "n_heads": 32,
    "n_kv_heads": 8,
    "norm_eps": 1e-05,
    "vocab_size": 32768,
    "rope_theta": 1000000.0,
    "model_type": "mistral",
    "architectures": ["MistralModel"]
}

robertgshaw2-neuralmagic · 2024-08-21T16:32:14Z

This is not a valid HF transformers Config

I am not sure what the params.json is in the checkpoint, but vLLM supports the official transformers config.json only
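
A quick sanity check can tell the two kinds of file apart. The sketch below assumes that a Hugging Face Mistral `config.json` uses fields such as `hidden_size` and `num_hidden_layers`, while the `params.json` posted above uses `dim` and `n_layers`; the function name and the chosen key sets are illustrative, not a vLLM or transformers API.

```python
import json

# Field names assumed for illustration; a real config has more fields.
HF_KEYS = {"hidden_size", "num_hidden_layers", "num_attention_heads"}
NATIVE_KEYS = {"dim", "n_layers", "n_heads"}


def config_flavor(config):
    """Classify a parsed config dict as 'hf', 'mistral-native', or 'unknown'."""
    keys = set(config)
    if HF_KEYS <= keys:
        return "hf"
    if NATIVE_KEYS <= keys:
        return "mistral-native"
    return "unknown"


# Abbreviated copy of the config posted above.
posted = json.loads("""
{
    "dim": 4096,
    "n_layers": 32,
    "n_heads": 32,
    "n_kv_heads": 8,
    "model_type": "mistral",
    "architectures": ["MistralModel"]
}
""")
print(config_flavor(posted))  # mistral-native
```

Note that swapping in a valid config.json would not by itself rename the tensors in the weight files; a directory that holds only the native checkpoint may still expose native tensor names.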

C3po-D2rd2 · 2024-08-21T16:34:42Z

How can I find an example?

robertgshaw2-neuralmagic · 2024-08-21T16:37:07Z

How can I find an example?

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/blob/main/config.json

C3po-D2rd2 · 2024-08-21T16:47:17Z

Thanks a lot!
I am using it now and I still have the same issue. Here is the full trace:

$ vllm serve ~/finetuning/models/7B
INFO 08-21 16:45:21 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 16:45:21 api_server.py:340] args: Namespace(model_tag='/home/barbatus/finetuning/models/7B', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/barbatus/finetuning/models/7B', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, 
num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fe875ef9510>)
WARNING 08-21 16:45:21 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 16:45:21 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/barbatus/finetuning/models/7B', speculative_config=None, tokenizer='/home/barbatus/finetuning/models/7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/barbatus/finetuning/models/7B, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 08-21 16:45:21 model_runner.py:720] Starting to load model /home/barbatus/finetuning/models/7B...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
self.model_executor = executor_class(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
model.load_weights(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 513, in load_weights
param = params_dict[name]
KeyError: 'layers.0.attention.wk.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]

C3po-D2rd2 · 2024-08-21T16:50:11Z

I had to rename the tokenizer from tokenizer.model.v3 to tokenizer.model; otherwise it is not found.
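
For anyone repeating the rename step, here is a minimal sketch in Python. The helper name `ensure_tokenizer_model` is hypothetical, and it copies rather than renames so the original `tokenizer.model.v3` stays in place; that choice is an assumption, not something vLLM requires.

```python
import shutil
from pathlib import Path


def ensure_tokenizer_model(model_dir, versioned_name="tokenizer.model.v3"):
    """Make sure <model_dir>/tokenizer.model exists (hypothetical helper).

    If tokenizer.model is missing, copy the versioned file to that name.
    The versioned file is left untouched.
    """
    model_dir = Path(model_dir)
    src = model_dir / versioned_name
    dst = model_dir / "tokenizer.model"
    if dst.exists():
        return dst  # nothing to do
    if not src.exists():
        raise FileNotFoundError(f"{src} not found")
    shutil.copyfile(src, dst)
    return dst
```

Usage would be something like `ensure_tokenizer_model("/path/to/models/7B")` before starting the server.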

robertgshaw2-neuralmagic · 2024-08-21T16:52:45Z

The original checkpoint had two copies:

- one with merged qkv and merged gate_up
- one with unmerged weights

vLLM supports loading the unmerged one. It seems like your checkpoint has the merged weights.

I’m not sure how you saved or made this checkpoint, but it seems like you’re not saving it in the Hugging Face format via save_pretrained().
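
One way to check which naming scheme a checkpoint uses is to list its tensor names. The sketch below reads only the JSON header of a `.safetensors` file (an 8-byte little-endian length followed by that many bytes of JSON), so it needs nothing beyond the standard library. The function names are hypothetical, and the classification is a heuristic assumption: it treats keys like the one in the KeyError above (`layers.0.attention.wk.weight`) as the Mistral-native layout and keys like `model.layers.0.self_attn.k_proj.weight` as the Hugging Face layout.

```python
import json
import struct


def read_safetensors_keys(path):
    """Return the tensor names stored in a .safetensors file.

    Only the JSON header is parsed; no tensor data is loaded.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def guess_checkpoint_layout(keys):
    """Heuristic (assumption): classify a checkpoint by its key naming."""
    keys = list(keys)
    if any(k.startswith("model.layers.") and ".self_attn." in k for k in keys):
        return "huggingface"
    if any(k.startswith("layers.") and ".attention.w" in k for k in keys):
        return "mistral-native"
    return "unknown"
```

Running `guess_checkpoint_layout(read_safetensors_keys(path))` on each shard in the model directory should show whether the keys match what the loader in the traceback looks up.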

C3po-D2rd2 · 2024-08-21T16:57:40Z

I am using the tarball from Mistral, which I found here: https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar
I used it to run some fine-tuning, but that does not modify the base model (the LoRA checkpoint is in another directory), so might this be an issue with Mistral's release directly?

robertgshaw2-neuralmagic · 2024-08-21T17:00:32Z

> I am using the tarball from Mistral, which I found here: https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar
> I used it to run some fine-tuning, but that does not modify the base model (the LoRA checkpoint is in another directory), so might this be an issue with Mistral's release directly?

Not sure - I don’t know anything about how you did the fine-tuning. Either way, you need to save the model in the Hugging Face format with unfused linear layers to use it with vLLM.

C3po-D2rd2 · 2024-08-22T15:24:19Z

Thank you for your help. I finally downloaded everything from Hugging Face and was able to make it work. I will try to fine-tune from there and see.

timbmg added the bug label May 22, 2024

ShukantPal mentioned this issue May 24, 2024

[Bugfix] Fix Mistral v0.3 Weight Loading #5005

Merged

mgoin closed this as completed May 24, 2024
