Your current environment

The output of `python collect_env.py`

Model Input Dumps

No response

🐛 Describe the bug

I tested the following script on the CPU backend, setting num_scheduler_steps > 1 to force multi-step scheduling:
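The script itself is not included in the report; the snippet below is a minimal, hypothetical sketch of what such a run could look like, based only on the engine configuration printed in the log (facebook/opt-125M, max_seq_len=1024, num_scheduler_steps=8, seed=0). The prompt and sampling parameters are illustrative, not taken from the issue.

```python
# Hypothetical repro sketch (not the reporter's actual script): an offline
# vLLM run on the CPU backend with num_scheduler_steps > 1, mirroring the
# engine config shown in the log (facebook/opt-125M, max len 1024, 8 steps).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125M",
    max_model_len=1024,
    num_scheduler_steps=8,  # > 1 is supposed to enable multi-step scheduling
    seed=0,
)

# Prompt and sampling parameters are placeholders, not from the issue.
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```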
Got the following output:

INFO 09-13 19:33:10 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
WARNING 09-13 19:33:13 arg_utils.py:902] Enabled BlockSpaceManagerV2 because it is required for multi-step (--num-scheduler-steps > 1)
WARNING 09-13 19:33:13 config.py:370] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-13 19:33:13 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='facebook/opt-125M', speculative_config=None, tokenizer='facebook/opt-125M', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125M, use_v2_block_manager=True, num_scheduler_steps=8, enable_prefix_caching=False, use_async_output_proc=False)
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING 09-13 19:33:14 cpu_executor.py:321] float16 is not supported on CPU, casting to bfloat16.
WARNING 09-13 19:33:14 cpu_executor.py:324] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 09-13 19:33:14 cpu_executor.py:350] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 09-13 19:33:14 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 09-13 19:33:14 selector.py:128] Using Torch SDPA backend.
INFO 09-13 19:33:15 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 09-13 19:33:15 selector.py:128] Using Torch SDPA backend.
INFO 09-13 19:33:15 weight_utils.py:235] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py:417: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.83it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.82it/s]
INFO 09-13 19:33:16 cpu_executor.py:208] # CPU blocks: 7281
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.86s/it, est. speed input: 1.55 toks/s, output: 4.14 toks/s]
However, according to #8198, prompt logs cause vLLM to crash when using multi-step scheduling, which does not happen in the log above. This was admittedly a crude way to check whether the feature is actually active. Moreover, looking at the code, the CPU backend has several dedicated classes that form a parallel implementation, and they do not appear to use the multi-step scheduling parameters. There is also no warning in the log to inform the user that the feature is not working.
Expectation
Add a check in the code that raises an exception or a warning to inform the user that the feature is not supported.
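A minimal sketch of what such a guard could look like, as a standalone illustration rather than vLLM's actual internals (the function name and hook point are assumptions):

```python
# Illustrative sketch only; the function name and hook point are hypothetical,
# not vLLM's real validation code.
import warnings


def check_multi_step_supported(device_type: str, num_scheduler_steps: int,
                               strict: bool = False) -> None:
    """Warn (or raise) when multi-step scheduling is requested on a backend
    that silently ignores it, such as the CPU backend."""
    if device_type == "cpu" and num_scheduler_steps > 1:
        msg = (f"num_scheduler_steps={num_scheduler_steps} was requested, but "
               "multi-step scheduling is not supported on the CPU backend and "
               "will be ignored.")
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)


# For the configuration in this report, this would emit a warning:
check_multi_step_supported("cpu", 8)
```

Whether this should warn or hard-fail is up to the maintainers; the point is simply that the mismatch is surfaced instead of being silently ignored.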
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.