[Feature] Prototype of vLLM execution on CPU-only devices #1028
Conversation
Hi, do you intend to expose this feature in `api_server.py`, from here?
Hi, is this feature going to be merged soon, or is there a problem preventing that? Please let me know the current status.
(force-pushed from 0286d57 to 960ef34)
@bigPYJ1151 This is great - tried running the benchmark w/ a Mistral model. I did notice that the number of CPU blocks is 2048.
@derekelewis Thanks for your attention! Mistral uses Sliding Window Attention, which we haven't adopted and verified, so removing the assert statement in the model code may let it run, but the results are not verified. The CPU cache size can be specified with `--swap-space`.
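(A rough, illustrative back-of-envelope of how `--swap-space` translates into CPU cache blocks; the model shape, block size, and dtype below are assumed values for a 7B Llama-style model, not numbers from this PR, and the real `CacheEngine` arithmetic may differ in detail.)

```python
# Hypothetical estimate: how many KV-cache blocks fit into the --swap-space
# allocation. All model dimensions here are assumptions for a 7B Llama-style
# model; vLLM's actual CacheEngine computation may differ.
def estimate_cpu_blocks(swap_space_gib: float,
                        num_layers: int = 32,
                        num_heads: int = 32,
                        head_dim: int = 128,
                        block_size: int = 16,          # tokens per cache block
                        dtype_bytes: int = 2) -> int:  # bf16/fp16
    # Each block holds keys and values for `block_size` tokens in every layer.
    bytes_per_block = 2 * block_size * num_layers * num_heads * head_dim * dtype_bytes
    return int(swap_space_gib * (1 << 30)) // bytes_per_block

print(estimate_cpu_blocks(4))    # ~512 blocks with a 4 GiB swap space
print(estimate_cpu_blocks(40))   # ~5120 blocks with --swap-space=40
```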
@bigPYJ1151 again, thanks for the contribution - that was helpful. Still, it's exciting that this works with llama-2 models. FYI, I did remove the assert and this is what I got: ...
(force-pushed from 8631864 to 874f364)
Another question I have: is this prototype just for benchmarking on a CPU-based device, or can we build your repo from source and perform actual inference using a Llama model? @bigPYJ1151
(force-pushed from 874f364 to 44f15f6)
@Deepansharora27 Thanks for your attention!
@bigPYJ1151 I'm hitting a compile error:
`/home/deepanshu/ai-tooling/vllm/csrc/cpu/cpu_types.hpp: In constructor ‘vec_op::FP32Vec8::FP32Vec8(__m128bh)’: ...`
note: This error originates from a subprocess, and is likely not a problem with pip.
@Deepansharora27 Seems it is due to the GCC version. This branch requires at least GCC-12.
@bigPYJ1151 Okay, let me see.
Seems like I already have gcc-12, @bigPYJ1151.
@Deepansharora27 Seems you also have ...
(force-pushed from 4eca588 to 8529fb7)
Hello, it'd be very nice to have a Docker image to run CPU vLLM for local development!
@sd3ntato Thanks for your attention! Actually, ...
(force-pushed from 8fe9250 to 8edf6b5)
Hi @kruna6111, it seems the source code is not compiled and the operation library is not generated. Please try again with the following:
@bigPYJ1151 Thanks for the help. I implemented the steps you mentioned and it solved the error; however, a new error has popped up. Just to confirm, ...
@kruna6111 It seems you didn't specify the device type as `cpu`.
Oh I see, @bigPYJ1151. Anyway, thanks for your help and support. Although, is there any specific reason why it needs specifically ...
@kruna6111 You can use ...
@bigPYJ1151 I am using the float32 datatype as a parameter in ...
@kruna6111 Yes, you should pass ...
Hey @bigPYJ1151, I implemented the steps mentioned; the issue was that there is a folder named ...
@kruna6111 Seems your `LLM` arguments should look something like this:

```python
llm = LLM(
    model=model,
    tokenizer=tokenizer,
    dtype="float32",
    enforce_eager=True,
    device="cpu",
    swap_space=swap_space,
    # ...
)
```
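(For reference, a fuller sketch of CPU-only offline inference along the same lines, assuming this PR's branch is installed; the model name, prompt, and sampling settings are placeholders, not values from the thread.)

```python
from vllm import LLM, SamplingParams

# Sketch only: "facebook/opt-125m" is a small placeholder model whose
# architecture (OPTForCausalLM) is in this PR's supported list.
llm = LLM(
    model="facebook/opt-125m",
    dtype="float32",      # fp32 on CPU, as discussed above
    enforce_eager=True,
    device="cpu",         # the new device argument introduced by this PR
    swap_space=4,         # CPU cache size in GiB
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```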
@bigPYJ1151 Thank you for your help, I am able to run inference on CPU with vLLM. You are a genius.
Review context (Dockerfile excerpt):

```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-cpu.txt

FROM vllm-base AS vllm
```
`vllm-base` is not available if only `cpu.Dockerfile` is built.
@bigPYJ1151 Apologies for the very x100 significant delay. Could you please update this PR for review? vLLM has undergone numerous changes since your last update. Also, I'm curious about the current status: is there any progress on TP and FP16 support?
Link to the new patch: #3634
Closing this PR as we merged #3634.
Hi, vLLM geniuses @WoosukKwon @zhuohan123. Motivated by some requirements to execute vLLM on the CPU (e.g., #176), we recently implemented an initial prototype for CPU-only execution on the x86 CPU platform.
What we have right now:
- A new argument `device` ('cuda' or 'cpu', 'cuda' by default) to specify the main device to execute vLLM.
- Changed hardcoded device placements (e.g., `.cuda()`) to `.to(device=device)`, or with the context `set_default_device`, to support vLLM execution on different device types (see the sketch after this list).
- Extended `CacheEngine` to allocate blocks from the CPU cache Tensor (used for swapping originally) under CPU-only mode. The size of the CPU cache can be specified with `--swap-space`.
- CPU kernels optionally optimized with the `AVX512_BF16` inst. set, enabled by the build flag `VLLM_BUILD_CPU_OPS`, which is disabled by default. Enabling it requires `gcc-12` and `g++-12` to support the `AVX512_BF16` inst. set.
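(To make the device-plumbing bullet above concrete, here is a minimal, toy illustration of the pattern with plain PyTorch tensors; it is not code from this PR.)

```python
import torch

device = "cpu"  # mirrors the new `device` argument ("cuda" or "cpu")

# Before: hardcoded GPU placement, e.g. torch.empty(16, 4096).cuda()

# After: explicit placement on the selected device...
x = torch.empty(16, 4096).to(device=device)

# ...or a default-device context, so everything inside allocates on `device`.
with torch.device(device):
    y = torch.empty(16, 4096)

torch.set_default_device(device)  # process-wide alternative
z = torch.empty(16, 4096)
assert x.device == y.device == z.device
```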
Install Instruction
- Make sure the version of `gcc`/`g++` is 12.
- Install CPU `PyTorch` with `pip install torch==2.1.2+cpu --index-url https://download.pytorch.org/whl/cpu`.
- Build and install vLLM from source: `VLLM_BUILD_CPU_ONLY=1 MAX_JOBS=8 pip install --no-build-isolation -v -e .` (a CPU-flag check is sketched after these steps).
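(Since the optimized ops mentioned above need `AVX512_BF16`, here is a quick Linux-only sketch for checking the CPU flag before building with `VLLM_BUILD_CPU_OPS`; the flag name is as reported in `/proc/cpuinfo`, and this helper is not part of the PR.)

```python
# Hypothetical helper: check /proc/cpuinfo (Linux) for the avx512_bf16 flag
# before building with VLLM_BUILD_CPU_OPS=1.
def has_avx512_bf16() -> bool:
    with open("/proc/cpuinfo") as f:
        return "avx512_bf16" in f.read()

if __name__ == "__main__":
    print("AVX512_BF16 supported:", has_avx512_bf16())
```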
Known Limits:
- Sliding window attention is not verified right now.

Model Support:
- Supports `LlamaForCausalLM`, `MistralForCausalLM`, and `OPTForCausalLM` related models currently (a quick architecture check is sketched below).
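(An optional way to check whether a checkpoint falls under one of these architectures before attempting CPU execution; this uses the Hugging Face `transformers` config API and is not part of the PR. The model name is just an example.)

```python
from transformers import AutoConfig

# Architectures this prototype currently covers (from the list above).
SUPPORTED = {"LlamaForCausalLM", "MistralForCausalLM", "OPTForCausalLM"}

def is_supported(model_name_or_path: str) -> bool:
    # A checkpoint's config.json lists its architecture class names.
    config = AutoConfig.from_pretrained(model_name_or_path)
    return any(arch in SUPPORTED for arch in (config.architectures or []))

print(is_supported("lmsys/vicuna-7b-v1.5"))  # Llama-based, so True
```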
Performance
We used the following commands to evaluate the performance with `vicuna-7b-v1.5` on an Intel(R) Xeon(R) CPU Max 9462 platform with 32 physical cores:

```bash
OMP_NUM_THREADS=32 numactl --physcpubind=0-31 --membind=0 python benchmark_throughput.py --backend=vllm --dataset=/root/ShareGPT_V3_unfiltered_cleaned_split.json --model=/root/vicuna-7b-v1.5/ --n=1 --num-prompts=1000 --dtype=bfloat16 --trust-remote-code --device=cpu --swap-space=40
```
The implementation achieved good throughput on the CPU platform:
- Throughput: 0.76 requests/s, 358.22 tokens/s
- Throughput: 1.00 requests/s, 479.15 tokens/s
The performance still has much room for improvement; we will keep optimizing it and adding the remaining features, hoping this is helpful for users who want to deploy vLLM on the CPU.
Please help review the code; any feedback is welcome. Thanks!