[CPU] Support Intel CPU inference #3041
Conversation
Signed-off-by: Cao, Zhong Z <[email protected]>
…use it contains function call with side-effect
@delock, thanks for the discussion. Perhaps we should handle datatype support and OP support differently in the UTs.
In other words, going back to your original option 1 and 2 proposals: we shall use option 1 for datatypes and option 2 for training/inference. What do you think?
Sounds good. We will add the new interface and enable more UTs in a separate pull request.
This still seems to be failing in CI. Are you running the new release with the AVX2 instruction set detection fix yet? I believe this is the remaining issue blocking the merge of this PR.
We will have a new release of Intel Extension for PyTorch with the AVX2 detection fix very soon, and we will update the install link to fix the CI workflow for CPU.
Intel Extension for PyTorch has been updated to the latest version and the AVX2 detection issue has been resolved.
Summary:
This PR adds Intel CPU support to DeepSpeed by extending the DeepSpeedAccelerator interface. It allows users to run LLM inference on Intel CPUs with Auto Tensor Parallelism or kernel injection (kernel injection is currently experimental and only supports BLOOM); a usage sketch follows the feature list below. This PR comes with the following features:
- Reduced peak memory usage during the AutoTP model loading stage. (Also in separate PR "add sharded checkpoint loading for AutoTP path to reduce the peak mem…" #3102)
- BF16 inference datatype support.
- LLaMA auto tensor parallelism. (Also in PR "Enable auto TP policy for llama model" #3170)
- Efficient distributed inference on hosts with multiple NUMA clusters, including multiple CPU sockets or sub-NUMA clusters (SNC). (Also in separate PR "[CPU support] Optionally bind each rank to different cores on host" #2881)
- BLOOM AutoTP and/or tensor parallelism through a user-specified policy. (Also in PR "Enable autoTP for bloom" #3035)

Note: this PR keeps all the pieces for testing. The other PRs mentioned here are meant to be reviewed separately to allow small steps of work.
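As a rough illustration of the workflow described above, here is a minimal sketch of BF16 AutoTP inference on CPU through the standard `deepspeed.init_inference` entry point. The model name, environment variable, and argument values are illustrative assumptions, not code taken from this PR.

```python
# Minimal sketch of BF16 AutoTP inference on CPU (illustrative assumptions,
# not code from this PR).
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Auto Tensor Parallelism: no kernel injection, layers are sharded automatically
# across the launched ranks (e.g. one rank per socket or sub-NUMA cluster).
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Kernel injection would instead set `replace_with_kernel_inject=True`, which this PR describes as experimental and BLOOM-only on CPU.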
Major components:
CPU Inference example
This section is a work in progress.
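While the example is being written, the following is a minimal sketch of device-agnostic setup through the DeepSpeedAccelerator abstraction that this PR extends. The values noted in the comments are expectations for a CPU host, not output taken from the PR.

```python
# Minimal sketch: query the accelerator abstraction instead of hard-coding "cuda".
# The expected values in the comments are assumptions for a CPU host.
import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())                 # expected "cpu" on an Intel CPU host
print(acc.communication_backend_name())  # expected "ccl" (oneCCL) instead of "nccl"

# Allocate tensors on whatever device the accelerator reports.
device = torch.device(acc.current_device_name())
x = torch.ones(4, 4, dtype=torch.bfloat16, device=device)
print(x.device)
```

Multi-rank launches across sockets or sub-NUMA clusters would rely on the core-binding option introduced in PR #2881; the exact launcher flag is defined there, not in this PR.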