[CPU] Support Intel CPU inference #3041
Conversation
Signed-off-by: Cao, Zhong Z <[email protected]>
…use it contains function call with side-effect
@delock, thanks for the discussion. Perhaps we should handle datatype support and OP support differently in the UTs.
In other words, going back to your original option 1 and 2 proposals: we shall use option 1 for datatypes and option 2 for training/inference. What do you think?
Sounds good. We will add the new interface and enable more UTs in a separate pull request.
This still seems to be failing in CI. Are you running the new release with the AVX2 instruction set detection fix yet? I believe this is the remaining issue blocking the merge of this PR.
We will have a new release of Intel Extension for PyTorch with the AVX2 detection fix very soon, and we will update the install link to fix the CI workflow for CPU.
Intel Extension for PyTorch has been updated to the latest version and the AVX2 detection issue has been resolved.
Summary:
This PR adds Intel CPU support to DeepSpeed by extending the DeepSpeedAccelerator interface. It allows users to run LLM inference on Intel CPUs with Auto Tensor Parallelism or kernel injection (kernel injection is currently experimental and only supports BLOOM); a usage sketch follows the feature list below. This PR comes with the following features:
- Reduced peak memory usage during the AutoTP model loading stage. (Also in separate PR "add sharded checkpoint loading for AutoTP path to reduce the peak mem…" #3102)
- BF16 inference datatype support.
- LLaMA auto tensor parallelism. (Also in PR "Enable auto TP policy for llama model" #3170)
- Efficient distributed inference on hosts with multiple NUMA clusters, including multiple CPU sockets or sub-NUMA clusters (SNC). (Also in separate PR "[CPU support] Optionally bind each rank to different cores on host" #2881)
- BLOOM AutoTP and/or tensor parallelism through a user-specified policy. (Also in PR "Enable autoTP for bloom" #3035)

Note: this PR keeps all the pieces for testing. The other PRs mentioned here are meant to be reviewed separately to allow small steps of work.
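As a rough illustration of the workflow described above, here is a minimal sketch of BF16 AutoTP inference on CPU through the standard `deepspeed.init_inference` entry point. The model name, environment variable, and argument values are illustrative assumptions, not code taken from this PR.

```python
# Minimal sketch of BF16 AutoTP inference on CPU (illustrative assumptions,
# not code from this PR).
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Auto Tensor Parallelism: no kernel injection, layers are sharded automatically
# across the launched ranks (e.g. one rank per socket or sub-NUMA cluster).
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Kernel injection would instead set `replace_with_kernel_inject=True`, which this PR describes as experimental and BLOOM-only on CPU.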
Major components:
CPU Inference example
This section is a work in progress.
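While the example is being written, the following is a minimal sketch of device-agnostic setup through the DeepSpeedAccelerator abstraction that this PR extends. The values noted in the comments are expectations for a CPU host, not output taken from the PR.

```python
# Minimal sketch: query the accelerator abstraction instead of hard-coding "cuda".
# The expected values in the comments are assumptions for a CPU host.
import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())                 # expected "cpu" on an Intel CPU host
print(acc.communication_backend_name())  # expected "ccl" (oneCCL) instead of "nccl"

# Allocate tensors on whatever device the accelerator reports.
device = torch.device(acc.current_device_name())
x = torch.ones(4, 4, dtype=torch.bfloat16, device=device)
print(x.device)
```

Multi-rank launches across sockets or sub-NUMA clusters would rely on the core-binding option introduced in PR #2881; the exact launcher flag is defined there, not in this PR.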