
[CPU] Support Intel CPU inference #3041

Merged: 179 commits into microsoft:master on May 16, 2023

Conversation

@delock (Collaborator) commented Mar 17, 2023

Summary:

This PR adds Intel CPU support to DeepSpeed by extending the DeepSpeedAccelerator interface. It allows users to run LLM inference on Intel CPUs with Auto Tensor Parallelism or kernel injection (kernel injection is currently experimental and only supports BLOOM). This PR comes with the following features:

  1. CPUAccelerator
  2. Low-latency communication backend CCLBackend, which calls the oneCCL collective communication library directly instead of going through torch.distributed.
  3. Kernel injection fallback path to aid development of DeepSpeed accelerators, allowing the kernel injection path to function before kernels are fully developed.
  4. (Merged) Reduce peak memory usage during the AutoTP model loading stage. (Also in separate PR add sharded checkpoint loading for AutoTP path to reduce the peak mem… #3102)
  5. (Merged) BF16 inference datatype support.
  6. (Merged) LLAMA auto tensor parallelism. (Also in PR Enable auto TP policy for llama model #3170)
  7. (Merged) Efficient distributed inference on hosts with multiple CPU NUMA clusters, including multiple CPU sockets or sub-NUMA clusters (SNC); see the launch note after this list. (Also in separate PR [CPU support] Optionally bind each rank to different cores on host #2881)
  8. (Merged) BLOOM AutoTP and/or tensor parallelism through a user-specified policy. (Also in PR Enable autoTP for bloom #3035)

Note: this PR keeps all the pieces together for testing. The other PRs mentioned here are meant to be reviewed separately, so the work can proceed in small steps.
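As a concrete illustration of feature 7, a two-rank CPU inference run might be launched as `deepspeed --num_accelerators 2 --bind_cores_to_rank run_inference.py`; `--bind_cores_to_rank` comes from the core binding support in PR #2881, while the worker-count flag and the script name are illustrative assumptions.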

Major components:

  1. Core binding support -- changes under deepspeed/launcher
  2. CCLBackend -- changes under deepspeed/comm
  3. CPUAccelerator -- files from the following places:
  • accelerator/cpu_accelerator.py
  • op_builder/cpu/
  • csrc/cpu/
  4. Accelerator selection logic -- accelerator/real_accelerator.py (sketched below)
  5. Kernel injection fallback path -- changes under deepspeed/ops/transformer/inference/op_binding/
  6. bfloat16 inference support -- changes scattered around the codebase
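For item 4, a minimal sketch of the selection idea, assuming the class and module names implied by the file list above (the actual real_accelerator.py handles more cases): prefer CUDA when it is available, otherwise fall back to the new CPU accelerator.

```python
# Illustrative sketch of accelerator selection, not the actual
# real_accelerator.py: pick CUDA when available, else the CPU accelerator.
import torch

_accelerator = None  # cached so detection runs only once


def get_accelerator():
    global _accelerator
    if _accelerator is None:
        if torch.cuda.is_available():
            from deepspeed.accelerator.cuda_accelerator import CUDA_Accelerator
            _accelerator = CUDA_Accelerator()
        else:
            from deepspeed.accelerator.cpu_accelerator import CPU_Accelerator
            _accelerator = CPU_Accelerator()
    return _accelerator
```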

CPU Inference example

This section is a work in progress.
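Until the example is written, here is a hedged sketch of what CPU AutoTP inference could look like based on the features above; the model name and argument values are illustrative:

```python
# Illustrative only: BLOOM inference with Auto Tensor Parallelism in BF16.
# Run under the DeepSpeed launcher (see the launch note above) so that each
# rank is initialized and, with core binding, pinned to its own cores.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # any AutoTP-supported model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16)

# AutoTP path: replace_with_kernel_inject=False shards supported layers
# across ranks; dtype=torch.bfloat16 exercises the BF16 support in this PR.
model = deepspeed.init_inference(model,
                                 mp_size=2,  # tensor-parallel degree
                                 dtype=torch.bfloat16,
                                 replace_with_kernel_inject=False)

inputs = tokenizer("DeepSpeed is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```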

@tjruwase (Contributor) commented May 8, 2023

@delock, thanks for the discussion. Perhaps we should handle datatype support and OP support differently in the UTs.

  1. For datatype support, let us use the proposed abstract accelerator interface: def supported_dtypes(self) -> List[torch.dtype]
  2. For training/inference, let us use accelerator name to detect support.

In other words, going back to your original option 1 and 2 proposals: we shall use option 1 for datatypes and option 2 for training/inference. What do you think?
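A minimal sketch of what proposal 1 could look like, assuming the signature above; the surrounding scaffolding and the CPU return value (FP32 plus the BF16 support added in this PR) are assumptions:

```python
# Sketch of the proposed abstract accelerator interface; everything except
# the supported_dtypes() signature itself is illustrative scaffolding.
from abc import ABC, abstractmethod
from typing import List

import torch


class DeepSpeedAccelerator(ABC):
    @abstractmethod
    def supported_dtypes(self) -> List[torch.dtype]:
        """Datatypes this accelerator can run."""
        ...


class CPU_Accelerator(DeepSpeedAccelerator):
    def supported_dtypes(self) -> List[torch.dtype]:
        # Assumption: FP32 plus the BF16 inference support added in this PR.
        return [torch.float32, torch.bfloat16]
```

A UT could then skip datatypes an accelerator cannot run, e.g. `if dtype not in get_accelerator().supported_dtypes(): pytest.skip(...)`.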

@delock (Collaborator, Author) commented May 9, 2023

> @delock, thanks for the discussion. Perhaps we should handle datatype support and OP support differently in the UTs.
>
>   1. For datatype support, let us use the proposed abstract accelerator interface: def supported_dtypes(self) -> List[torch.dtype]
>   2. For training/inference, let us use accelerator name to detect support.
>
> In other words, going back to your original option 1 and 2 proposals: we shall use option 1 for datatypes and option 2 for training/inference. What do you think?

Sounds good. We will add the new interface and enable more UTs in a separate pull request.

@tjruwase (Contributor) commented
> The Intel Extension for PyTorch we used here is a hotfix for the AVX2 instruction set detection. In this particular run it seems the public wheel was downloaded from PyPI. We will have a new public release containing the AVX2 instruction set detection fix, after which the default installation of Intel Extension for PyTorch won't show this error.

> Currently we are using a hotfix for Intel Extension for PyTorch that gets around this issue by not relying on XCR, so Intel Extension for PyTorch could be more robust. However, it was not effective for this specific run; we will need some investigation to understand why.

This still seems to be failing in CI. Are you running the new release with the AVX2 instruction set detection fix yet?
https://github.com/microsoft/DeepSpeed/actions/runs/4936158586/jobs/8823357797?pr=3041

I believe this is the remaining issue blocking this PR from merging.

@delock (Collaborator, Author) commented May 12, 2023

We will have a new release of Intel Extension for PyTorch with the AVX2 detection fix very soon, and will update the install link to fix the CI workflow for CPU.
A reference to the issue:
intel/intel-extension-for-pytorch#326

> > The Intel Extension for PyTorch we used here is a hotfix for the AVX2 instruction set detection. In this particular run it seems the public wheel was downloaded from PyPI. We will have a new public release containing the AVX2 instruction set detection fix, after which the default installation of Intel Extension for PyTorch won't show this error.

> > Currently we are using a hotfix for Intel Extension for PyTorch that gets around this issue by not relying on XCR, so Intel Extension for PyTorch could be more robust. However, it was not effective for this specific run; we will need some investigation to understand why.

> This still seems to be failing in CI. Are you running the new release with the AVX2 instruction set detection fix yet? https://github.com/microsoft/DeepSpeed/actions/runs/4936158586/jobs/8823357797?pr=3041

> I believe this is the remaining issue blocking this PR from merging.

@delock (Collaborator, Author) commented May 15, 2023

Intel Extension for PyTorch has been updated to the latest version, and the AVX2 detection issue has been resolved.

> We will have a new release of Intel Extension for PyTorch with the AVX2 detection fix very soon, and will update the install link to fix the CI workflow for CPU. A reference to the issue: intel/intel-extension-for-pytorch#326


@tjruwase merged commit 1f72082 into microsoft:master on May 16, 2023