Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649

Open
wants to merge 1 commit into
base: master

@AngryLoki AngryLoki commented Jun 4, 2024

Modern CPUs have native AVX512 BF16 instructions, which significantly speed up matmul and conv2d operations.

With bfloat16 instructions, UNet steps are 40-50% faster on both AMD and Intel CPUs.
bf16 causes minor visible changes in the output but no avalanche effects, so this feature is enabled by default through the new --use-cpu-bf16=auto option.
It can be disabled with --use-cpu-bf16=no.
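
For readers who want a feel for how an auto/no switch like this could be resolved, here is a minimal sketch. It is not the code from this patch: the helper names `cpu_has_bf16`, `build_parser` and `resolve_use_cpu_bf16` are hypothetical, and it assumes a Linux-style /proc/cpuinfo dump in which native BF16 support shows up as the `avx512_bf16` or `amx_bf16` flag.

```python
import argparse

# CPU flags assumed to indicate native bfloat16 support (Linux /proc/cpuinfo names).
BF16_FLAGS = {"avx512_bf16", "amx_bf16"}


def cpu_has_bf16(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line of a cpuinfo dump lists a BF16 flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.partition(":")[2].split())
            if flags & BF16_FLAGS:
                return True
    return False


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two values named in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-cpu-bf16", choices=["auto", "no"], default="auto")
    return parser


def resolve_use_cpu_bf16(option: str, cpuinfo_text: str) -> bool:
    """'no' always disables bf16; 'auto' enables it only on capable CPUs."""
    if option == "no":
        return False
    return cpu_has_bf16(cpuinfo_text)


if __name__ == "__main__":
    args = build_parser().parse_args()
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as handle:
            text = handle.read()
    except OSError:
        text = ""  # non-Linux system: treat as "no BF16 flag found"
    print("use bf16 on CPU:", resolve_use_cpu_bf16(args.use_cpu_bf16, text))
```

Keeping the detection as a pure function over the cpuinfo text makes the auto behaviour easy to check without access to a BF16-capable machine.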

With the following command, the KSampler node is almost 2 times faster, and memory usage is proportionally smaller. (Note: ComfyUI never mentions this, but setting the correct environment variables is highly important; see this page.)

LD_PRELOAD=libtrick.so:/src/oneapi/compiler/2024.0/lib/libiomp5.so:/usr/lib64/libtcmalloc.so \
KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 OMP_NUM_THREADS=16 \
numactl -C 0-15 -m 0 python main.py --cpu --bf16-vae --bf16-unet

--use-cpu-bf16=no - 1.68s/it
--use-cpu-bf16=auto - 1.22it/s
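
The two timings use different units (seconds per iteration versus iterations per second), so a small conversion helps when comparing them. The sketch below only redoes the arithmetic on the two numbers quoted here; the helper name `to_iterations_per_second` is made up for illustration.

```python
def to_iterations_per_second(value: float, unit: str) -> float:
    """Normalise a progress-bar rate to iterations per second."""
    if unit == "it/s":
        return value
    if unit == "s/it":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit!r}")


# Numbers quoted in this comment.
baseline = to_iterations_per_second(1.68, "s/it")  # --use-cpu-bf16=no
with_bf16 = to_iterations_per_second(1.22, "it/s")  # --use-cpu-bf16=auto

speedup = with_bf16 / baseline
print(f"baseline: {baseline:.3f} it/s, bf16: {with_bf16:.2f} it/s, speedup: {speedup:.2f}x")
```

On these figures the ratio comes out at about 2.05x, in line with the roughly 2x speedup described for the KSampler node.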

@AngryLoki AngryLoki requested a review from comfyanonymous as a code owner June 4, 2024 23:18
@simonlui simonlui (Contributor) left a comment

Going to chime in here since I did significant work on the XPU side of IPEX for ComfyUI. This patch basically turns on CPU mode for IPEX, doesn't it? I have been meaning to write a patch for something like this for a while, so thanks for doing the work to enable this. I had a few comments and nudges on things that could be improved, but nothing else looks terribly wrong, and I think this will improve everyone's experience with running the project. I am not sure, though, whether the bar to get that speed is low enough to make this a default option for people to try: IPEX requires at least AVX2 on the CPU in order to even work. I would also suggest changing the README to note that this is available. Hopefully, when @comfyanonymous is less busy with things, he can take a look at the PR.

comfy/model_management.py (4 review threads, outdated)
@AngryLoki

This comment was marked as outdated.

@mcmonkey4eva mcmonkey4eva added the Feature A new feature to add to ComfyUI. label Jun 28, 2024
@mcmonkey4eva mcmonkey4eva added the Needs Testing Please test this issue and report results label Jun 28, 2024
@AngryLoki

This comment was marked as resolved.

@AngryLoki AngryLoki marked this pull request as draft August 6, 2024 05:16
Modern CPUs have native AVX512 BF16 instructions, which significantly improve
matmul and conv2d operations.

With Bfloat16 instructions UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature
is enabled by default with new `--use-cpu-bf16=auto` option.
It can be disabled with `--use-cpu-bf16=no`.

Signed-off-by: Sv. Lockal <[email protected]>
@AngryLoki AngryLoki (Author) commented

While testing with Flux, I discovered a few interesting things:

  1. ipex allocates extra memory (even with weight_prepack=False), so with ipex, Flux plus the OS does not fit in 64GB.
  2. ipex focuses on models with a forward() method; for other models, most optimizations are not available.
  3. New pytorch builds can perform the heaviest bf16 ops on CPU (read: matmul) without ipex.

So I reworked the patch so that ipex-for-cpu is no longer required.

After checking with flux-schnell (which is already distributed in bf16 format):

  • Without patch: 54GB RAM, prompt executed in 242.54 seconds
  • With patch: 35GB RAM, prompt executed in 118.42 seconds
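
To put the two runs side by side, the snippet below works out the relative change from the figures above. It is plain arithmetic on the reported numbers, not a new measurement, and the variable names are chosen just for this illustration.

```python
# Reported flux-schnell results from this comment.
ram_without_gb, time_without_s = 54.0, 242.54
ram_with_gb, time_with_s = 35.0, 118.42

time_speedup = time_without_s / time_with_s
ram_reduction = (ram_without_gb - ram_with_gb) / ram_without_gb

print(f"prompt time: {time_speedup:.2f}x faster")
print(f"RAM usage: {ram_reduction:.1%} lower")
```

That is about a 2.05x faster prompt and roughly 35% less RAM with the patch applied.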

3 participants