Flux fp16 inference fix #9097

Merged: 10 commits merged into huggingface:main on Aug 7, 2024

Conversation

latentCall145 (Contributor)

What does this PR do?

Fixes #9096
Flux can now run inference with torch.half (instead of just torch.bfloat16), allowing faster inference on Turing GPUs. There are two spots where the pretrained weights overflow in fp16, and clipping the activations at those spots produces coherent images.
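
For illustration, here is a minimal sketch of the kind of clipping this refers to (a hypothetical helper, not the exact change in the diff): clamp fp16 activations to the finite half-precision range so they don't overflow to inf and turn into NaNs downstream.

import torch

def clip_fp16(hidden_states: torch.Tensor) -> torch.Tensor:
    # Clamp to the largest finite fp16 value (65504.0), only when running in half precision.
    if hidden_states.dtype == torch.float16:
        bound = torch.finfo(torch.float16).max
        hidden_states = hidden_states.clamp(-bound, bound)
    return hidden_states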

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul (Member) left a comment

Thanks. Left a comment.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@latentCall145 (Contributor Author) commented Aug 6, 2024

As this issue mentions, FP16 significantly changes the resulting images. Surprisingly, this has to do with the text encoders (and not the clipping). Specifically, some activations in the text encoders have to be clipped when running in FP16 (it's a dynamic-range problem, not a precision one). Forcing FP32 inference for the text encoders therefore lets FP16 DiT + VAE inference produce results similar to FP32/BF16.
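
A minimal sketch of this workaround (illustrative, not the exact recipe that ended up in the docs), assuming the standard FluxPipeline component names:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.float16,  # DiT + VAE run in fp16
)
# Keep the CLIP and T5 text encoders in fp32 to avoid the fp16 dynamic-range overflow.
pipe.text_encoder.to(torch.float32)
pipe.text_encoder_2.to(torch.float32)
pipe.enable_model_cpu_offload()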

Reproduction

from diffusers import FluxPipeline
import matplotlib.pyplot as plt
import torch

torch.backends.cudnn.benchmark = True

DTYPE = torch.float16  # dtype under test; also run with torch.float32 / torch.bfloat16

ckpt_id = "black-forest-labs/FLUX.1-schnell"
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.to(DTYPE)  # cast the whole pipeline to the dtype under test

images = pipe(
    'A laptop whose screen displays a picture of a black forest gateau cake spelling out the words "FLUX SCHNELL". The laptop screen, keyboard, and the table is on fire. no watermark, photograph',
    num_inference_steps=1,
    num_images_per_prompt=1,
    guidance_scale=0.0,
    height=1024,
    width=1024,
    generator=torch.Generator(device='cuda').manual_seed(0), # device='cpu' results in different random tensors across different dtypes?
).images

plt.imshow(images[0])
plt.show()

Prompt

A laptop whose screen displays a picture of a black forest gateau cake spelling out the words "FLUX SCHNELL". The laptop screen, keyboard, and the table is on fire. no watermark, photograph

Other

num_inference_steps = 1
height = width = 1024

Outputs (clipped)

[Image: concatenated outputs, left to right: fp32, bf16, fp16]

Outputs (clipped, fp32 text encoders)

[Image: concatenated outputs with fp32 text encoders, left to right: fp32, bf16, fp16]

@sayakpaul (Member)

Thank you for this investigation. Would you be able to put this analysis in the Flux documentation we have here? I believe this will be extremely valuable to the community. Cc: @DN6 @yiyixuxu

@sayakpaul (Member) left a comment

Thanks!

@latentCall145 (Contributor Author)

> Thank you for this investigation. Would you be able to put this analysis in the Flux documentation we have here? I believe this will be extremely valuable to the community

Sure. Which part of the investigation would you want in docs, just the difference between fp16 + bf16 inference and what causes it?
On a side note, should I also include an option to force fp32 inference for the text encoders when running the Flux pipeline in fp16?

@sayakpaul (Member)

> Sure. Which part of the investigation would you want in docs, just the difference between fp16 + bf16 inference and what causes it?

Apologies for not being clear. I think the investigation you presented in #9097 (comment) could be wrapped under a section in the Flux document with the heading "Running FP16 Inference".

> On a side note, should I also include an option to force fp32 inference for the text encoders when running the Flux pipeline in fp16?

As long as it's documented like we're discussing, it should be fine IMO. This way, users have all the information to fix the problems rather than us having to silently fix it for them.

@sayakpaul (Member) left a comment

Thanks! Just a single comment.

@sayakpaul (Member)

Ah, we now have a conflict to resolve. Sorry about that.

@sayakpaul sayakpaul merged commit 9b5180c into huggingface:main Aug 7, 2024
15 checks passed
@sayakpaul (Member)

Thank you!

@Roman-dem

Great job!
Is it possible to apply such an optimization for inference on two V100 GPUs?

@latentCall145 (Contributor Author) commented Oct 23, 2024

> Great job!
> Is it possible to apply such an optimization for inference on two V100 GPUs?

Diffusers has documentation on how to do distributed inference across multiple GPUs; this will probably work for you: https://huggingface.co/docs/diffusers/training/distributed_inference

There's even a section on Flux.1 inference (model sharding). That said, if you have 32 GB V100s, I don't think you'll need model sharding as long as you enable model CPU offloading, because Flux.1 can fit within 32 GB (although I don't know how offloading behaves with distributed inference).
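
For reference, a minimal single-GPU sketch of model CPU offloading (checkpoint and dtype are illustrative; see the fp16 text-encoder note earlier in this thread):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.text_encoder.to(torch.float32)   # fp32 text encoders, per the fp16 findings above
pipe.text_encoder_2.to(torch.float32)
# Submodules are moved onto the GPU one at a time, keeping peak VRAM lower
# than loading the whole pipeline onto the device at once.
pipe.enable_model_cpu_offload()
image = pipe("a photo of a forest").images[0]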

@Roman-dem

> Great job!
> Is it possible to apply such an optimization for inference on two V100 GPUs?
>
> Diffusers has documentation on how to do distributed inference across multiple GPUs; this will probably work for you: https://huggingface.co/docs/diffusers/training/distributed_inference
>
> There's even a section on Flux.1 inference (model sharding). That said, if you have 32 GB V100s, I don't think you'll need model sharding as long as you enable model CPU offloading, because Flux.1 can fit within 32 GB (although I don't know how offloading behaves with distributed inference).

In fact, the base Flux.1-dev does not fit entirely in 32 GB, and once the CPU gets involved (offloading), inference speed drops sharply. The problem is that after pipe.to(dtype) I can't move the model to the GPU; the reverse order doesn't work either.

sayakpaul added a commit that referenced this pull request Dec 23, 2024
* clipping for fp16

* fix typo

* added fp16 inference to docs

* fix docs typo

* include link for fp16 investigation

---------

Co-authored-by: Sayak Paul <[email protected]>
Successfully merging this pull request may close this issue: Flux inference with torch.half outputs NaN values (#9096)