
FLUX #11

Merged (46 commits into main, Aug 14, 2024)
Conversation

@atiorh (Contributor) commented Aug 3, 2024

FLUX is a great stress test for our MMDiT implementation, which was built and tested only for SD3. Here are the additional flexibilities we need to implement before DiffusionKit can load FLUX checkpoints and generate images:

  • token_level_text_embeddings: FLUX uses only google/t5-v1_1-xxl, while SD3 used CLIP L/14, G/14, and the same T5 version as FLUX.
  • pooled_text_embeddings: FLUX uses only openai/clip-vit-large-patch14 (L/14), while SD3 used CLIP L/14 + G/14.
  • RoPE embeddings: FLUX applies RoPE to queries and keys pre-SDPA, while SD3 used learned positional embeddings added as preprocessing.
  • config.depth == config.num_heads held true for SD3 but does not hold for FLUX. We need to unhardcode this assumption.
  • UniModalTransformerBlock, which processes a single merged sequence of text and image tokens, is used in FLUX in addition to MultiModalTransformerBlock, which processes image and text tokens heterogeneously. We need to implement this.
  • QKNorm, which applies RMSNorm to the projected queries and keys right before SDPA. This is present in FLUX but not in SD3 (see the sketch after this list).
  • Parallel FFN and Attention: FLUX runs these subblocks in parallel, while SD3 runs them serially.
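For concreteness, here is a minimal sketch of the QKNorm idea in MLX. Class and variable names are illustrative assumptions, not DiffusionKit's actual implementation, and the RoPE application is elided:

```python
# Hypothetical sketch of QKNorm inside an attention block (MLX).
# Names here are illustrative, not DiffusionKit's actual classes.
import mlx.core as mx
import mlx.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, dims: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dims // num_heads
        self.qkv_proj = nn.Linear(dims, 3 * dims)
        self.out_proj = nn.Linear(dims, dims)
        # QKNorm: per-head RMSNorm over the channel dim of queries and keys
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def __call__(self, x: mx.array) -> mx.array:
        B, L, D = x.shape
        q, k, v = mx.split(self.qkv_proj(x), 3, axis=-1)
        # [B, L, D] -> [B, num_heads, L, head_dim]
        q = q.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        k = k.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        v = v.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        # The QKNorm step: normalize q and k right before SDPA
        q, k = self.q_norm(q), self.k_norm(k)
        # (FLUX would apply RoPE to q and k here, pre-SDPA)
        o = mx.fast.scaled_dot_product_attention(
            q, k, v, scale=self.head_dim ** -0.5
        )
        return self.out_proj(o.transpose(0, 2, 1, 3).reshape(B, L, D))
```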

First-pass Optimizations:

@QueryType commented

With this change, will it be possible to load and run on an M2, 24GB unified system?

@atiorh (Contributor, Author) commented Aug 4, 2024

@QueryType Yes!

@mgierschdev commented Aug 5, 2024

Upvoting, since I am interested in running FLUX on Apple devices.

@atiorh (Contributor, Author) commented Aug 6, 2024

Almost done. @arda-argmax is doing the final variable-name mappings and correctness tests.

@mgierschdev commented

Waiting to test it :-)

@arda-argmax (Collaborator) commented Aug 10, 2024

Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:

diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16

We will clean up and merge by early next week. Thanks for your patience everyone 🙌

@atiorh (Contributor, Author) commented Aug 10, 2024

> we'll be able to run black-forest-labs/flux-dev too?

It is only tested with FLUX.1-schnell at the moment but shouldn't be too hard to test it for dev next.

@mgierschdev commented

> Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:
>
> diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16
>
> We will clean up and merge by early next week. Thanks for your patience everyone 🙌

Thanks 🙏

@mgierschdev commented

> we'll be able to run black-forest-labs/flux-dev too?
>
> It is only tested with FLUX.1-schnell at the moment but shouldn't be too hard to test it for dev next.

What flag do we need to pass in order to test with FLUX-dev?

@QueryType commented Aug 10, 2024

> Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:
>
> diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16
>
> We will clean up and merge by early next week. Thanks for your patience everyone 🙌

First of all, thanks a million for the efforts. It works; here is the elusive flux-schnell cat.

[Image: flux_out.png, the generated cat]

I am running on a 24GB M2 Mac Mini. Here is the overall summary; the machine had to sweat it out!

[Screenshot: generation summary, 2024-08-10 7:29 PM]

Some comments:

  1. Would it be possible to quantize the model further than bfloat16?
  2. Would it be possible to specify local_ckpt paths not only for the model, but also for the t5_encoder and the vae?
  3. Would it be possible to load the quantized 8-bit version of T5?

It triggered 7 GB of swap, which we could avoid if we quantize the model and also reduce the memory used by T5.

Once again, thanks a million for the efforts.

@mgierschdev commented Aug 10, 2024

Working well on my side too. Any estimate for FLUX-dev?

@atiorh (Contributor, Author) commented Aug 11, 2024

> I am running on a 24GB M2 Mac Mini. Here is the overall summary; the machine had to sweat it out!

Thanks for testing @QueryType and @mgierschdev! The current commit uses bfloat16 and yields a peak memory usage of ~27GB for 768x1360 generation with FLUX.1-schnell. We are working on two improvements to pull this down to ~18 GB without compression. We are likely to share compressed variants after that if the community doesn't publish them by then.

Once we avoid hitting swap, I get 3-8 seconds/step (512x512-768x1360) on M3 Max.

@atiorh changed the title from "WIP: FLUX" to "FLUX" on Aug 11, 2024
@mark-lord commented

Getting 6-20 seconds/step (512x512 to 768x1360) on an M1 Max 64GB. Much, much faster than using the MPS implementation in ComfyUI! Much easier first-time setup too 😅

@mgierschdev commented

The only thing left, apart from the optimization fix, would be to support flux-dev. Can't wait to test even higher definition on my Mac.

@mark-lord commented

Not sure if this is a problem on the main repo as well, but changing the number of steps seems to have no effect:

diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16

diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 8 --shift 1 --cfg 0 --w16 --a16

Both produced the exact same image in the exact same time, and only 4 iterations were reported. I might just be using the flags wrong, but thought I'd flag it up anyway (pun not intended).

@atiorh (Contributor, Author) commented Aug 13, 2024

Thanks @mark-lord for testing. The number of steps is currently hardcoded to FLUX.1-schnell's 4-step schedule. @arda-argmax is fixing these things before the final merge.
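For readers wondering why `--step 8` reproduced the 4-step output, the kind of hardcoding involved looks roughly like the hypothetical snippet below. Names and structure are illustrative only, not the actual DiffusionKit source:

```python
# Hypothetical sketch (not the actual DiffusionKit code): a model-specific
# default overrides the CLI value, since FLUX.1-schnell is distilled for a
# fixed 4-step schedule.
FLUX_SCHNELL_STEPS = 4

def resolve_num_steps(cli_steps: int, model_size: str) -> int:
    if model_size == "flux":
        # The CLI-provided --step value is ignored here; this is the
        # hardcoding being removed before the final merge.
        return FLUX_SCHNELL_STEPS
    return cli_steps
```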

> Getting 6-20 seconds/step (512x512 to 768x1360) on an M1 Max 64GB.

Interesting! Care to share the MPS baseline and the command to reproduce? For reference, M3 Max is 3-9 seconds/step for the same resolution range.

@atiorh merged commit 4c2bde3 into main on Aug 14, 2024 (1 check passed)
@QueryType commented

Thanks. I have tested the optimizations, and now it triggers 7 GB of swap (instead of the earlier 30 GB) on my 24GB M2 Mac Mini.
