FLUX #11
Conversation
With this change, will it be possible to load and run on an M2, 24GB unified system?
@QueryType Yes!
Upvoting, since I am interested in running FLUX on Apple devices
Almost done; @arda-argmax is doing final variable name mappings and correctness tests.
Waiting to test it :-)
Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:
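The command itself was not included in this comment; purely as an illustration, an invocation of this shape (using the flags that appear verbatim later in this thread, which may change before the merge) would look like:

```shell
diffusionkit-cli --prompt "a photo of a cat" \
  --height 768 --width 1360 --seed 0 \
  --model-size flux --step 4 --shift 1 --cfg 0 \
  --w16 --a16 --output-path flux_out.png
```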
We will clean up and merge by early next week. Thanks for your patience, everyone 🙌
It is only tested with FLUX.1-schnell at the moment but shouldn't be too hard to test it for dev next. |
Thanks 🙏
What flag do we need to pass in order to test with FLUX-dev?
Working well on my side too. Any estimate for FLUX-dev?
Thanks for testing @QueryType and @mgierschdev! The current commit uses bfloat16 and yields a peak memory usage of ~27GB for 768x1360 generation with FLUX.1-schnell. We are working on two improvements to pull this down to ~18GB without compression. We are likely to share compressed variants after that if the community doesn't publish them by then. Once we avoid hitting swap, I get 3-8 seconds/step (512x512-768x1360) on M3 Max.
Getting 6-20 seconds/step (512x512-768x1360) on an M1 Max 64GB. Much, much faster than using the MPS implementation in ComfyUI! Much easier first-time setup too 😅
The only thing left, apart from the optimization fix, would be FLUX-dev support. I cannot wait to test even higher resolutions on my Mac.
Not sure if this is a problem on the main repo as well, but changing the number of steps seems to have no effect:

```shell
diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16
diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 8 --shift 1 --cfg 0 --w16 --a16
```

Both produced the exact same image in the exact same time, and only 4 iterations were reported. I might just be using the flags wrong, but thought I'd flag it up anyway (pun not intended).
Thanks @mark-lord for testing. The steps are currently hardcoded for FLUX.1-schnell to be the 4 step schedule. @arda-argmax is fixing these things before the final merge.
Interesting! Care to share the MPS baseline and the command to reproduce? For reference, M3 Max is 3-9 seconds/step for the same resolution range.
Thanks. I have tested the optimizations and now it triggers a 7GB swap (instead of the earlier 30GB) on my 24GB M2 Mac Mini. |
FLUX is a great stress test for our MMDiT implementation, which was built and tested only for SD3. Here are the additional flexibilities we need to implement before DiffusionKit can load FLUX checkpoints and generate images:

- FLUX uses `google/t5-v1_1-xxl` while SD3 used CLIP L/14, G/14 and the same T5 version as FLUX.
- FLUX uses `openai/clip-vit-large-patch14` (L/14) only, while SD3 used CLIP L/14 + G/14.
- A `UniModalTransformerBlock` that processes a single sequence (text and image tokens merged) is used in FLUX in addition to the `MultiModalTransformerBlock` that processes image and text tokens heterogeneously. We need to implement this.
- `QKNorm`, which does RMSNorm right before query and key projections pre-sdpa. This is present in FLUX but not in SD3.

First-pass Optimizations:
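As a minimal sketch of the QKNorm idea described above (plain NumPy for readability, not DiffusionKit's actual MLX code; the real block also carries learned scale parameters and multiple attention heads, omitted here), queries and keys are RMS-normalized pre-sdpa:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: divide by the root-mean-square over the feature axis
    # (no mean-centering, unlike LayerNorm)
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sdpa_with_qknorm(q, k, v):
    # QKNorm: RMS-normalize queries and keys, then run
    # standard scaled dot-product attention.
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

# Toy shapes: (batch, tokens, head_dim)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((1, 4, 8)) for _ in range(3))
out = sdpa_with_qknorm(q, k, v)
print(out.shape)  # (1, 4, 8)
```

Normalizing q and k bounds the magnitude of the attention logits, which is why it helps training and numerical stability at low precision.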