
FLUX #11

Merged (46 commits into main, Aug 14, 2024)
Conversation

@atiorh (Contributor) commented Aug 3, 2024

FLUX is a great stress test for our MMDiT implementation, which was built and tested only for SD3. Here are the additional flexibilities we need to implement before DiffusionKit can load FLUX checkpoints and generate images:

  • token_level_text_embeddings: FLUX uses only google/t5-v1_1-xxl, while SD3 used CLIP L/14, G/14, and the same T5 version as FLUX.
  • pooled_text_embeddings: FLUX uses only openai/clip-vit-large-patch14 (L/14), while SD3 used CLIP L/14 + G/14.
  • RoPE embeddings: FLUX applies RoPE to queries and keys pre-SDPA, while SD3 used learned positional embeddings added as preprocessing.
  • config.depth == config.num_heads held true for SD3 but does not hold for FLUX. We need to unhardcode this assumption.
  • UniModalTransformerBlock, which processes a single merged sequence of text and image tokens, is used in FLUX in addition to MultiModalTransformerBlock, which processes image and text tokens heterogeneously. We need to implement this.
  • QKNorm, which applies RMSNorm to the projected queries and keys right before SDPA. This is present in FLUX but not in SD3 (see the sketch after this list).
  • Parallel FFN and Attention: FLUX runs these subblocks in parallel, while SD3 runs them serially.
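For concreteness, here is a minimal sketch of the QKNorm idea in MLX. Class and variable names are illustrative assumptions, not DiffusionKit's actual implementation, and the RoPE application is elided:

```python
# Hypothetical sketch of QKNorm inside an attention block (MLX).
# Names here are illustrative, not DiffusionKit's actual classes.
import mlx.core as mx
import mlx.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, dims: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dims // num_heads
        self.qkv_proj = nn.Linear(dims, 3 * dims)
        self.out_proj = nn.Linear(dims, dims)
        # QKNorm: per-head RMSNorm over the channel dim of queries and keys
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def __call__(self, x: mx.array) -> mx.array:
        B, L, D = x.shape
        q, k, v = mx.split(self.qkv_proj(x), 3, axis=-1)
        # [B, L, D] -> [B, num_heads, L, head_dim]
        q = q.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        k = k.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        v = v.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        # The QKNorm step: normalize q and k right before SDPA
        q, k = self.q_norm(q), self.k_norm(k)
        # (FLUX would apply RoPE to q and k here, pre-SDPA)
        o = mx.fast.scaled_dot_product_attention(
            q, k, v, scale=self.head_dim ** -0.5
        )
        return self.out_proj(o.transpose(0, 2, 1, 3).reshape(B, L, D))
```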

First-pass Optimizations:

@QueryType commented

With this change, will it be possible to load and run on an M2, 24GB unified system?

@atiorh (Contributor, Author) commented Aug 4, 2024

@QueryType Yes!

@mgierschdev commented Aug 5, 2024

Upvoting, since I am interested in running FLUX on Apple devices.

@atiorh (Contributor, Author) commented Aug 6, 2024

Almost done. @arda-argmax is doing the final variable-name mappings and correctness tests.

@mgierschdev commented

Waiting to test it :-)

@arda-argmax (Collaborator) commented Aug 10, 2024

Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:

diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16

We will clean up and merge by early next week. Thanks for your patience everyone 🙌

@atiorh (Contributor, Author) commented Aug 10, 2024

> we'll be able to run black-forest-labs/flux-dev too?

It is only tested with FLUX.1-schnell at the moment but shouldn't be too hard to test it for dev next.

@mgierschdev commented

> Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:
>
> diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16
>
> We will clean up and merge by early next week. Thanks for your patience everyone 🙌

Thanks 🙏

@mgierschdev commented

> we'll be able to run black-forest-labs/flux-dev too?
>
> It is only tested with FLUX.1-schnell at the moment but shouldn't be too hard to test it for dev next.

What flag do we need to pass in order to test with FLUX-dev?

@QueryType commented Aug 10, 2024

> Code is functional and outputs are correct using bfloat16. You can test it with a CLI command similar to this:
>
> diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16
>
> We will clean up and merge by early next week. Thanks for your patience everyone 🙌

First of all, thanks a million for the efforts. It works; here is the elusive flux-schnell cat.

[Image: flux_out.png, the generated cat]

I am running on a 24GB M2 Mac Mini. Here is the overall summary; the machine had to sweat it out!

[Screenshot: generation summary, 2024-08-10 7:29 PM]

Some comments:

  1. Would it be possible to quantize the model further than bfloat16?
  2. Would it be possible to specify local_ckpt paths not only for the model, but also for the t5_encoder and the vae?
  3. Would it be possible to load the quantized 8-bit version of T5?

It triggered 7 GB of swap, which we could avoid if we quantize the model and also reduce the memory used by T5.

Once again, thanks a million for the efforts.

@mgierschdev commented Aug 10, 2024

Working well on my side too. Any estimate for FLUX-dev?

@atiorh (Contributor, Author) commented Aug 11, 2024

> I am running on a 24GB M2 Mac Mini. Here is the overall summary; the machine had to sweat it out!

Thanks for testing @QueryType and @mgierschdev! The current commit uses bfloat16 and yields a peak memory usage of ~27GB for 768x1360 generation with FLUX.1-schnell. We are working on two improvements to pull this down to ~18 GB without compression. We are likely to share compressed variants after that if the community doesn't publish them by then.

Once we avoid hitting swap, I get 3-8 seconds/step (512x512-768x1360) on M3 Max.

@atiorh changed the title from "WIP: FLUX" to "FLUX" on Aug 11, 2024
@mark-lord commented

Getting 6-20 seconds/step (512x512 to 768x1360) on an M1 Max 64GB. Much, much faster than using the MPS implementation in ComfyUI! Much easier first-time setup too 😅

@mgierschdev commented

The only thing left, apart from the optimization fix, would be to support flux-dev. Can't wait to test even higher definition on my Mac.

@mark-lord commented

Not sure if this is a problem on the main repo as well, but changing the number of steps seems to have no effect:

diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 4 --shift 1 --cfg 0 --w16 --a16

diffusionkit-cli --prompt "a photo of a cat" --height 768 --width 1360 --seed 0 --output-path flux_out.png --model-size flux --step 8 --shift 1 --cfg 0 --w16 --a16

Both produced the exact same image in the exact same time, and only 4 iterations were reported. I might just be using the flags wrong, but thought I'd flag it up anyway (pun not intended).

@atiorh (Contributor, Author) commented Aug 13, 2024

Thanks @mark-lord for testing. The number of steps is currently hardcoded to FLUX.1-schnell's 4-step schedule. @arda-argmax is fixing these things before the final merge.
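For readers wondering why `--step 8` reproduced the 4-step output, the kind of hardcoding involved looks roughly like the hypothetical snippet below. Names and structure are illustrative only, not the actual DiffusionKit source:

```python
# Hypothetical sketch (not the actual DiffusionKit code): a model-specific
# default overrides the CLI value, since FLUX.1-schnell is distilled for a
# fixed 4-step schedule.
FLUX_SCHNELL_STEPS = 4

def resolve_num_steps(cli_steps: int, model_size: str) -> int:
    if model_size == "flux":
        # The CLI-provided --step value is ignored here; this is the
        # hardcoding being removed before the final merge.
        return FLUX_SCHNELL_STEPS
    return cli_steps
```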

> Getting 6-20 seconds/step (512x512 to 768x1360) on an M1 Max 64GB.

Interesting! Care to share the MPS baseline and the command to reproduce? For reference, M3 Max is 3-9 seconds/step for the same resolution range.

@atiorh merged commit 4c2bde3 into main on Aug 14, 2024 (1 check passed)
@QueryType commented

Thanks. I have tested the optimizations, and now it triggers 7 GB of swap (instead of the earlier 30 GB) on my 24GB M2 Mac Mini.
