
flux does not work on MPS devices #9047

Closed

bghira opened this issue Aug 2, 2024 · 41 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

@bghira (Contributor) commented Aug 2, 2024

Describe the bug

    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
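For context, the restriction is easy to reproduce outside of diffusers; a minimal sketch, assuming a machine where the MPS backend is available:

import torch

assert torch.backends.mps.is_available()

# float32 arange works on the MPS backend...
print(torch.arange(0, 8, 2, dtype=torch.float32, device="mps"))

# ...but float64 raises the TypeError above, because MPS has no float64 support:
# torch.arange(0, 8, 2, dtype=torch.float64, device="mps")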

Reproduction

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16, revision='refs/pr/1')
#pipe.enable_model_cpu_offload()
pipe.to(device='mps')

prompt = "A cat holding a sign that says hello world"
out = pipe(
    prompt=prompt, 
    guidance_scale=0., 
    height=768, 
    width=1360, 
    num_inference_steps=4, 
    max_sequence_length=256,
).images[0]
out.save("image.png")

it also doesn't work with cpu offload.

Logs

scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

System Info

Git master

Who can help?

@sayakpaul

@bghira added the bug label Aug 2, 2024

@bghira (Contributor, Author) commented Aug 2, 2024

[image]
can't switch to fp32 😢

@bghira (Contributor, Author) commented Aug 2, 2024

@pcuenca i've been looking at workarounds and there's really nothing; this model is too big to run on CPU, it just never really completes the first step.

@sayakpaul (Member) commented:
No idea how to get around this problem :(

@bghira (Contributor, Author) commented Aug 2, 2024

it also doesn't work on ROCm, as the dimensions of the operations overflow the ROCm kernel limits, so it has to run layer-wise and takes about 2 minutes for one image

@bghira (Contributor, Author) commented Aug 2, 2024

maybe @rromb or @pesser have some ideas

@pcuenca (Member) commented Aug 2, 2024

Great investigation @bghira! It's a bit surprising that it degrades so much with float32. Also unfortunate:

RuntimeError: "arange_mps" not implemented for 'BFloat16'

@Vargol commented Aug 2, 2024

Has anyone tried running just the arange on the CPU for MPS, if it's supported, and pushing the results back to the GPU? @bghira, when you said Flux didn't run on the CPU, is that what you meant, or were you referring to running the whole model?

Never mind, I've had a proper look at the code now and can see that's a load of rubbish :-)

@mgierschdev commented:
In my case:

RuntimeError: MPS backend out of memory (MPS allocated: 81.54 GB, other allocations: 384.00 KB, max allowed: 81.60 GB). Tried to allocate 72.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Any idea how to run this on a Mac M2?
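A minimal sketch of the usual diffusers memory-reduction options; the environment variable is the one the error message itself suggests, while the offload and VAE tiling calls are standard diffusers APIs. Whether they are enough for Flux on an M2 is not confirmed in this thread, and bghira reported above that CPU offload did not work for him.

import os
# Suggested by the error message; removes the MPS allocation ceiling (may cause system instability).
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16, revision="refs/pr/1"
)
# Keep submodules on the CPU and move them to the MPS device only while they run.
pipe.enable_model_cpu_offload(device="mps")
# Decode the latents tile by tile to lower peak memory in the VAE.
pipe.vae.enable_tiling()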

@bghira (Contributor, Author) commented Aug 3, 2024

you cannot run it at all on any Mac series, but especially not the M1, as it doesn't even have bf16 in hardware

@Vargol commented Aug 4, 2024

> [image] can't switch to fp32 😢

Try it with torch 2.3.1

[image: flux_image]

@bghira (Contributor, Author) commented Aug 4, 2024

i am on 2.4 for bf16 and its fixes

@Vargol commented Aug 4, 2024

If only there were a way you could have two Python environments at once <joke>. But it shows that the noisy image from float32 isn't a fundamental issue with Flux.

@Vargol commented Aug 4, 2024

Interesting: if you monkey-patch the rope function to move pos to the CPU, you still get noise with 2.4.0 with both float32 and float64; the fp64 path runs fine on CPU with 2.3.1.

import torch
from diffusers import FluxPipeline
import diffusers

# Keep a reference to the original rope implementation.
_flux_rope = diffusers.models.transformers.transformer_flux.rope

def new_flux_rope(pos: torch.Tensor, dim: int, theta: int) -> torch.Tensor:
    assert dim % 2 == 0, "The dimension must be even."
    if pos.device.type == "mps":
        print("I got called")
        # Compute the rotary embedding on the CPU (where float64 is supported),
        # then move the result back to the MPS device.
        return _flux_rope(pos.to("cpu"), dim, theta).to(device=pos.device)
    else:
        print("I should not be called")
        return _flux_rope(pos, dim, theta)

diffusers.models.transformers.transformer_flux.rope = new_flux_rope

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", revision='refs/pr/1', torch_dtype=torch.bfloat16).to("mps")

prompt = "A cat holding a sign that says hello world"
out = pipe(
    prompt=prompt,
    guidance_scale=0.,
    height=1024,
    width=1024,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
out.save("flux_image.png")

@bghira (Contributor, Author) commented Aug 4, 2024

:/ the trainer i'm using requires torch 2.4, and the model is basically useless to Apple users without torch 2.4, as that is also required to run it at lower precision levels

@Vargol commented Aug 5, 2024

If you need 2.4.0 to work, raise an issue over on the PyTorch repo.

@bghira (Contributor, Author) commented Aug 5, 2024

then it will never be solved as the pytorch team often simply ignores mps issues. thanks for the suggestion.

@mgierschdev commented Aug 5, 2024

It will be supported by the WIP DiffusionKit (argmaxinc/DiffusionKit#11); upvote the issue if interested.

@bghira (Contributor, Author) commented Aug 5, 2024

that is totally different - they are making use of MLX, not Pytorch+MPS.

@mgierschdev commented Aug 5, 2024

You are correct, but it will ultimately run on Apple devices and, more importantly, efficiently.

@bghira (Contributor, Author) commented Aug 5, 2024

yes... but it has nothing to do with Diffusers, and it remains to be seen whether it works with the correct outputs on this platform - once it's in DiffusionKit, we still don't have any way to use it in the Diffusers pipeline ecosystem.

@bghira (Contributor, Author) commented Aug 5, 2024

CoreML may be efficient on the surface, but last I tried it only supported square images, and every custom model has to be converted manually, which will take hours for Flux.D or Flux.S - it already took hours to convert SDXL models on a 128G M3 Max. It's going to need a lot of system memory to quantise Flux using CoreML.

i just don't see it as a very useful thing - it's more like a toy

@Vargol commented Aug 6, 2024

On the ComfyUI equivalent issue, someone is suggesting that it works with the torch nightlies on the beta version of macOS 15.

comfyanonymous/ComfyUI#4165 (comment)

I'm not running it, so I can't confirm whether it's true or not. I do know there is at least some macOS 15 code in Torch (or was, assuming it hasn't been rolled back), so there is hope.

@AaronWard commented:
[image]

On an M3 MacBook Pro using the _flux_rope hack with bfloat16, the model is returning only grainy results.

Saw this PR on diffusers with a potential fix; haven't tried it myself yet.

@Vargol commented Aug 6, 2024

> [image]
>
> On an M3 MacBook Pro using the _flux_rope hack with bfloat16, the model is returning only grainy results.
>
> Saw this PR on diffusers with a potential fix; haven't tried it myself yet.

What version of PyTorch are you using?
If you read the rest of the issue you'll see that the noisy image is due to issues with PyTorch 2.4 on macOS 14.

@bghira (Contributor, Author) commented Aug 7, 2024

[image]

well the same issue occurs with pytorch nightly on macOS 14. i don't really think upgrading to a beta OS release is the way to resolve it, but that's good to know.

@Vargol commented Aug 7, 2024

@AaronWard I've tried the fp16 fix, and while it fixes running the model for inference on float16, it still gives a noisy image as the end result in torch 2.4.0.

It would be nice if someone with proper torch skills could get to the bottom of this, but I suspect we're going to have to wait for macOS 15.

@bghira (Contributor, Author) commented Aug 9, 2024

i've updated to macos 15:

[image]

@Vargol commented Aug 9, 2024

With a PyTorch nightly?

@bghira (Contributor, Author) commented Aug 9, 2024

yes

@bghira (Contributor, Author) commented Aug 9, 2024

also, pytorch 2.3.1 has about a 30% speed reduction vs 2.4.1 on MPS

cocktailpeanut added a commit to peanutcocktail/optimum-quanto that referenced this issue Aug 9, 2024
the decorator syntax added for quantize_symmetric and quantize_affine is only supported on torch 2.4+, but that version is not working for MPS, producing grainy images. Use the old syntax and allow lower versions of torch, so MPS users can install torch 2.3.1 and get models like FLUX working. huggingface/diffusers#9047
@hvaara (Contributor) commented Aug 15, 2024

float64 is not supported on MPS. #9133 proposes to fix that issue.

The noisy output image is a separate bug in PyTorch. Follow pytorch/pytorch#133520 for updates.
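The general direction of such a fix is to avoid float64 when the positions live on MPS; a minimal sketch of that idea follows (the helper name and return layout are illustrative, not the diffusers implementation or the exact change in #9133):

import torch

def rope_freqs_mps_safe(pos: torch.Tensor, dim: int, theta: int) -> torch.Tensor:
    # Illustrative helper, not the diffusers code.
    assert dim % 2 == 0, "The dimension must be even."
    # MPS has no float64 support, so compute the frequency table in float32 there.
    dtype = torch.float32 if pos.device.type == "mps" else torch.float64
    scale = torch.arange(0, dim, 2, dtype=dtype, device=pos.device) / dim
    omega = 1.0 / (theta ** scale)
    angles = pos.unsqueeze(-1).to(dtype) * omega  # (..., dim / 2) rotation angles
    # Return cos/sin pairs in float32, which every backend supports.
    return torch.stack([angles.cos(), angles.sin()], dim=-1).float()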

@bghira (Contributor, Author) commented Aug 15, 2024

is fp64 a hw limitation, or just an MPS limitation? i guess i could check the Metal docs...

@bghira (Contributor, Author) commented Aug 15, 2024

@Vargol commented Aug 22, 2024

Seems to be working with diffusers from git main and a torch nightly (from today, at least: torch-2.5.0.dev20240821) on macOS 14.6.1, without any hacks.

[image: flux_image]

Performance has tanked for me though, from 80 s/i to 130 s/i; that could be partly because I'm swapping a GB or two of memory once macOS has swapped out the T5 model.

@bghira (Contributor, Author) commented Aug 22, 2024

torch nightly has some real performance regressions since they refactored the sdpa backends for mps

@bauerwer commented:
agree, I am seeing a fairly big performance degradation on nightly torch as well (on MPS)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot added the stale label Sep 16, 2024
@sayakpaul (Member) commented:
We recently added support for this, so inference should at least work. Closing; feel free to reopen.

@hvaara (Contributor) commented Sep 16, 2024

xref #9133 #9074

@Vargol commented Sep 20, 2024

Release 0.30.3 seems to still have the old version of the Flux code, with the torch.float64 reference in the rope function. Was that expected?

(Diffusers) M3iMac:Diffusers davidburnett$ pip show Diffusers
Name: diffusers
Version: 0.30.3
Summary: State-of-the-art diffusion in PyTorch and JAX.
Home-page: https://github.com/huggingface/diffusers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/diffusers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, importlib-metadata, numpy, Pillow, regex, requests, safetensors
Required-by: 
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/models/transformers/transformer_flux.py", line 65, in <listcomp>
    [rope(ids[..., i], self.axes_dim[i], self.theta) for i in range(n_axes)],
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/models/transformers/transformer_flux.py", line 41, in rope
    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

@sayakpaul (Member) commented:
Cc: @yiyixuxu @a-r-r-o-w ^.
