Reduce StableDiffusion memory usage #147

josevalim · 2023-01-12T21:31:14Z

A list of ideas to explore:

- Lazy transfers (so we don't load all the data into the GPU at once)
- FP16 on load
- FP16 policies on Axon
- ~~Attention slicing~~ (no longer applicable, see "Remove attention slicing from docs", huggingface/diffusers#4487)
- ~~Flash attention (JAX version)~~ (see notes in "Refactor attention implementation", #300)
- DPM-Solver++ (more schedulers here, here, and in the comments below) (another PyTorch implementation)
- TokenMerging
- LCM+LoRA
- ~~DeepCache~~ (not applicable, see the comment further down in this issue, #147)

josevalim · 2023-10-13T21:58:47Z

More on attention: https://pytorch.org/blog/flash-decoding/

bfolkens · 2023-11-13T14:47:46Z

I'd also suggest FlashAttention-2 and Medusa

josevalim · 2023-11-29T01:35:59Z

Alternative to DPM Solver: https://arxiv.org/abs/2311.05556

josevalim · 2023-12-12T20:16:20Z

More notes on optimizations here:

jonatanklosko · 2023-12-19T09:00:25Z

I tested SD v1-4 on a GPU using the new lower precision options `params_variant: "fp16", type: :bf16`. Here are a couple of runs:

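| Type | Steps | Batch size, images per prompt | Time | Memory | Lazy transfers |
| --- | --- | --- | --- | --- | --- |
| bf16 | 20 | 1, 1 | 0.7s | 4669MiB | No |
| bf16 | 20 | 1, 4 | 2.2s | 8769MiB | No |
| f32 | 20 | 1, 1 | 1.3s | 8759MiB | No |
| f32 | 20 | 1, 4 | 4.3s | 16951MiB | No |
| bf16 | 20 | 1, 1 | 3.7s | 6957MiB | Yes |
| f32 | 20 | 1, 1 | 8.2s | 13379MiB | Yes |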

Note that the reported memory is just the final memory after using `preallocate: false`, so it's not entirely reliable. XLA even reserves memory at compilation time; my guess is that it runs some example operations to pick the preferable algorithm or fine-tune algorithm parameters. That said, it seems clear that bf16 reduces both memory and time by roughly a factor of 2. Weirdly, lazy transfers seem to bump the memory usage (but that doesn't mean that much memory is required in practice, it's just XLA bumping the reservation, see below).
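To make the "roughly a factor of 2" claim concrete, here is a small sketch that recomputes the f32-to-bf16 ratios from the non-lazy rows of the table above. The numbers are copied from the table; the map layout and variable names are only illustrative.

```elixir
# Measurements copied from the table above, keyed by {type, images per prompt}.
# Each value is {time in seconds, memory in MiB}. Lazy-transfer rows are left out.
runs = %{
  {:f32, 1} => {1.3, 8759},
  {:bf16, 1} => {0.7, 4669},
  {:f32, 4} => {4.3, 16951},
  {:bf16, 4} => {2.2, 8769}
}

ratios =
  for images <- [1, 4] do
    {f32_time, f32_mem} = runs[{:f32, images}]
    {bf16_time, bf16_mem} = runs[{:bf16, images}]

    %{
      images: images,
      # How many times slower f32 is than bf16.
      time_ratio: Float.round(f32_time / bf16_time, 2),
      # How many times more memory f32 reports than bf16.
      memory_ratio: Float.round(f32_mem / bf16_mem, 2)
    }
  end

IO.inspect(ratios)
```

Both ratios land a little under 2 for each configuration, which matches the reading above, keeping in mind that the memory figures are reservations rather than exact requirements.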

Source (first row)

# Stable Diffusion testing

```elixir
Mix.install([
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:axon, github: "elixir-nx/axon", override: true},
  {:kino, "~> 0.11.3"},
  {:bumblebee, github: "elixir-nx/bumblebee"}
])

Application.put_env(:exla, :clients,
  host: [platform: :host],
  cuda: [platform: :cuda, preallocate: false]
  # cuda: [platform: :cuda, memory_fraction: 0.3]
  # cuda: [platform: :cuda]
)

Application.put_env(:exla, :preferred_clients, [:cuda, :host])

Nx.global_default_backend({EXLA.Backend, client: :host})
```

## init

```elixir
with {output, 0} <- System.shell("nvidia-smi --query-gpu=memory.total,memory.used --format=csv") do
  IO.puts(output)
end
```

<!-- livebook:{"branch_parent_index":0} -->

## Stable Diffusion fp16

```elixir
repository_id = "CompVis/stable-diffusion-v1-4"

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    params_variant: "fp16",
    type: :bf16
  )

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_variant: "fp16",
    type: :bf16
  )

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_variant: "fp16",
    type: :bf16
  )

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

clip = update_in(clip.params, &Nx.backend_copy(&1, {EXLA.Backend, client: :cuda}))
unet = update_in(unet.params, &Nx.backend_copy(&1, {EXLA.Backend, client: :cuda}))
vae = update_in(vae.params, &Nx.backend_copy(&1, {EXLA.Backend, client: :cuda}))

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    num_images_per_prompt: 1,
    compile: [batch_size: 1, sequence_length: 60],
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, name: SD, serving: serving})
```

```elixir
prompt = "numbat, forest, high quality, detailed, digital art"

output = Nx.Serving.batched_run(SD, prompt)

for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)
```

jonatanklosko · 2023-12-19T10:51:49Z

I experimented with different values of memory_fraction as an upper limit. For the first entry in the table above:

- `lazy_transfers: :always` - 3GiB (3.4s)
- manual `backend_copy` - 4.6GiB (0.7s)
- `preallocate_params: true` - 6.2GiB (0.7s)

So lazy transfers do help a bit, but imply a significant slowdown.

What's interesting though is that preallocate_params requires more memory than manual backend_copy. It's even more surprising given that the OOM happens at serving runtime, not during the params preallocation.
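For reference, a rough sketch of how the three setups differ in configuration. The module and function names (`MemoryStrategies`, `exla_clients/1`, `serving_opts/1`) are hypothetical helpers made up for illustration; the option names themselves (`memory_fraction`, `lazy_transfers`, `preallocate_params`) are the ones discussed in this thread, and the remaining serving options are copied from the notebook above.

```elixir
defmodule MemoryStrategies do
  # EXLA client config with an upper bound on the GPU memory XLA may reserve.
  def exla_clients(memory_fraction) do
    [
      host: [platform: :host],
      cuda: [platform: :cuda, memory_fraction: memory_fraction]
    ]
  end

  # Options to pass to Bumblebee.Diffusion.StableDiffusion.text_to_image.
  # With lazy transfers, params stay on the host and are moved only when needed.
  def serving_opts(:lazy_transfers) do
    base_opts() ++ [defn_options: [compiler: EXLA, lazy_transfers: :always]]
  end

  # Assumes the params were already copied to the GPU with Nx.backend_copy/2
  # before building the serving, as in the notebook.
  def serving_opts(:manual_backend_copy) do
    base_opts() ++ [defn_options: [compiler: EXLA]]
  end

  # Lets the serving move the params to the device on startup.
  def serving_opts(:preallocate_params) do
    base_opts() ++ [preallocate_params: true, defn_options: [compiler: EXLA]]
  end

  defp base_opts do
    [num_steps: 20, num_images_per_prompt: 1, compile: [batch_size: 1, sequence_length: 60]]
  end
end

IO.inspect(MemoryStrategies.exla_clients(0.3))
IO.inspect(MemoryStrategies.serving_opts(:lazy_transfers))
```

The client config would go into `Application.put_env(:exla, :clients, ...)` as in the setup cell, lowering `memory_fraction` until the run hits OOM to find the limit for each strategy.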

josevalim · 2023-12-19T11:55:27Z

preallocate/jit will transfer the data twice: once as arguments and once as the return value. So we probably need a new callback/abstraction to make this easier :D

jonatanklosko · 2023-12-19T16:09:02Z

FTR fixed in #317, now preallocate_params: true effectively does backend_copy :)

josevalim · 2024-01-20T18:06:35Z

I have added an entry for LCM+LoRA. @wtedw may have input here (and we may need to update/release Axon first). /cc @seanmor5

seanmor5 · 2024-01-21T14:23:24Z

I think we should update Axon to better support LoRA. I have a draft in place right now, but I have to revisit it to make it work as I intend :)

wtedw · 2024-01-21T17:05:33Z

LCM just adapts these nodes in the unet model: https://github.com/wtedw/lorax/blob/main/lib/lorax/lcm.ex#L121-L139
The weights can be found here: https://huggingface.co/latent-consistency/lcm-lora-sdv1-5

For Bumblebee (if trying to make it compatible with most LoRA files on Hugging Face):

- We need to manually parse the LoRA file to infer how to adapt the model layers. This includes knowing which layers to inject, the LoRA rank, and the LoRA alpha. Unfortunately, LCM's HF page doesn't come with a "lora_config" file, but from a quick glance, some models come with an "adapter_config" file. Not sure how common this is though: https://huggingface.co/IlyaGusev/saiga_13b_lora/blob/main/adapter_config.json.
- The LCM LoRA was trained with something called Kohya: https://github.com/bmaltais/kohya_ss. It has this layer naming scheme: https://github.com/wtedw/lorax/blob/main/lib/lorax/lcm.ex#L147. I believe this is the most common trainer that's used.
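As a rough sketch of what "parsing the LoRA file" could look like, the snippet below groups flat checkpoint keys by layer and infers the rank and scaling from the shapes. It assumes Kohya-style key suffixes (`.lora_down.weight`, `.lora_up.weight`, `.alpha`), with `lora_down` shaped `{rank, in}` and `lora_up` shaped `{out, rank}`. The `LoraInspect` module, the sample key, and all the numbers are made up for illustration and are not taken from the LCM weights.

```elixir
defmodule LoraInspect do
  @suffixes [".lora_down.weight", ".lora_up.weight", ".alpha"]

  # `entries` maps a checkpoint key to a tensor shape (for weights) or a number (for alpha).
  # Returns a map from layer name to the inferred LoRA settings for that layer.
  def infer(entries) do
    entries
    |> Enum.group_by(fn {key, _value} -> layer_name(key) end)
    |> Map.new(fn {layer, pairs} ->
      fields =
        Map.new(pairs, fn {key, value} -> {String.replace_prefix(key, layer, ""), value} end)

      {rank, in_features} = Map.fetch!(fields, ".lora_down.weight")
      # The up projection must agree on the rank, otherwise the match fails.
      {out_features, ^rank} = Map.fetch!(fields, ".lora_up.weight")
      # Assumption: a missing alpha falls back to the rank, which gives a scale of 1.0.
      alpha = Map.get(fields, ".alpha", rank)

      {layer,
       %{rank: rank, alpha: alpha, scale: alpha / rank, in: in_features, out: out_features}}
    end)
  end

  # Strips the known suffix to recover the layer name; unknown keys are returned as-is.
  defp layer_name(key) do
    Enum.find_value(@suffixes, key, fn suffix ->
      if String.ends_with?(key, suffix), do: String.replace_suffix(key, suffix, "")
    end)
  end
end

# Hypothetical entries for a single attention projection.
sample = %{
  "lora_unet_example_attn1_to_q.lora_down.weight" => {4, 320},
  "lora_unet_example_attn1_to_q.lora_up.weight" => {320, 4},
  "lora_unet_example_attn1_to_q.alpha" => 2.0
}

IO.inspect(LoraInspect.infer(sample))
```

In the usual LoRA formulation the adapted weight is the original weight plus `scale` times the product of the up and down matrices, so the layer name, rank, and alpha are what is needed to inject the adapter. Mapping the Kohya layer names onto Axon layer names is a separate step not covered here.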

If you guys need any PRs, lmk!

josevalim · 2024-02-23T18:07:34Z

Just a heads up that Stability AI just announced Stable Diffusion 3, so that makes us wonder how much effort we should pour into SD vs SDXL vs SD3. It still probably makes sense to support LoRA on Stable Diffusion, because that will require improvements in Axon and elsewhere that we could use for other models, but custom schedulers and token merging are up for debate at the moment.

jonatanklosko · 2024-02-26T09:44:21Z

Checking off attention slicing: it has actually been removed from the diffusers docs (huggingface/diffusers#4487) because of flash attention. Either way, the trick is about slicing a dimension and using a while loop, which is similar to flash attention at the defn level (as opposed to a custom CUDA kernel), and that didn't turn out to be beneficial.
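For readers unfamiliar with the trick, here is a minimal plain-list sketch of the slicing idea. It is not the diffusers or Bumblebee implementation: it slices along the query dimension (an assumption made for illustration), uses ordinary Elixir lists instead of tensors, and the `SlicedAttention` module name is hypothetical. The point is only that processing one slice at a time gives the same result while each step handles a smaller score matrix.

```elixir
defmodule SlicedAttention do
  # q is a list of query rows; k and v are lists of key and value rows.
  def attention(q, k, v) do
    scale = :math.sqrt(length(hd(q)))

    Enum.map(q, fn q_row ->
      scores = Enum.map(k, fn k_row -> dot(q_row, k_row) / scale end)
      weights = softmax(scores)

      # Weighted sum of the value rows, column by column.
      v
      |> Enum.zip(weights)
      |> Enum.map(fn {v_row, w} -> Enum.map(v_row, &(&1 * w)) end)
      |> Enum.zip_with(&Enum.sum/1)
    end)
  end

  # Same computation, but only `slice_size` query rows are handled per iteration,
  # so each iteration deals with slice_size x length(k) scores instead of all of them.
  def sliced_attention(q, k, v, slice_size) do
    q
    |> Enum.chunk_every(slice_size)
    |> Enum.flat_map(&attention(&1, k, v))
  end

  defp dot(a, b), do: a |> Enum.zip_with(b, &(&1 * &2)) |> Enum.sum()

  # Numerically stable softmax over a list of floats.
  defp softmax(xs) do
    max = Enum.max(xs)
    exps = Enum.map(xs, &:math.exp(&1 - max))
    sum = Enum.sum(exps)
    Enum.map(exps, &(&1 / sum))
  end
end

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.5, -1.0]]
v = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]

IO.inspect(SlicedAttention.attention(q, k, v) == SlicedAttention.sliced_attention(q, k, v, 2))
```

In a defn version the chunk loop would become a `while` over slices, which is where the similarity to a defn-level flash attention comes from.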

jonatanklosko · 2024-02-26T10:28:09Z

The main part of StableDiffusion is the iterative U-Net model pass, which happens for a specified number of timesteps. DeepCache is about reusing some of the intermediate layer outputs across some diffusion iterations, that is, outputs expected to change slowly over time.

This technique is not going to reduce memory usage, because we still need to periodically do an uncached model pass. Given that we need to keep the cached intermediate results, it can increase the usage if anything. It can give a significant speedup, assuming we do a fair number of steps. For SD Turbo or LCM, where we do 1 or at most a few steps, the caching is not applicable.
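To illustrate why the number of steps matters, here is a toy sketch that counts how many full (uncached) U-Net passes remain under a fixed caching interval. The `CacheSchedule` module and the interval of 3 are assumptions for illustration only, not DeepCache's actual schedule or defaults.

```elixir
defmodule CacheSchedule do
  # For each step, says whether the full U-Net pass runs (:full) or the
  # previously cached intermediate outputs are reused (:cached).
  def plan(num_steps, interval) when num_steps > 0 and interval > 0 do
    for step <- 0..(num_steps - 1) do
      if rem(step, interval) == 0, do: :full, else: :cached
    end
  end

  # Number of steps that still need the full model pass.
  def full_passes(num_steps, interval) do
    num_steps |> plan(interval) |> Enum.count(&(&1 == :full))
  end
end

# 20 steps, refreshing the cache every 3rd step.
IO.inspect(CacheSchedule.full_passes(20, 3))

# A single step, as with SD Turbo: the only pass has to be a full one.
IO.inspect(CacheSchedule.full_passes(1, 3))
```

With 20 steps only 7 of them need the full pass under this toy schedule, while with a single step there is nothing to reuse, and in both cases peak memory is still set by the full pass plus whatever is kept in the cache.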

So this may be something we want to explore in the future, depending on SD3 and other research going forward, but I don't think it's immediately relevant for us now.

josevalim mentioned this issue Jan 12, 2023

Got OOM message with GTX3060 #101

Closed

josevalim mentioned this issue Feb 11, 2023

XLA run out of memory with Bumblebee example elixir-nx/nx#1093

Closed

jonatanklosko added the kind:chore Internal improvements label Mar 31, 2023

jonatanklosko mentioned this issue Nov 23, 2023

XLA unsupported on GFX1100 elixir-nx/xla#63

Closed

jonatanklosko mentioned this issue Dec 19, 2023

Reduce memory used by :preallocate_params #317

Merged
