[Forge 0.0.16] NeverOOM Checkbox #395

lllyasviel · 2024-02-25T03:11:04Z

lllyasviel
Feb 25, 2024
Maintainer

Here are two QoL tricks for making large images.

If you check Enabled for VAE (always tiled), Forge will always use tiled VAE to encode and decode images.
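To make the idea of tiling concrete, here is a minimal sketch of how an image (or latent) grid can be split into overlapping tiles so that only one tile has to be processed at a time. This is not Forge's implementation: the function name `tile_boxes`, the tile size of 64, and the overlap of 8 are all assumptions chosen for illustration.

```python
def tile_boxes(height, width, tile=64, overlap=8):
    """Return (top, left, bottom, right) boxes that cover a height x width grid.

    Neighbouring tiles share up to `overlap` rows/columns so seams can be blended.
    All parameter values here are illustrative, not Forge defaults.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    stride = tile - overlap

    def starts(size):
        # A single tile is enough when the whole axis fits inside it.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        # Add a final tile flush with the edge so nothing is left uncovered.
        positions.append(size - tile)
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]


if __name__ == "__main__":
    # A 128x128 grid split into tiles of at most 64x64.
    for box in tile_boxes(128, 128):
        print(box)
```

Because every box is at most `tile` x `tile`, the working set per step depends on the tile size rather than on the full image size, which is the general reason tiling trades speed for lower peak memory.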

If you check Enabled for UNet (always maximize offload), the GPU memory used for diffusion drops below 1.5GB for SDXL at 1024x1024 (and even lower for SD1.5). For example, this is me generating a 1024x1024 image using the SDXL base model with Enabled for UNet checked.

You can see that GPU memory usage always stays below 1.5GB, even for SDXL at 1024px.

With this, all of your remaining memory can go toward generating very large images.

By using Enabled for UNet (always maximize offload), I am able to generate images at 6553x6553 using 8GB of VRAM with SDXL. Previously this required multi-diffusion, which sacrifices result quality a bit, but now you can generate at that resolution natively in a single pass.
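For a sense of scale, the arithmetic below compares 6553x6553 with 1024x1024. It is only a back-of-the-envelope sketch: the helper names are hypothetical, and the factor of 8 is the usual Stable Diffusion VAE downsampling factor, used here with integer division as a simplification.

```python
def pixel_ratio(side_large, side_small):
    """How many times more pixels a square image of side_large has."""
    return (side_large * side_large) / (side_small * side_small)


def latent_side(side_px, vae_factor=8):
    """Approximate latent side length, assuming the VAE downsamples by vae_factor."""
    return side_px // vae_factor


if __name__ == "__main__":
    print(f"pixel ratio: {pixel_ratio(6553, 1024):.2f}")
    print(f"latent side at 1024px: {latent_side(1024)}")
    print(f"latent side at 6553px: {latent_side(6553)}")
```

The larger image has roughly 41 times as many pixels, so per-step activations are far larger than at 1024px, which is why keeping the model weights off the GPU frees room that matters at this size.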

Below are screenshots of diffusing at 6553x6553 natively using SDXL with 8GB of VRAM.

lllyasviel · 2024-02-25T03:13:46Z

lllyasviel
Feb 25, 2024
Maintainer Author

Besides, we have recently added a few cmd flags to tune performance.

The text below is taken from the Readme:

--always-offload-from-vram (This flag will make things slower but less risky). This option makes Forge always unload models from VRAM. This can be useful if you run several programs together and want Forge to use less VRAM and leave some for the other software, or when you are using old extensions that compete with Forge for VRAM, or (very rarely) when you get OOM.
--cuda-malloc (This flag will make things faster but more risky). This asks PyTorch to use cudaMallocAsync for tensor allocation. In some profilers I can observe a performance gain at the millisecond level, but the real speed-up on most of my devices is usually unnoticeable (about 0.1 second per image or less). This cannot be set as the default because many users have reported that the async malloc crashes the program. Users need to enable this cmd flag at their own risk.
--cuda-stream (This flag will make things faster but more risky). This uses PyTorch CUDA streams (a special type of thread on the GPU) to move models and compute tensors simultaneously. This can eliminate almost all model-moving time and speed up SDXL on 30XX/40XX devices with small VRAM (e.g., RTX 4050 6GB, RTX 3060 Laptop 6GB, etc.) by about 15% to 25%. However, this unfortunately cannot be set as the default because I observe a higher chance of pure black images (NaN outputs) on the 2060, and a higher chance of OOM on the 1080 and 2060. When the resolution is large, the computation time of a single attention layer can be longer than the time needed to move the entire model to the GPU. When that happens, the next attention layer will OOM, since the GPU is filled with the entire model and no space remains for computing another attention layer. Most overhead-detection methods are not robust enough to be reliable on old devices (in my tests). Users need to enable this cmd flag at their own risk.
--pin-shared-memory (This flag will make things faster but more risky). Effective only when used together with --cuda-stream. This offloads modules to Shared GPU Memory instead of system RAM when offloading models. On some 30XX/40XX devices with small VRAM (e.g., RTX 4050 6GB, RTX 3060 Laptop 6GB, etc.), I can observe a significant (at least 20%) speed-up for SDXL. However, this unfortunately cannot be set as the default because an OOM in Shared GPU Memory is a much more severe problem than an ordinary GPU memory OOM. PyTorch does not provide any robust method to unload or detect Shared GPU Memory. Once Shared GPU Memory runs out, the entire program crashes (observed with SDXL on GTX 1060/1050/1066), and there is no dynamic method to prevent or recover from the crash. Users need to enable this cmd flag at their own risk.
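The relationships between these flags can be summarized in a small pre-launch check. This is a hypothetical helper, not part of Forge: the function `check_flags` and the `RISKY_FLAGS` set are assumptions made for illustration, and only the flag names and the rule that --pin-shared-memory needs --cuda-stream come from the text above.

```python
# Flags the text above describes as "faster but more risky".
RISKY_FLAGS = {"--cuda-malloc", "--cuda-stream", "--pin-shared-memory"}


def check_flags(flags):
    """Return a list of human-readable warnings for a set of launch flags."""
    chosen = set(flags)
    warnings = []
    # --pin-shared-memory is described as effective only with --cuda-stream.
    if "--pin-shared-memory" in chosen and "--cuda-stream" not in chosen:
        warnings.append(
            "--pin-shared-memory is only effective together with --cuda-stream"
        )
    for flag in sorted(chosen & RISKY_FLAGS):
        warnings.append(f"{flag} is faster but more risky; enable at your own risk")
    return warnings


if __name__ == "__main__":
    for message in check_flags(["--pin-shared-memory"]):
        print(message)
```

A check like this could run before building the launch command, so that an ineffective combination is caught before starting the UI.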

3 replies

ositoMalvado Jul 7, 2024


@lllyasviel Hi master, is there any way to set the Never OOM options by calling the API?

TimmekHW Aug 27, 2024

I support you! I also want to change the status or save Never OOM with API

ositoMalvado Oct 19, 2024

I support you! I also want to change the status or save Never OOM with API

You can do it actually.

patientx · 2024-02-28T00:00:39Z

patientx
Feb 28, 2024

How can we change the tile size with this method (especially the VAE tile size)? It seems to use a very small size, and in my tests with other GUIs I found that I can enter a larger size without getting memory errors and get faster VAE decoding. In ComfyUI, tiled VAE at 512 and 768 runs at about the same speed, but if I use 960 (which was also the highest I could generate with SD1.5), the generation time drops significantly without OOM.

0 replies

RedDestiny · 2024-07-21T03:20:28Z

RedDestiny
Jul 21, 2024

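I found a bug in NeverOOM: it often throws an error in img2img mode, while txt2img mode works fine:
Traceback (most recent call last):
  File "D:\SSD\stable-diffusion-webui-forge\modules\call_queue.py", line 57, in f
    res = list(func(*args, **kwargs))
TypeError: 'NoneType' object is not iterable

My images are fairly large, and VAE decoding in particular uses a lot of VRAM. This option saves a lot of VRAM and speeds up decoding, so I really need it.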

1 reply

RedDestiny Oct 19, 2024

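I have figured out the cause: the width and height of the resolution were not a multiple of what SD requires, so each time I have to go through the hassle of aligning the resolution to SD's multiple (a multiple of 4 or 8, I think).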
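The alignment step described above can be sketched as a small helper. The function names are hypothetical, and the default multiple of 8 is an assumption (the comment above is unsure whether it is 4 or 8; 8 matches the usual SD VAE downsampling factor), so adjust it if your setup needs a different value.

```python
def align_down(value, multiple=8):
    """Round value down to the nearest multiple, but never below one multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return max(multiple, (value // multiple) * multiple)


def align_resolution(width, height, multiple=8):
    """Return (width, height) with both sides aligned to the given multiple."""
    return align_down(width, multiple), align_down(height, multiple)


if __name__ == "__main__":
    print(align_resolution(1023, 770))
```

Rounding down rather than up keeps the result no larger than the requested size, so it does not increase memory use.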
