VRAM info #4

Open
C00reNUT opened this issue Nov 22, 2024 · 22 comments

Comments

@C00reNUT

A short passage about VRAM requirements would be nice :)

@nitinmukesh

Yeah. I also want to know how much VRAM is required for inference.

@i-amgeek

Same question. Would be good to know VRAM usage for various dimensions.

@ivanstepanovftw

8 GiB is not enough 😿

@x4080

x4080 commented Nov 24, 2024

even 16GB is not enough

@DsnTgr

DsnTgr commented Nov 24, 2024

even 24GB is not enough

@joseph16388

Need an 8-bit version.

@WangRongsheng

For reference:

[image attachment not preserved]

@x4080

x4080 commented Nov 24, 2024

Does it need at least 32 GB? Has anyone made a quantized version?

@KT313

KT313 commented Nov 25, 2024

I modified the inference script and got it to run with a peak usage of 15264 MiB of VRAM (according to nvtop; inference at 512x768 resolution with 100 frames). On a 16 GiB GPU you may need to close anything else that uses VRAM, but it should work.

I put the modified files here: https://github.com/KT313/LTX_Video_better_vram

It should work if you just drag and drop the files into your LTX-Video folder.

It works by offloading everything that is not currently needed from VRAM to CPU memory during each of the inference steps.
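
A minimal sketch of the offloading idea, not the code from the linked repository: a context manager moves a module to the GPU only while it is in use and then back to CPU memory. The `on_device` helper is hypothetical, and the small `nn.Linear` layers are stand-ins for the real text encoder and denoising model. It falls back to the CPU when no GPU is present.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def on_device(module: nn.Module, device: torch.device):
    """Keep `module` on `device` inside the block, then move it back to CPU."""
    module.to(device)
    try:
        yield module
    finally:
        module.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached blocks back to the driver


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the text encoder and the denoising model.
text_encoder = nn.Linear(16, 32)
denoiser = nn.Linear(32, 32)

with torch.no_grad():
    # Stage 1: only the text encoder occupies the accelerator.
    with on_device(text_encoder, device) as enc:
        prompt_embeds = enc(torch.randn(1, 16, device=device))

    # Stage 2: only the denoiser occupies the accelerator.
    latents = torch.randn(1, 32, device=device)
    with on_device(denoiser, device) as net:
        for _ in range(3):  # a few toy "denoising" steps
            latents = net(latents + prompt_embeds)

print(latents.shape, next(text_encoder.parameters()).device)
```

The trade-off is extra host-to-device transfer time on every stage in exchange for a lower peak VRAM footprint.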

@x4080

x4080 commented Nov 25, 2024

@KT313 Cool, I'll try your solution.
Edit: It works. Will it need more VRAM if more frames are generated?
Edit2: It only works the first time, and then it shows this error:

ValueError: Cannot generate a cpu tensor from a generator of type cuda.

Edit3: It works again when using the suggested resolution. Previously I was testing at 384x672; at 512x768 with 30 frames it works repeatedly. I don't know what causes the error above, though.

Edit4: The error above appears again when using 60 frames, so maybe it is actually an OOM error.

@KT313

KT313 commented Nov 26, 2024

@x4080
I made some modifications here so that the tensors get generated on the generator's device (cuda): https://github.com/KT313/LTX_Video_better_vram/tree/test
I can't test it at the moment, though. Let me know if that works better.
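
A minimal sketch of that kind of fix, not the actual change on the test branch: sample the noise on whatever device the generator lives on, then move the result to the device where it is needed. The helper name `randn_on_generator_device` is hypothetical.

```python
import torch


def randn_on_generator_device(shape, generator, device, dtype=torch.float32):
    """Sample noise on the generator's own device, then move it to `device`."""
    noise = torch.randn(
        shape, generator=generator, device=generator.device, dtype=dtype
    )
    return noise.to(device)


# Use a CUDA generator when available, otherwise fall back to CPU.
gen_device = "cuda" if torch.cuda.is_available() else "cpu"
generator = torch.Generator(device=gen_device).manual_seed(0)

# The target device may differ from the generator's device without error.
noise = randn_on_generator_device((1, 4, 8, 8), generator, device="cpu")
print(noise.shape, noise.device)
```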

Regarding your first edit: yes. The size of the latent tensor (which basically contains the video) depends on the resolution (height x width x frames, plus a bit extra from padding), so increasing the number of frames makes the tensor larger, which needs more VRAM. However, I think the tensor itself is quite small compared to the VRAM needed for the transformer model, so you might be able to increase the frames a bit without issues.
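
A rough back-of-the-envelope sketch of that point. The compression factors (32x spatially, 8x temporally), the 128 latent channels, and the 2-byte bfloat16 storage are assumptions for illustration, not values confirmed in this thread, and `latent_bytes` is a hypothetical helper.

```python
import math


def latent_bytes(height, width, frames, *, spatial=32, temporal=8,
                 channels=128, bytes_per_value=2):
    """Estimate the size in bytes of one latent video tensor.

    All keyword defaults are assumptions for illustration only.
    """
    latent_h = math.ceil(height / spatial)
    latent_w = math.ceil(width / spatial)
    latent_t = math.ceil(frames / temporal)
    return channels * latent_t * latent_h * latent_w * bytes_per_value


for h, w, f in [(512, 768, 30), (512, 768, 100), (704, 480, 257)]:
    mib = latent_bytes(h, w, f) / 1024 ** 2
    print(f"{h}x{w}, {f} frames: ~{mib:.2f} MiB")
```

Under these assumptions the latent tensor stays in the low-MiB range. This counts only the latent itself; the intermediate activations inside the model grow with the same dimensions and are typically much larger.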

@MarcosRodrigoT

MarcosRodrigoT commented Nov 26, 2024

> @x4080 i made some modifications here so the tensors should get generated on the generators device (cuda): https://github.com/KT313/LTX_Video_better_vram/tree/test I cannot test it currently though, let me know if that works better
>
> and regarding your first edit: yes, since the size of the latent tensor (that basically contains the video) depends on the resolution (height x width x frames (+ a bit extra from padding)), increasing frames will make the tensor larger which will need more vram. But actually i think that compared to the vram needed for the unet model, the tensor itself is quite small so you might be able to increase the frames a bit without issues

First of all, thank you for implementing this so that it takes less VRAM. I have tried it out a couple of times (at a resolution of 704x480 with 257 frames) and it works like a charm, using only around 16 GB on a 4090 GPU. However, it randomly throws an error related to "cpu" and "cuda" tensors. Re-running the script usually works, so it is not a big deal.

This was the error:

Traceback (most recent call last):
  File "/home/mrt/Projects/LTX-Video/inference.py", line 452, in <module>
    main()
  File "/home/mrt/Projects/LTX-Video/inference.py", line 356, in main
    images = pipeline(
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/ltx_video/pipelines/pipeline_ltx_video.py", line 1039, in __call__
    noise_pred = self.transformer(
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/ltx_video/models/transformers/transformer3d.py", line 419, in forward
    encoder_hidden_states = self.caption_projection(encoder_hidden_states)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 1607, in forward
    hidden_states = self.linear_1(caption)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

@x4080

x4080 commented Nov 26, 2024

@MarcosRodrigoT Are you using the new test files from @KT313, or the previous ones?
@KT313 Is your new test code meant for multiple GPUs?

Edit: I tried the test files and they handle more frames than before, but I still see the same error. Retrying somehow works. What is really going on, and why does rerunning the command work?

Edit2: @KT313 Maybe this line in inference.py causes the CUDA/CPU inconsistency?

    if torch.cuda.is_available() and args.disable_load_needed_only:
        pipeline = pipeline.to("cuda")

Edit4: I think it works better if the above is replaced with just

pipeline = pipeline.to("cuda")

to prevent

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
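
A minimal sketch for diagnosing this kind of mismatch: list the devices that each component's tensors live on before running inference. The component names and the `devices_by_component` helper are hypothetical, and small `nn.Linear` layers stand in for the real models.

```python
import torch
from torch import nn


def devices_by_component(components):
    """Map each component name to the set of devices its tensors use."""
    report = {}
    for name, module in components.items():
        tensors = list(module.parameters()) + list(module.buffers())
        report[name] = {str(t.device) for t in tensors}
    return report


# Hypothetical stand-ins for the pipeline's sub-models.
components = {
    "text_encoder": nn.Linear(8, 8),
    "transformer": nn.Linear(8, 8),
}
if torch.cuda.is_available():
    # Simulate a partially moved pipeline.
    components["transformer"].to("cuda")

report = devices_by_component(components)
all_devices = set().union(*report.values())
print(report)
print("mixed devices:", len(all_devices) > 1)
```

If more than one device shows up, either the inputs or the module must be moved before the forward call that raised the error.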

@KT313

KT313 commented Nov 27, 2024

@x4080
I changed the code on the test branch to

    if torch.cuda.is_available():
        pipeline = pipeline.to("cuda")

as you suggested. You might be able to get away with less than 16 GiB if you don't load the whole pipeline to CUDA at the start and instead load only the text encoder first, unload it, and then load the transformer. That would require more experimentation, though, so if your suggestion works it's the easiest option for now.

I only tried it on a single GPU (4090). I'm not sure about multi-GPU, but the original code also doesn't have anything that specifically hints at multi-GPU support, at least not in the parts that I modified.

@x4080

x4080 commented Nov 27, 2024

@KT313 thanks

@KT313

KT313 commented Nov 28, 2024

By the way, for future readers: you might be able to get away with as little as 6 to 8 GB if the text embedding is done on the CPU or in a separate step. The generation model itself should only need about 4-5 GiB when loaded in bfloat16 (2 bytes per parameter), plus some extra for the latent video tensor.
Most of the VRAM is currently taken up by the text embedding model, which is comparatively huge. Embedding the text on the CPU might be pretty slow, though.
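
The arithmetic behind that estimate can be sketched as follows. The ~2B parameter count is taken from the checkpoint name `ltx-video-2b` that appears later in this thread; the ~4.7B figure for the T5-XXL encoder is an approximate assumption. `weights_gib` is a hypothetical helper and counts weights only, not activations or framework overhead.

```python
GIB = 1024 ** 3


def weights_gib(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


# Approximate, assumed parameter counts.
models = {
    "video transformer (~2B)": 2.0e9,
    "T5-XXL text encoder (~4.7B)": 4.7e9,
}
dtypes = {"float32": 4, "bfloat16": 2, "fp8": 1}

for name, n_params in models.items():
    for dtype, size in dtypes.items():
        print(f"{name:30s} {dtype:9s} ~{weights_gib(n_params, size):.2f} GiB")
```

Under these assumptions the text encoder needs more than twice the memory of the video model at the same precision, which matches the observation that it dominates VRAM usage.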

@anujsinha72094

anujsinha72094 commented Nov 28, 2024

@KT313 I tried with width: 1280, height: 704, num_frames: 201, fps: 16.
The video is fine up to frame 160, but the remaining 41 frames are not good and contain noise.
Why?

@KT313

KT313 commented Nov 28, 2024

@anujsinha72094
Pretty unlikely to be related to the changes I made lol

@able2608

able2608 commented Dec 1, 2024

> btw just for future readers, you might be able to get away with something as low as 8 or 6 GB if the text embedding gets done on cpu or separately somehow. the generation model itself should only need about 4-5GiB if loaded in bfloat16 (2 bytes per parameter) + some extra for the latent video tensor. Most of the vram currently gets clogged up by the text_embedding model which is comparatively huge. If the text gets embedded to tensors on cpu it might be pretty slow though.

It seems that under the hood this uses PixArt-alpha's text encoder, which is T5 XXL version 1.1. GGUF versions of this model already exist, since FLUX from Black Forest Labs uses the same text encoder. I have been able to generate images with FLUX using such a setup (loading T5 in GGUF form and offloading it after text encoding) on a laptop GPU with 6 GB of VRAM and 16 GB of RAM. Using this method could perhaps reduce the memory requirements a lot, at least enough to run the model on limited resources.

PS: Technically, T5 XXL and T5 XXL v1.1 have some differences besides training strategy, mainly in the activation function and in parameter sharing between the embedding and classifier layers. I have not tested whether this increases memory usage, but since these changes are relatively minor, I think the experience with T5 XXL can be extrapolated.

Edit: It seems that the ComfyUI integration uses separate nodes for loading the text encoder and the diffusion model. A good starting point might be to replace the text encoder loader from the official repo with the GGUF CLIP loader provided by city96's GGUF nodes and see whether it works. For those who have trouble finding the GGUF loaders, the repo is here: https://github.com/city96/ComfyUI-GGUF

@x4080

x4080 commented Dec 3, 2024

Would it be possible to use this model? https://huggingface.co/Symphone/ltx-video-2b-v0.9-fp8

@able2608

able2608 commented Dec 4, 2024

> Is it possible if we use this model ? https://huggingface.co/Symphone/ltx-video-2b-v0.9-fp8

From my testing so far, this is probably not needed if you have at least 6 GB of VRAM. I have been able to generate 512x768 videos with 97 frames at a reasonable speed (under 2 minutes, if I recall correctly), and the bottleneck was (still) the CLIP text encode step.
For anyone running on constrained hardware, I recommend ComfyUI, which has built-in memory management and makes it easy to swap between model loaders, together with a quantized text encoder to save resources (and time) on condition encoding. (The version of T5 XXL I am using is https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp8_e4m3fn.safetensors.) If you can get models like FLUX or SD3.5 running on your hardware, the quantized DiT is probably not needed.
As a side note, the official ComfyUI LTX loaders do not seem to be optimized for memory. The workflow that I (and many others) found to work best comes from comfyanonymous (https://comfyanonymous.github.io/ComfyUI_examples/ltxv/). It replaces the loaders with generic ComfyUI loaders and works like a charm. If loading the full-precision text encoder is not possible, replace the Load CLIP node with the GGUF version from city96 (https://github.com/city96/ComfyUI-GGUF) and you're good to go.

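Summary (steps to run on constrained hardware):

1. Run LTX-Video through ComfyUI rather than the standalone inference script.
2. Use the workflow from comfyanonymous (https://comfyanonymous.github.io/ComfyUI_examples/ltxv/), which relies on the generic ComfyUI loaders.
3. Load a quantized T5 XXL text encoder, such as the fp8 version linked above.
4. If the text encoder still does not fit, replace the Load CLIP node with the GGUF loader from city96 (https://github.com/city96/ComfyUI-GGUF).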

@x4080

x4080 commented Dec 4, 2024

@able2608 Thanks for the advice, I'll try it
