VRAM info #4
Comments
Yeah. I also want to know how much VRAM is required for inference.
Same question. It would be good to know the VRAM usage for various dimensions.
8 GiB is not enough 😿
Even 16 GB is not enough.
Even 24 GB is not enough.
Need an 8-bit version.
Needs at least 32 GB? Quant, anyone?
I modified the inference script and got it to run with a peak usage of 15264 MiB of VRAM (according to nvtop; inference done at resolution 512x768 with 100 frames). You may need to close anything else that uses VRAM if you're on a 16 GiB GPU, but it should work. I put the modified files here: https://github.com/KT313/LTX_Video_better_vram It should work if you just drag and drop the files into your LTX-Video folder. It works by basically offloading everything that is not needed in VRAM to CPU memory during each of the inference steps.
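The offloading idea described above can be sketched roughly as follows. This is not the code from the linked repository: `on_device` is a hypothetical helper, and it only assumes that each pipeline component has a PyTorch-style `.to(device)` method.

```python
from contextlib import contextmanager


@contextmanager
def on_device(module, device="cuda", offload_device="cpu"):
    """Keep `module` on `device` only while the with-block runs.

    Hypothetical helper: `module` is anything with a PyTorch-style
    `.to(device)` method (e.g. a torch.nn.Module, whose .to() moves
    its parameters in place).
    """
    module.to(device)
    try:
        yield module
    finally:
        # Move the weights back to CPU memory even if the step raised.
        module.to(offload_device)
```

Wrapping each stage (text encoder, denoising model, VAE decoder) in `with on_device(component): ...` means only one of them occupies VRAM at a time, at the cost of extra transfer time. With PyTorch, calling `torch.cuda.empty_cache()` after the block returns cached memory to the driver, which makes the drop visible in tools like nvtop.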
@KT313 cool, I'll try your solution
Edit 3: Now it works again when using the suggested resolution (previously I was testing at 384x672). It works at 512x768 with 30 frames, and repeated runs work too; I don't know why the error above occurred, though. Edit 4: The error above appears again when using 60 frames, so it may be an OOM error.
@x4080 and regarding your first edit: yes. The size of the latent tensor (which basically contains the video) depends on the resolution (height x width x frames, plus a bit extra from padding), so increasing the frame count makes the tensor larger and needs more VRAM. But compared to the VRAM needed for the unet model, I think the tensor itself is quite small, so you might be able to increase the frames a bit without issues.
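To put a rough number on "quite small", here is a back-of-the-envelope sketch. The function name is made up, and the default compression factors (32x spatial, 8x temporal) and the 128 latent channels are assumptions about the VAE rather than values stated in this thread, so treat the result as an order of magnitude only.

```python
def latent_tensor_bytes(height, width, frames,
                        spatial_factor=32, temporal_factor=8,
                        channels=128, bytes_per_value=2):
    """Rough size of the latent video tensor in bytes.

    The default compression factors and channel count are assumptions
    about the VAE, not values taken from this thread; adjust as needed.
    """
    def ceil_div(a, b):
        return -(-a // b)  # rounds up, which stands in for padding

    latent_h = ceil_div(height, spatial_factor)
    latent_w = ceil_div(width, spatial_factor)
    # Assumption: the first frame is kept and the rest are grouped in time.
    latent_t = ceil_div(frames - 1, temporal_factor) + 1
    return latent_h * latent_w * latent_t * channels * bytes_per_value


size = latent_tensor_bytes(512, 768, 97)
print(f"{size / 2**20:.2f} MiB")
```

Under these assumptions a 512x768, 97-frame latent in bfloat16 comes out at about 1.2 MiB, far below the size of the model weights. Intermediate activations computed from it during denoising are larger than the latent itself, so this is not a full estimate of the per-frame cost.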
First of all, thank you for implementing this so that it takes less VRAM. I have tried it out a couple of times (at a resolution of 704x480 with 257 frames) and it works like a charm, using only around 16 GB on a 4090 GPU. However, it randomly throws an error related to "cpu" and "cuda" tensors. Re-running the script usually works, so it is not a big deal. This was the error:
Traceback (most recent call last):
  File "/home/mrt/Projects/LTX-Video/inference.py", line 452, in <module>
    main()
  File "/home/mrt/Projects/LTX-Video/inference.py", line 356, in main
    images = pipeline(
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/ltx_video/pipelines/pipeline_ltx_video.py", line 1039, in __call__
    noise_pred = self.transformer(
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/ltx_video/models/transformers/transformer3d.py", line 419, in forward
    encoder_hidden_states = self.caption_projection(encoder_hidden_states)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 1607, in forward
    hidden_states = self.linear_1(caption)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
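One way to narrow down an error like this is to list which parameters sit on which device just before the failing call. The helper below is a hypothetical diagnostic, not part of LTX-Video; it only assumes an iterable of `(name, tensor)` pairs whose tensors expose a `.device` attribute, which is what `named_parameters()` yields on a `torch.nn.Module`.

```python
from collections import defaultdict


def devices_in_use(named_tensors):
    """Group tensor names by the device they live on.

    `named_tensors` is any iterable of (name, tensor) pairs where each
    tensor has a `.device` attribute, e.g. the result of
    `module.named_parameters()` on a torch.nn.Module.
    """
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)
```

For example, `devices_in_use(pipeline.transformer.named_parameters())` returning more than one key would mean part of the model was left on the CPU. The message names `mat1`, which is the input to the linear layer, so the `.device` of `encoder_hidden_states` is worth checking as well.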
@MarcosRodrigoT Are you using the new test file from @KT313, or the previous one? Edit: I tried the test file and it handles more frames than the previous one, but I see the same error; retrying somehow works. What is really going on, and why does rerunning the command work? Edit 2: @KT313 maybe this line is causing the CUDA/CPU inconsistency? (in inference.py)
Edit 4 : I think it works better if above replaced with just
to prevent
|
@x4080
as you suggested. You might be able to get away with less than 16 GiB if you don't load the whole pipeline to CUDA at the beginning, and instead load only the text encoder first, unload it, and then load the unet. That would require more experimenting, though, so if your suggestion works it's the easiest option for now. I tried it on a single GPU only (4090). I'm not sure about multi-GPU, but the original code doesn't have anything that specifically hints at multi-GPU support either, at least not in the parts that I modified.
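The "text encoder first, then the generation model" ordering could look something like the sketch below. It is untested against LTX-Video itself: `encode_then_generate` and both loader arguments are hypothetical, and the only assumption about the models is that they are callable and have a PyTorch-style `.to(device)` that returns the model.

```python
import gc


def encode_then_generate(load_text_encoder, load_generator, prompt,
                         device="cuda"):
    """Run text encoding and generation as two separate GPU stages.

    Hypothetical sketch: both loaders are caller-supplied functions that
    return a callable model with a PyTorch-style `.to(device)` method.
    """
    # Stage 1: only the text encoder is on the GPU.
    text_encoder = load_text_encoder().to(device)
    embeddings = text_encoder(prompt)

    # Free the encoder before the generation model is loaded.
    text_encoder.to("cpu")
    del text_encoder
    gc.collect()
    # With PyTorch, torch.cuda.empty_cache() would go here.

    # Stage 2: only the generation model is on the GPU.
    generator = load_generator().to(device)
    return generator(embeddings)
```

Passing loaders instead of already-loaded models is the point of the design: the generation model is not even read from disk until the text encoder has been released, so peak VRAM is set by the larger of the two rather than their sum.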
@KT313 thanks |
Btw, just for future readers: you might be able to get away with something as low as 8 or 6 GB if the text embedding is computed on the CPU or separately somehow. The generation model itself should only need about 4-5 GiB if loaded in bfloat16 (2 bytes per parameter), plus some extra for the latent video tensor.
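The arithmetic behind that estimate is simple enough to check. The sketch below assumes roughly 2 billion parameters for the generation model, a figure inferred from the "2b" in the checkpoint name linked later in this thread rather than from an official spec; `weights_gib` is a made-up helper name.

```python
def weights_gib(num_params, bytes_per_param=2):
    """Memory needed just for the weights, in GiB (2 bytes = bfloat16)."""
    return num_params * bytes_per_param / 2**30


# Assumed parameter count (~2 billion), inferred from the checkpoint name.
for label, nbytes in [("float32", 4), ("bfloat16", 2), ("8-bit", 1)]:
    print(f"{label:>8}: {weights_gib(2e9, nbytes):.1f} GiB")
```

With that assumption the bfloat16 weights come to about 3.7 GiB, which is roughly in line with the 4-5 GiB figure once activations and other buffers are added on top.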
@KT313 I tried with width:1280, |
@anujsinha72094 |
It seems that under the hood this uses PixArt-alpha's text encoder, which is T5 XXL version 1.1. GGUF versions of these models currently exist, and Flux from Black Forest Labs uses them as well. I have been able to generate images with Flux using such a setup (loading T5 in GGUF mode and offloading it after text encoding) on a laptop GPU with 6 GB of VRAM and 16 GB of RAM. Perhaps using this method could reduce the memory requirements by a lot (at least enough to run it on limited resources).
PS: Technically, T5 XXL and T5 XXL v1.1 have some differences besides training strategy, mainly in the activation function and in parameter sharing between the embedding and classification layers. I have not tested whether this increases memory usage, but since these changes are relatively minor, I think the experience with T5 XXL can be extrapolated.
Edit: It seems that the ComfyUI integration uses separate nodes for text encoder loading and diffuser loading. Perhaps a good starting point would be to replace the text encoder loader from the official repo with the GGUF clip loader provided by city96's GGUF nodes and see whether it works. For those who have trouble finding the GGUF loaders, the repo is here: https://github.com/city96/ComfyUI-GGUF
Is it possible to use this model? https://huggingface.co/Symphone/ltx-video-2b-v0.9-fp8
From my current testing, this is probably not needed if you have at least 6 GB of VRAM. I have been able to generate 512x768 videos with 97 frames at a reasonable speed (if I recall correctly, under 2 minutes), and the generation bottleneck was (still) the clip encode step. Summary (steps to run on constrained hardware):
|
@able2608 Thanks for the advice, I'll try it |
A small passage about VRAM requirements would be nice :)