Similar implementation to NVIDIA VideoLDM? #79
Comments
Hey @Maki9009.
I don't have it implemented in this repository yet, but you should be able to fine tune any current SD model on video. While that paper does do a bit more, the concepts are the same (add temporal attention and convolution layers after each pre-trained spatial layer).
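To make the "temporal layers after each spatial layer" idea more concrete, here is a rough sketch that only tracks tensor shapes. It is not code from this repository or from the paper; the function names and the example sizes are assumptions chosen for illustration. The point is that the pre-trained spatial layers treat every frame as an independent image, while an inserted temporal attention layer regroups the same values so that each pixel location becomes a sequence over frames.

```python
def spatial_view(batch, frames, channels, height, width):
    """Shape seen by a pre-trained 2D (spatial) layer: frames are folded into the batch."""
    return (batch * frames, channels, height, width)


def temporal_view(batch, frames, channels, height, width):
    """Shape seen by a temporal attention layer: one sequence of `frames` tokens per pixel."""
    return (batch * height * width, frames, channels)


# Example sizes are made up for illustration.
video = dict(batch=2, frames=16, channels=320, height=32, width=32)
print(spatial_view(**video))   # (32, 320, 32, 32)
print(temporal_view(**video))  # (2048, 16, 320)
```

Both views hold the same number of values; only the grouping changes, which is why the spatial weights can stay as they are while the new temporal layers learn to mix information across frames.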
That implementation is not complete.
You should be able to merge two models trained on video data, but if you're talking about the pre-trained layers that were trained on images, you may still have to fine tune them to pick up the temporal information.
Hi @ExponentialML. So just to get it right: is it possible to fine tune any Dreambooth SD model to turn it into a txt2vid model? Is your implementation not ready, or could I attempt it right now? I'm just wondering what the process/guide would be to do that. Also, on your last point: I wouldn't be able to merge an image model into a video model directly. I would first need to fine tune it for the temporal layers, and then I can merge?
No problem. Yes, that's correct. The UNet3DConditionModel takes a UNet2DConditionModel, which this implementation uses. You would have to fine tune the temporal layers from scratch, which may take time (this is why people start from modelscope's model, since it is a great base). You may be able to replace the spatial layers with another SD model and keep modelscope's temporal layers, but again, I haven't tested or implemented that yet. Your last question is a bit tricky. What I would try is to merge the image model layers, then fine tune on arbitrary video data so the temporal layers can pick up the newly added data.
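A minimal sketch of the untested "replace the spatial layers, keep the temporal layers" idea, written against plain dictionaries so it runs without any model files. The `TEMPORAL_MARKERS` substrings and the helper names are assumptions for illustration, not something this repository provides; inspect the keys of your own checkpoints before relying on them, since key names between an image UNet and a video UNet may not line up one to one.

```python
# Substrings assumed to mark temporal parameters (an assumption; inspect your
# own checkpoint's keys and adjust before relying on this list).
TEMPORAL_MARKERS = ("temp_convs", "temp_attentions", "transformer_in")


def is_temporal(name):
    """Return True if a parameter name looks like a temporal layer."""
    return any(marker in name for marker in TEMPORAL_MARKERS)


def swap_spatial_layers(video_sd, image_sd):
    """Copy spatial weights from image_sd into a copy of video_sd.

    Temporal parameters, keys missing from the video model, and
    shape mismatches are left untouched and reported as skipped.
    """
    merged = dict(video_sd)
    replaced, skipped = [], []
    for name, value in image_sd.items():
        same_shape = (
            name in merged
            and getattr(merged[name], "shape", None) == getattr(value, "shape", None)
        )
        if is_temporal(name) or not same_shape:
            skipped.append(name)
            continue
        merged[name] = value
        replaced.append(name)
    return merged, replaced, skipped


# Toy "state dicts": strings stand in for weight tensors.
video_sd = {
    "down_blocks.0.resnets.0.conv1.weight": "modelscope-spatial",
    "down_blocks.0.temp_convs.0.conv1.weight": "modelscope-temporal",
}
image_sd = {
    "down_blocks.0.resnets.0.conv1.weight": "dreambooth-spatial",
    "extra.key.weight": "only-in-image-model",
}
merged, replaced, skipped = swap_spatial_layers(video_sd, image_sd)
print(replaced)  # keys whose spatial weights were swapped in
print(skipped)   # keys that were left alone
```

With real checkpoints the same function could be fed the two models' state dicts. As noted above, the temporal layers would likely still need some fine tuning on video afterwards to adapt to the swapped-in spatial weights.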
Yeah, that's what I was looking into: replacing the spatial layers with another SD model and keeping the temporal layers of modelscope or zeroscope. In a sense that would make it faster than fine tuning a new model, right? On average, how long does it take to fine tune the modelscope model with, let's say, 20-30 images, similar to how Dreambooth works? Or would that not be possible at all, or pointless? Let's say I want to put my cat into the txt2vid model / modelscope, similar to NVIDIA's example.
It depends on how you're training the Dreambooth. If it's just the spatial layers, it takes the same amount of time as other Dreambooth methods. If you're doing a full fine tune, it depends on how many frames you're training on (the frame count acts like a larger batch size).
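As a back-of-the-envelope illustration of the "frames act like a batch size" remark, the sketch below counts how many images the spatial layers process per training step. The helper name and the example numbers are assumptions for illustration only; actual time and memory also depend on resolution and on the temporal layers themselves.

```python
def images_per_step(batch_size, num_frames):
    """Number of frames the spatial layers see in one step (frames are folded into the batch)."""
    return batch_size * num_frames


# Image-only Dreambooth: each sample is a single frame.
print(images_per_step(batch_size=4, num_frames=1))   # 4
# Full video fine tune: one clip of 16 frames costs the spatial layers as much as 16 images.
print(images_per_step(batch_size=1, num_frames=16))  # 16
```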
Is there any possible way to get the same NVIDIA implementation, using SD models / Dreambooth models as a base for a txt2vid model?
https://research.nvidia.com/labs/toronto-ai/VideoLDM/
I saw this unofficial implementation, but I'm not sure where it's going.
https://github.com/srpkdyy/VideoLDM
Is there no way to use the modelscope or zeroscope model and, I don't know, merge them together or something like that? Or do some training or fine-tuning on top of a Dreambooth model?