This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Similar implementation to Nvidia VideoLDM? #79

Open
Maki9009 opened this issue Jun 25, 2023 · 5 comments


@Maki9009

Is there any possible way to do what Nvidia's implementation does, using SD models / Dreambooth models as a base for a txt2vid model?
https://research.nvidia.com/labs/toronto-ai/VideoLDM/

I saw this unofficial implementation, but I'm not sure what state it is in.
https://github.com/srpkdyy/VideoLDM

Is there no way to use the ModelScope model or the Zeroscope model and, I don't know, merge them together somehow? Or do some training or fine-tuning on top of a Dreambooth model?

@ExponentialML
Owner

ExponentialML commented Jun 25, 2023

Hey @Maki9009.

> Is there any possible way to do what Nvidia's implementation does, using SD models / Dreambooth models as a base for a txt2vid model?

I don't have it implemented in this repository yet, but you should be able to fine tune any current SD model on video. While that paper does do a bit more, the concepts are the same (add temporal attention and convolution layers after each pre-trained spatial layer).
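
To make the "temporal layers after each spatial layer" idea concrete, here is a minimal sketch of the shape bookkeeping involved. It contains no real layers. The function names and the example sizes are assumptions for illustration, and the layout shown (frames folded into the batch for spatial layers, pixels folded into the batch for temporal attention) is the common convention rather than something confirmed for this repository.

```python
# Shape bookkeeping only: which tensor shape each kind of layer would see
# for a video latent of shape (batch, channels, frames, height, width).
# All names here are hypothetical helpers, not part of any library.

def spatial_view(batch, channels, frames, height, width):
    """Pre-trained 2D (spatial) layers treat every frame as an independent image,
    so the frame axis is folded into the batch axis."""
    return (batch * frames, channels, height, width)

def temporal_attention_view(batch, channels, frames, height, width):
    """Temporal attention attends across frames at each pixel location,
    so the pixel locations are folded into the batch axis."""
    return (batch * height * width, frames, channels)

def temporal_conv_view(batch, channels, frames, height, width):
    """A temporal (3D) convolution keeps frames as an explicit axis."""
    return (batch, channels, frames, height, width)

video = dict(batch=2, channels=320, frames=16, height=32, width=32)
print(spatial_view(**video))             # (32, 320, 32, 32)
print(temporal_attention_view(**video))  # (2048, 16, 320)
print(temporal_conv_view(**video))       # (2, 320, 16, 32, 32)
```

Because the spatial layers only ever see image-shaped batches, weights from an image model can in principle be reused for them, while the temporal layers are the new part that has to learn motion.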

> I saw this unofficial implementation, but I'm not sure what state it is in.

That implementation is not complete.

> Is there no way to use the ModelScope model or the Zeroscope model and, I don't know, merge them together somehow? Or do some training or fine-tuning on top of a Dreambooth model?

You should be able to merge two models trained on video data, but if you're talking about the pre-trained layers trained on images, you may still have to fine-tune them to pick up the temporal information.
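
As a rough illustration of merging two checkpoints that share an architecture, the sketch below linearly interpolates every weight. It is a toy under stated assumptions: weights are plain lists of floats rather than tensors, the key names are made up, and `merge_state_dicts` is a hypothetical helper, not a function from this repository.

```python
def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two checkpoints: (1 - alpha) * a + alpha * b.

    Both checkpoints must have identical keys and matching weight sizes,
    which is why this only works for models with the same architecture.
    """
    if sd_a.keys() != sd_b.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in sd_a:
        wa, wb = sd_a[name], sd_b[name]
        if len(wa) != len(wb):
            raise ValueError(f"size mismatch for {name}")
        merged[name] = [(1 - alpha) * x + alpha * y for x, y in zip(wa, wb)]
    return merged

# Hypothetical toy checkpoints; real ones hold tensors, not lists.
model_a = {"spatial.conv.weight": [0.0, 2.0], "temporal.attn.weight": [1.0, 1.0]}
model_b = {"spatial.conv.weight": [4.0, 2.0], "temporal.attn.weight": [3.0, 5.0]}

merged = merge_state_dicts(model_a, model_b, alpha=0.25)
print(merged["spatial.conv.weight"])   # [1.0, 2.0]
print(merged["temporal.attn.weight"])  # [1.5, 2.0]
```

An image-only checkpoint has no temporal keys at all, so the key check above would fail for it. That is one way to see why an image model cannot simply be averaged into a video model.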

@Maki9009
Author

Hi @ExponentialML,
Thanks for responding.

So, just to get this right: it's possible to fine-tune any Dreambooth SD model to turn it into a txt2vid model? Is your implementation not ready, or could I attempt it right now?

I'm just wondering what the process/guide would be to do that.
Or are you still working on it?

Also, on your last point: I wouldn't be able to merge an image model into a video model directly? I would first need to fine-tune it for the temporal layers, and then I can merge?

@ExponentialML
Owner

No problem. Yes, that's correct. The UNet3DConditionModel takes a UNet2DConditionModel, which this implementation uses.

You would have to fine-tune the temporal layers from scratch, which may take time (this is why people start from ModelScope's weights: they make a great base).

You may be able to replace the spatial layers with those of another SD model and keep ModelScope's temporal layers, but again, I haven't tested or implemented it yet.
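
A minimal sketch of that untested idea, swapping in spatial weights from an image checkpoint while leaving the temporal weights alone, could look like the following. Everything in it is an assumption: the `"temp_"` marker used to recognise temporal layers, the key names, and the helper functions are hypothetical, and real checkpoints hold tensors rather than lists. Inspect the actual parameter names of your checkpoint before relying on any such filter.

```python
# Assumption: temporal layers can be recognised by a substring in the
# parameter name. "temp_" is a placeholder, not a confirmed naming scheme.
TEMPORAL_MARKERS = ("temp_",)

def is_temporal(name, markers=TEMPORAL_MARKERS):
    """True if the parameter name looks like it belongs to a temporal layer."""
    return any(marker in name for marker in markers)

def swap_spatial_weights(video_sd, image_sd):
    """Return a copy of video_sd whose spatial entries come from image_sd.

    Temporal entries are always kept. Spatial entries are replaced only when
    the image checkpoint has the same key with a weight of the same size.
    """
    out, replaced = {}, []
    for name, weight in video_sd.items():
        donor = image_sd.get(name)
        if not is_temporal(name) and donor is not None and len(donor) == len(weight):
            out[name] = list(donor)
            replaced.append(name)
        else:
            out[name] = list(weight)
    return out, replaced

# Hypothetical toy checkpoints with made-up key names.
video_sd = {
    "down.0.conv.weight": [0.1, 0.2],
    "down.0.temp_conv.weight": [0.5, 0.6],
    "down.0.temp_attn.weight": [0.7, 0.8],
}
image_sd = {"down.0.conv.weight": [9.0, 9.5]}

new_sd, replaced = swap_spatial_weights(video_sd, image_sd)
print(replaced)                           # ['down.0.conv.weight']
print(new_sd["down.0.temp_conv.weight"])  # [0.5, 0.6]
```

Even if the swap loads cleanly, the temporal layers were trained against the original spatial features, so some fine-tuning on video is likely still needed afterwards.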

On your last question, it's a bit tricky. What I would try is to merge the image model layers first, then fine-tune on arbitrary video data so the temporal layers can pick up the newly added data.

@Maki9009
Author

Yeah, that's what I was looking into: replacing the spatial layers with another SD model and keeping the temporal layers of ModelScope or Zeroscope. In a sense, that would be faster than fine-tuning a new model, right?

On average, how long does it take to fine-tune the ModelScope model with, let's say, 20-30 images, similar to how Dreambooth works? Or would that not be possible at all, or be pointless? Let's say I want to add my cat to the txt2vid model / ModelScope, similar to Nvidia's example.

@ExponentialML
Owner

> On average, how long does it take to fine-tune the ModelScope model with, let's say, 20-30 images, similar to how Dreambooth works? Or would that not be possible at all, or be pointless? Let's say I want to add my cat to the txt2vid model / ModelScope, similar to Nvidia's example.

It depends on how you're training the Dreambooth. If it's just the spatial layers, it takes the same amount of time as other Dreambooth methods. If you're doing a full fine-tune, it depends on how many frames you're training on (the frame count acts like a large batch size).
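
The "frames act like a large batch size" point can be put into rough numbers. The sketch below is only back-of-the-envelope accounting with a hypothetical helper and made-up settings. Actual time and memory also depend on resolution, the temporal layers, and hardware.

```python
def images_per_step(batch_size, num_frames, grad_accum_steps=1):
    """Number of individual frames that contribute to one optimizer step.

    Spatial layers process every frame as its own image, so a single video
    clip of num_frames frames costs roughly as much as that many images.
    """
    return batch_size * num_frames * grad_accum_steps

# Image-style Dreambooth: each sample is a single frame.
print(images_per_step(batch_size=1, num_frames=1))    # 1

# Full video fine-tune: one 16-frame clip per step.
print(images_per_step(batch_size=1, num_frames=16))   # 16

# Two 24-frame clips with gradient accumulation over 2 steps.
print(images_per_step(batch_size=2, num_frames=24, grad_accum_steps=2))  # 96
```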


2 participants