We are using this dataset, which you need to extract; place all the extracted files in a directory named data.
$ python3 main.py --epochs 40000
NOTE: on a Colab notebook, use the following commands:
!git clone link-to-repo
%run main.py --epochs 40000
usage: main.py [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
[--pre-train PRE_TRAIN] [--img_size IMG_SIZE] [--data DATA]
[--channel CHANNEL] [--hidden HIDDEN] [--dc DC] [--de DE]
[--lr LR] [--beta BETA] [--trained_path TRAINED_PATH] [--T T]
Start training MoCoGAN.....
optional arguments:
-h, --help show this help message and exit
--batch-size BATCH_SIZE
set batch_size
--epochs EPOCHS set num of iterations
--pre-train PRE_TRAIN
set 1 when you use pre-trained models
--img_size IMG_SIZE set the input image size of frame
--data DATA set the path for the directory containing the dataset
--channel CHANNEL set the no. of channels of the frame
--hidden HIDDEN set the hidden layer size for gru
--dc DC set the size of motion vector
--de DE set the size of randomly generated epsilon
--lr LR set the learning rate
--beta BETA set the beta for the optimizer
--trained_path TRAINED_PATH
set the path where the trained models are saved
--T T set the no. of frames to be selected
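For reference, here is a minimal sketch of how these flags could be declared with argparse; the defaults shown are illustrative assumptions, not the repository's actual values:

import argparse

parser = argparse.ArgumentParser(description='Start training MoCoGAN.....')
parser.add_argument('--batch-size', type=int, default=16, help='set batch_size')
parser.add_argument('--epochs', type=int, default=40000, help='set num of iterations')
parser.add_argument('--pre-train', type=int, default=0, help='set 1 when you use pre-trained models')
parser.add_argument('--img_size', type=int, default=96, help='set the input image size of frame')
parser.add_argument('--data', type=str, default='data', help='set the path for the directory containing the dataset')
parser.add_argument('--channel', type=int, default=3, help='set the no. of channels of the frame')
parser.add_argument('--hidden', type=int, default=100, help='set the hidden layer size for gru')
parser.add_argument('--dc', type=int, default=50, help='set the size of motion vector')
parser.add_argument('--de', type=int, default=10, help='set the size of randomly generated epsilon')
parser.add_argument('--lr', type=float, default=0.0002, help='set the learning rate')
parser.add_argument('--beta', type=float, default=0.5, help='set the beta for the optimizer')
parser.add_argument('--trained_path', type=str, default='trained_models', help='set the path where the trained models are saved')
parser.add_argument('--T', type=int, default=16, help='set the no. of frames to be selected')
args = parser.parse_args()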
- Title: MoCoGAN: Decomposing Motion and Content for Video Generation
- Authors: Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
- Link: https://arxiv.org/pdf/1707.04993.pdf
- Year: 2017
Visual signals in a video can be divided into content and motion: content specifies which objects are in the video, while motion describes their dynamics. Based on this observation, the MoCoGAN framework was proposed. The framework generates a video by mapping a sequence of randomly generated vectors to a sequence of video frames, where each random vector consists of a motion part and a content part.
To learn motion and content in an unsupervised manner, we introduce an adversarial learning scheme utilizing both an image discriminator and a video discriminator.
Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two ‘adversarial’ models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. Both G and D can be non-linear mapping functions, such as multi-layer perceptrons.
In MoCoGAN, we assume a latent space of images Z_I ≡ R^d, where each point z ∈ Z_I represents an image, and a video of K frames is represented by a path of length K in the latent space, [z^(1), ..., z^(K)]. By adopting this formulation, videos of different lengths can be generated by paths of different lengths. We further assume that Z_I is decomposed into a content subspace Z_C and a motion subspace Z_M. The content subspace models the motion-independent appearance in videos, while the motion subspace models the motion-dependent appearance in videos.
For a video, the content vector z_C is sampled once and fixed. Then, a series of random variables [e^(1), ..., e^(K)] is sampled and mapped to a series of motion codes [z_M^(1), ..., z_M^(K)] via the recurrent neural network R_M. We implement R_M using a one-layer GRU network. A generator G_I produces the k-th frame, x̃^(k), from the content and motion vectors {z_C, z_M^(k)}. The discriminators, D_I and D_V, are trained on real and fake images and videos, respectively, sampled from the training set v and the generated set ṽ. The function S_1 samples a single frame from a video; S_T samples T consecutive frames.
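As an illustration, here is a minimal sketch of R_M as a one-layer GRU that maps per-frame noise to motion codes and concatenates them with a fixed content code; the dimensions (d_e = 10, d_m = 10, d_c = 50) and batch handling are assumptions for illustration:

import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    # R_M: maps per-frame noise e^(1..K) to motion codes z_M^(1..K) with a one-layer GRU
    def __init__(self, d_e=10, d_m=10):
        super().__init__()
        self.gru = nn.GRUCell(d_e, d_m)
        self.d_e, self.d_m = d_e, d_m

    def forward(self, batch_size, K):
        h = torch.zeros(batch_size, self.d_m)      # initial hidden state
        codes = []
        for _ in range(K):
            e = torch.randn(batch_size, self.d_e)  # e^(k) ~ N(0, I), fresh noise per frame
            h = self.gru(e, h)                     # z_M^(k)
            codes.append(h)
        return torch.stack(codes, dim=1)           # (batch, K, d_m)

# z_C is sampled once per video; the per-frame latent is {z_C, z_M^(k)}
z_c = torch.randn(4, 50)                           # (batch, d_c)
z_m = MotionRNN()(4, K=16)                         # (batch, K, d_m)
z = torch.cat([z_c.unsqueeze(1).expand(-1, 16, -1), z_m], dim=2)  # (batch, K, 60)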
We train MoCoGAN using the alternating gradient update algorithm, as in the original GAN framework. In one step, we update D_I and D_V while fixing G_I and R_M; in the alternating step, we update G_I and R_M while fixing D_I and D_V. The two sides play a minimax game with the value function

F_V(D_I, D_V, G_I, R_M) = E_v[-log D_I(S_1(v))] + E_ṽ[-log(1 - D_I(S_1(ṽ)))] + E_v[-log D_V(S_T(v))] + E_ṽ[-log(1 - D_V(S_T(ṽ)))]

which the discriminators minimize and the generator side maximizes. In this objective function, the first and second terms train the image discriminator D_I to output 1 for frames sampled from real videos and 0 for frames sampled from generated videos. Similarly, the third and fourth terms train the video discriminator D_V on clips of T consecutive frames.
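A minimal sketch of the discriminator side of this objective as BCE losses, assuming videos shaped (batch, channels, frames, height, width); the sampling helpers S1/ST and all names here are illustrative assumptions:

import torch
import torch.nn.functional as F

def d_loss(D_I, D_V, real_video, fake_video, T):
    # discriminator side of F_V: real -> 1, fake -> 0 for both D_I and D_V
    def S1(v):                     # sample a single frame from a video
        k = torch.randint(v.size(2), (1,)).item()
        return v[:, :, k]
    def ST(v):                     # sample T consecutive frames
        s = torch.randint(v.size(2) - T + 1, (1,)).item()
        return v[:, :, s:s + T]

    p_ri, p_fi = D_I(S1(real_video)), D_I(S1(fake_video.detach()))
    p_rv, p_fv = D_V(ST(real_video)), D_V(ST(fake_video.detach()))
    # terms 1 & 2 train the image discriminator, terms 3 & 4 the video discriminator
    return (F.binary_cross_entropy(p_ri, torch.ones_like(p_ri)) +
            F.binary_cross_entropy(p_fi, torch.zeros_like(p_fi)) +
            F.binary_cross_entropy(p_rv, torch.ones_like(p_rv)) +
            F.binary_cross_entropy(p_fv, torch.zeros_like(p_fv)))

The alternating step updates G_I and R_M to maximize the discriminators' error on generated samples.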
We implement this model on the Weizmann Action database.
- We train our model for 40,000 epochs.
- We use BCE loss (binary cross-entropy loss) with a learning rate of 0.0002.
- We test the model by generating videos from a randomly generated set of epsilons and z_C, as sketched below.
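A sketch of that test-time generation, assuming a motion_rnn like the MotionRNN sketch above and a generator module G_I like the one reconstructed from the first summary below; all names and sizes are hypothetical:

import torch

with torch.no_grad():
    z_c = torch.randn(1, 50)                   # z_C: sampled once, fixed for the whole video
    z_m = motion_rnn(1, K=16)                  # z_M^(1..K) driven by random epsilons
    frames = [generator(torch.cat([z_c, z_m[:, k]], dim=1).view(1, -1, 1, 1))
              for k in range(16)]              # G_I decodes one 3x96x96 frame per step
    video = torch.stack(frames, dim=2)         # (1, 3, 16, 96, 96)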
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
ConvTranspose2d-1 [-1, 512, 6, 6] 1,105,920
BatchNorm2d-2 [-1, 512, 6, 6] 1,024
ReLU-3 [-1, 512, 6, 6] 0
ConvTranspose2d-4 [-1, 256, 12, 12] 2,097,152
BatchNorm2d-5 [-1, 256, 12, 12] 512
ReLU-6 [-1, 256, 12, 12] 0
ConvTranspose2d-7 [-1, 128, 24, 24] 524,288
BatchNorm2d-8 [-1, 128, 24, 24] 256
ReLU-9 [-1, 128, 24, 24] 0
ConvTranspose2d-10 [-1, 64, 48, 48] 131,072
BatchNorm2d-11 [-1, 64, 48, 48] 128
ReLU-12 [-1, 64, 48, 48] 0
ConvTranspose2d-13 [-1, 3, 96, 96] 3,072
Tanh-14 [-1, 3, 96, 96] 0
================================================================
Total params: 3,863,424
Trainable params: 3,863,424
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 6.75
Params size (MB): 14.74
Estimated Total Size (MB): 21.49
----------------------------------------------------------------
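The summary above corresponds to the frame generator G_I, a DCGAN-style decoder. Below is a sketch that reproduces these output shapes and parameter counts; the kernel/stride/padding values and bias-free convolutions are inferred from the counts, so treat them as assumptions:

import torch.nn as nn

# G_I: maps a (d_c + d_m) = 60-dim latent, shaped (B, 60, 1, 1), to a 3x96x96 frame
generator = nn.Sequential(
    nn.ConvTranspose2d(60, 512, 6, 1, 0, bias=False),   # -> 512 x 6 x 6
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),  # -> 256 x 12 x 12
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),  # -> 128 x 24 x 24
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),   # -> 64 x 48 x 48
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),     # -> 3 x 96 x 96
    nn.Tanh(),
)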
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 48, 48] 3,072
LeakyReLU-2 [-1, 64, 48, 48] 0
Conv2d-3 [-1, 128, 24, 24] 131,072
LeakyReLU-4 [-1, 128, 24, 24] 0
Conv2d-5 [-1, 256, 12, 12] 524,288
LeakyReLU-6 [-1, 256, 12, 12] 0
Conv2d-7 [-1, 512, 6, 6] 2,097,152
LeakyReLU-8 [-1, 512, 6, 6] 0
Conv2d-9 [-1, 1, 1, 1] 18,432
Sigmoid-10 [-1, 1, 1, 1] 0
================================================================
Total params: 2,774,016
Trainable params: 2,774,016
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.11
Forward/backward pass size (MB): 4.22
Params size (MB): 10.58
Estimated Total Size (MB): 14.91
----------------------------------------------------------------
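This second summary corresponds to the image discriminator D_I. A sketch consistent with the shapes and parameter counts above (hyperparameters again inferred, hence assumptions):

import torch.nn as nn

# D_I: scores a single 3x96x96 frame
image_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1, bias=False),     # -> 64 x 48 x 48
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, 4, 2, 1, bias=False),   # -> 128 x 24 x 24
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, 2, 1, bias=False),  # -> 256 x 12 x 12
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 512, 4, 2, 1, bias=False),  # -> 512 x 6 x 6
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(512, 1, 6, 1, 0, bias=False),    # -> 1 x 1 x 1
    nn.Sigmoid(),
)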
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv3d-1 [-1, 64, 8, 48, 48] 12,288
LeakyReLU-2 [-1, 64, 8, 48, 48] 0
Conv3d-3 [-1, 128, 4, 24, 24] 524,288
BatchNorm3d-4 [-1, 128, 4, 24, 24] 256
LeakyReLU-5 [-1, 128, 4, 24, 24] 0
Conv3d-6 [-1, 256, 2, 12, 12] 2,097,152
BatchNorm3d-7 [-1, 256, 2, 12, 12] 512
LeakyReLU-8 [-1, 256, 2, 12, 12] 0
Conv3d-9 [-1, 512, 1, 6, 6] 8,388,608
BatchNorm3d-10 [-1, 512, 1, 6, 6] 1,024
LeakyReLU-11 [-1, 512, 1, 6, 6] 0
Linear-12 [-1, 1] 18,433
Sigmoid-13 [-1, 1] 0
================================================================
Total params: 11,042,561
Trainable params: 11,042,561
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.69
Forward/backward pass size (MB): 26.86
Params size (MB): 42.12
Estimated Total Size (MB): 70.67
----------------------------------------------------------------
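The third summary corresponds to the video discriminator D_V, which scores clips of T = 16 consecutive frames with 3D convolutions. A sketch consistent with the listed shapes and parameter counts (hyperparameters are inferred assumptions):

import torch.nn as nn

# D_V: scores a clip of T = 16 consecutive frames, shaped (B, 3, 16, 96, 96)
video_discriminator = nn.Sequential(
    nn.Conv3d(3, 64, 4, 2, 1, bias=False),     # -> 64 x 8 x 48 x 48
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv3d(64, 128, 4, 2, 1, bias=False),   # -> 128 x 4 x 24 x 24
    nn.BatchNorm3d(128), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv3d(128, 256, 4, 2, 1, bias=False),  # -> 256 x 2 x 12 x 12
    nn.BatchNorm3d(256), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv3d(256, 512, 4, 2, 1, bias=False),  # -> 512 x 1 x 6 x 6
    nn.BatchNorm3d(512), nn.LeakyReLU(0.2, inplace=True),
    nn.Flatten(),
    nn.Linear(512 * 6 * 6, 1),                 # -> single real/fake score
    nn.Sigmoid(),
)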
Some samples of the generated videos are as follows: