As detailed in the book Advanced Deep Learning with Python: Design and implement advanced next-generation AI solutions using TensorFlow and PyTorch by Ivan Vasilev, the DCGAN builds on the landmark 2014 paper titled Generative Adversarial Nets. The implementation of the paper's algorithm comes from the TensorFlow tutorial titled dcgan.
To learn the generator’s distribution pg over data x, we define a prior on input noise variables pz(z), then represent a mapping to data space as G(z;θg), where G is a differentiable function represented by a multilayer perceptron with parameters θg.1 This is represented by the following code:
# imports assumed by the snippets in this post
from tensorflow import keras
from tensorflow.keras import Sequential

# function that builds the generator
def build_generator(latent_input, weight_initialization, channel):
    model = Sequential(name='generator')
    # first fully connected layer takes in the 1D latent vector/tensor z
    # and outputs a 1D tensor of size 7*7*256 = 12,544
    model.add(keras.layers.Dense(7*7*256, input_shape=(latent_input,)))
    # batch normalization keeps the mean output close to 0 and the output standard deviation close to 1;
    # it stabilizes training after the dense/conv layer and before the activation function
    model.add(keras.layers.BatchNormalization())
    # activation function
    model.add(keras.layers.ReLU())
    # reshape the previous layer into a 3D tensor
    model.add(keras.layers.Reshape((7, 7, 256)))
    # first upsampling (i.e. transposed convolution) layer; the 7x7 feature map is preserved by the stride of 1
    model.add(keras.layers.Conv2DTranspose(filters=128, kernel_size=(5, 5), strides=(1, 1), padding='same', kernel_initializer=weight_initialization))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.ReLU())
    # second upsampling (i.e. transposed convolution) layer; the volume depth is reduced to 64
    # and the feature map is upsampled to 14x14 by the stride of 2
    model.add(keras.layers.Conv2DTranspose(filters=64, kernel_size=(5, 5), strides=(2, 2), padding='same', kernel_initializer=weight_initialization))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.ReLU())
    # third upsampling (i.e. transposed convolution) layer; the volume depth is reduced to 1 and the image is output as 28x28x1
    model.add(keras.layers.Conv2DTranspose(filters=channel, kernel_size=(5, 5), strides=(2, 2), padding='same', activation='tanh'))
    return model
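As a minimal usage sketch (not from the tutorial), the generator can be built and fed random noise as follows; the latent size of 100 and the RandomNormal(stddev=0.02) initializer are assumptions for illustration:

# minimal usage sketch; latent size and initializer are assumptions
import tensorflow as tf
from tensorflow import keras

latent_dim = 100                                                    # assumed size of the noise vector z
weight_init = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)

generator = build_generator(latent_dim, weight_init, channel=1)
z = tf.random.normal([16, latent_dim])                              # a batch of 16 noise vectors
fake_images = generator(z, training=False)                          # shape (16, 28, 28, 1), values in [-1, 1]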
This architecture is represented by the image below, as found in the 2016 paper titled UNSUPERVISED REPRESENTATION LEARNING WITH DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS.2
We also define a second multilayer perceptron D(x;θd) that outputs a single scalar. D(x) represents the probability that x came from the data rather than pg.1 This is represented by the following code:
def build_discriminator(width, height, depth, alpha=0.2):
    model = Sequential(name='discriminator')
    input_shape = (height, width, depth)
    # first layer of the discriminator network; downsamples the image to 14x14 (stride of 2) and
    # increases the depth to 64
    model.add(keras.layers.Conv2D(filters=64, kernel_size=(5, 5), strides=(2, 2), padding='same', input_shape=input_shape))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.LeakyReLU(alpha=alpha))
    # second layer of the discriminator network; downsamples the image to 7x7 and increases the depth to 128
    model.add(keras.layers.Conv2D(filters=128, kernel_size=(5, 5), strides=(2, 2), padding='same'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.LeakyReLU(alpha=alpha))
    # flatten the 3D tensor to a 1D tensor of size 7*7*128 = 6,272
    model.add(keras.layers.Flatten())
    # apply dropout of 30% before feeding it to the dense layer
    model.add(keras.layers.Dropout(0.3))
    # single sigmoid unit that outputs the probability that the input image is real
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    return model
We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 − D(G(z))). In other words, D and G play the following two-player minimax game with value function V(G, D):1

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

However, as the authors of the paper note, this objective does not perform well in practice, since it may not provide sufficient gradients for the generator to actually learn, especially during the early stages of learning when the discriminator is very accurate (i.e., D(G(z)) is close to 0, so log(1 − D(G(z))) saturates, the gradient is nearly 0, and the weights of the generator barely move). So rather than training the generator to minimize log(1 − D(G(z))), training is done to maximize log D(G(z)).1
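Assuming the two models above and a sigmoid-output discriminator (hence from_logits=False), the non-saturating losses can be sketched with Keras's binary cross-entropy as follows; this mirrors the structure of the TensorFlow DCGAN tutorial but is not a verbatim copy:

# minimal sketch of the non-saturating GAN losses, assuming the sigmoid-output discriminator above
import tensorflow as tf
from tensorflow import keras

bce = keras.losses.BinaryCrossentropy(from_logits=False)

def discriminator_loss(real_output, fake_output):
    # D is trained to label real samples as 1 and generated samples as 0
    real_loss = bce(tf.ones_like(real_output), real_output)
    fake_loss = bce(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # instead of minimizing log(1 - D(G(z))), G maximizes log D(G(z)),
    # which is equivalent to minimizing the cross-entropy against a target of 1
    return bce(tf.ones_like(fake_output), fake_output)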
Using a deep convolutional GAN to create fashion clothes from a Gaussian distribution trained on the Fashion-MNIST dataset. Click below on the picture of Daphne to see the video of the transformation from random noise into actual fashion clothes that I think Daphne would include in her wardrobe! 👗 (especially if it is a White Party)
The source code for this example can be found here: DCGAN-MINST Fashion
Using a deep convolutional GAN to create new faces from a Gaussian distribution trained on the Celeb-A Faces dataset. After training for just five epochs, these were the fake faces that were generated (note that some of them look realistic, especially the women with the blond hair, lol):
The source code for this example can be found here: DCGAN-Celeb-A Faces
As detailed in the book Advanced Deep Learning with Python: Design and implement advanced next-generation AI solutions using TensorFlow and PyTorch by Ivan Vasilev, the Pix2Pix paper Image-to-Image Translation with Conditional Adversarial Networks builds upon ideas from the paper Conditional Generative Adversarial Nets, introduced in 2014. The implementation of the Pix2Pix algorithm comes from the TensorFlow tutorial titled pix2pix: Image-to-image translation with a conditional GAN.
Generative adversarial nets can be extended to a conditional model if both the generator and discriminator are conditioned on some extra information y. y could be any kind of auxiliary information, such as class labels or data from other modalities. We can perform the conditioning by feeding y into both the discriminator and generator as an additional input layer. In the generator, the prior input noise pz(z) and y are combined in a joint hidden representation/distribution, and the adversarial training framework allows for considerable flexibility in how this joint hidden representation/distribution is composed.3
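As a hedged illustration (not the Pix2Pix or cGAN paper code), conditioning on a class label y can be sketched in Keras by embedding y and concatenating it with the noise vector z; the latent size, number of classes, and embedding size below are assumptions:

# minimal sketch of conditioning a generator on a class label y (all sizes are assumptions)
from tensorflow import keras

latent_dim, num_classes = 100, 10

# generator side: the noise z and the label y are combined into a joint hidden representation
z_in = keras.Input(shape=(latent_dim,))
y_in = keras.Input(shape=(1,), dtype='int32')
y_embed = keras.layers.Embedding(num_classes, 50)(y_in)            # (batch, 1, 50) learned label embedding
y_flat = keras.layers.Flatten()(y_embed)                           # (batch, 50)
joint = keras.layers.Concatenate()([z_in, y_flat])                 # joint representation of z and y
hidden = keras.layers.Dense(7 * 7 * 256, activation='relu')(joint)
# ... the rest of the generator (reshape + transposed convolutions) would follow as in the DCGAN above
conditional_generator_stub = keras.Model([z_in, y_in], hidden)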
D and G play the following two-player minimax game with the following value function V(G, D):3

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x | y)] + E_{z∼p_z(z)}[log(1 − D(G(z | y)))]
Supervised Pix2Pix is a conditional GAN with an additional loss constraining the generator, which the paper outlines in section 3.1 as an L1 loss rather than the traditional L2 loss, since L1 encourages less blurring:5

L_L1(G) = E_{x,y,z}[||y − G(x, z)||_1]5

The final objective weights this term with λ:

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G)5
The architecture for the generator is described by the paper as the following:
To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net”... Specifically, we add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.5

The architecture for the discriminator is described by the paper as the following: [To motivate the GAN discriminator to only model high-frequency structures in the generated image] it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each N×N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.5
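As a hedged illustration of these two ideas (not the tutorial's exact code), a U-Net skip connection and a PatchGAN-style convolutional output can be sketched in Keras as follows; the filter counts, kernel sizes, and the 256x256x3 input are assumptions:

# minimal sketch of a U-Net skip connection and a PatchGAN-style discriminator head
# (filter counts, kernel sizes, and the 256x256x3 input are assumptions for illustration)
from tensorflow import keras

def tiny_unet(input_shape=(256, 256, 3)):
    inp = keras.Input(shape=input_shape)
    # encoder (downsampling path)
    d1 = keras.layers.Conv2D(64, 4, strides=2, padding='same', activation='relu')(inp)          # 128x128
    d2 = keras.layers.Conv2D(128, 4, strides=2, padding='same', activation='relu')(d1)          # 64x64
    # decoder (upsampling path) with a skip connection: concatenate channels from layer i with layer n - i
    u1 = keras.layers.Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu')(d2)  # 128x128
    u1 = keras.layers.Concatenate()([u1, d1])                                                   # the skip connection
    out = keras.layers.Conv2DTranspose(3, 4, strides=2, padding='same', activation='tanh')(u1)  # 256x256
    return keras.Model(inp, out)

def tiny_patchgan(input_shape=(256, 256, 3)):
    inp = keras.Input(shape=input_shape)
    x = keras.layers.Conv2D(64, 4, strides=2, padding='same')(inp)
    x = keras.layers.LeakyReLU(0.2)(x)
    x = keras.layers.Conv2D(128, 4, strides=2, padding='same')(x)
    x = keras.layers.LeakyReLU(0.2)(x)
    # 1-channel convolutional output: each spatial position scores one NxN receptive-field patch
    patch_scores = keras.layers.Conv2D(1, 4, padding='same')(x)     # shape (64, 64, 1), one score per patch
    return keras.Model(inp, patch_scores)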
The objective of this task is to transform a set of real-world images from the Cityscapes dataset6 into semantic segmentations. The dataset contains 5,000 finely annotated images, of which the training and validation splits (2,975 and 500 images, respectively) were used. The dense annotations cover 30 common classes such as road, person, and car, as detailed in the following figure:7
Pix2Pix Model: The Pix2Pix generator was trained for 25,000 steps and used a lambda value of 1000 for the L1 loss term. Since the L1 loss regularizes the generator to output predicted images that are plausible translations of the source image, I decided to weight it one order of magnitude higher than in 5, which seemed to help, especially when it came to segmenting riders (a sketch of this weighted loss is shown after the notebook links below). The following five test results were output, detailing some of the predictions of the semantic segmentation generator, i.e. the predicted images:8
The Colab notebook for this example can be found here: Pix2Pix. The segmentation generator model weights can be found here: Segmentation Generator Model
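As referenced above, here is a minimal sketch of the generator loss with the weighted L1 term, following the structure of the TensorFlow pix2pix tutorial; LAMBDA = 1000 reflects the weighting described earlier, and from_logits=True is an assumption about the discriminator outputting raw logits:

# minimal sketch of the Pix2Pix generator loss: adversarial term + LAMBDA * L1 term
import tensorflow as tf
from tensorflow import keras

LAMBDA = 1000  # weighting described above
bce_logits = keras.losses.BinaryCrossentropy(from_logits=True)  # assumes a logits-output discriminator

def pix2pix_generator_loss(disc_generated_output, gen_output, target):
    # adversarial term: the generator wants the discriminator to label its output as real (1)
    gan_loss = bce_logits(tf.ones_like(disc_generated_output), disc_generated_output)
    # L1 term: pixel-wise mean absolute error between the generated and target images
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    total_loss = gan_loss + LAMBDA * l1_loss
    return total_loss, gan_loss, l1_loss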
Unlike Pix2Pix, in which paired training data is required (i.e., both input and target pairs are needed), CycleGAN works on unpaired data, i.e., no information is provided as to which input matches which target.9
Adversarial loss
Our objective contains two types of terms: adversarial losses for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.
We apply adversarial losses to both mapping functions. For the mapping function G : X → Y and its discriminator D_Y, we express the objective as:

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]

where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y. G aims to minimize this objective against an adversary D that tries to maximize it, i.e., min_G max_{D_Y} L_GAN(G, D_Y, X, Y). We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator D_X as well: i.e., min_F max_{D_X} L_GAN(F, D_X, Y, X).
Cycle loss
Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as the target domains Y and X, respectively. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input xi to a desired output yi. To reduce the space of possible mapping functions, a constraint is therefore introduced in which the mapping functions should be cycle-consistent in the forward direction, i.e. x → G(x) → F(G(x)) ≈ x, and in the backward direction, y → F(y) → G(F(y)) ≈ y.
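A minimal sketch of the cycle-consistency term, assuming the cycled images F(G(x)) and G(F(y)) have already been computed; the weight of 10 is a commonly used value and is an assumption here:

# minimal sketch of the cycle-consistency loss: x -> G(x) -> F(G(x)) should recover x,
# and y -> F(y) -> G(F(y)) should recover y (the weight is an assumption)
import tensorflow as tf

def cycle_consistency_loss(real_x, cycled_x, real_y, cycled_y, lambda_cycle=10.0):
    forward_loss = tf.reduce_mean(tf.abs(real_x - cycled_x))    # ||F(G(x)) - x||_1
    backward_loss = tf.reduce_mean(tf.abs(real_y - cycled_y))   # ||G(F(y)) - y||_1
    return lambda_cycle * (forward_loss + backward_loss)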
Identity loss
Furthermore, for mapping paintings to photos (and thus also, photos to paintings), we find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output. In particular, we adopt the technique of Taigman et al. and regularize the generator to be near an identity mapping when real samples of the target domain are provided as the input to the generator:

L_identity(G, F) = E_{y∼p_data(y)}[||G(y) − y||_1] + E_{x∼p_data(x)}[||F(x) − x||_1]

Without L_identity, the generators G and F are free to change the tint of input images when there is no need to. For example, when learning the mapping between Monet’s paintings and Flickr photographs, the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be equally valid under the adversarial loss and cycle consistency loss.
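Similarly, a minimal sketch of the identity term; the weight of 5 is an assumption:

# minimal sketch of the identity loss: when G is fed a real sample from its own target
# domain, it should return it (nearly) unchanged (the weight is an assumption)
import tensorflow as tf

def identity_loss(real_image, same_image, lambda_identity=5.0):
    # same_image = G(real_y) or F(real_x); penalize any change from the input
    return lambda_identity * tf.reduce_mean(tf.abs(real_image - same_image))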
As in the Pix2Pix example, the objective of this task is to transform a set of real-world images from the Cityscapes dataset6 into semantic segmentations. The dataset contains 5,000 finely annotated images, of which the training and validation splits (2,975 and 500 images, respectively) were used. The dense annotations cover 30 common classes such as road, person, and car, as detailed in the following figure:7
CycleGAN Model: After reading and implementing the TensorFlow tutorial on CycleGAN, I decided to implement CycleGAN with a ResNet backbone as was done in 9. The implementation basically follows Jason Brownlee's implementation in his article: How to Implement CycleGAN Models From Scratch With Keras. The image buffer portion was taken from Xiaowei-hu and can be found here: CycleGAN (I could have used Jason's image buffer implementation, but I was working with EagerTensors at the time, rather than NumPy arrays, and I am lazy lol). At first I wanted to follow 9 and train for 200 epochs; however, I did not realize how intensive the training would be for this model. So, instead, I trained the model for 50 epochs using Adam as the optimizer with a learning rate of 0.0002 and 0.5 for the exponential decay rate of the first moment estimate (beta_1). The following five test images were generated for this portion of the training:
The following image shows the translations from photos to segmentations and vice versa:
And to see which of the two discriminators was fooled less on the image translation task, one 16x16 output patch was generated, in which values closer to one meant that the discriminator was being fooled, while values closer to zero meant that the discriminator was not being fooled by the generator:
Then I trained the model for another 50 epochs using stochastic gradient descent with the same learning rate, but with linear rate decay in which the learning rate was decayed over the number of epochs (i.e., 50); a sketch of such a schedule is shown after the notebook link below. The following five test images were generated:
The following image shows the translations from photos to segmentations and vice versa:
And to see which of the two discriminators was fooled less on the image translation task, one 16x16 output patch was generated, in which values closer to one meant that the discriminator was being fooled, while values closer to zero meant that the discriminator was not being fooled by the generator:
The Colab notebook for this example can be found here: CycleGAN
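As referenced above, the linear learning-rate decay used in the second training phase can be sketched as a custom Keras schedule; the steps_per_epoch value is a placeholder assumption, not the value actually used:

# hypothetical sketch of a linear learning-rate decay schedule (not the exact code used):
# the rate decays linearly from initial_lr to zero over decay_epochs epochs
import tensorflow as tf

class LinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_lr=2e-4, steps_per_epoch=1000, decay_epochs=50):
        super().__init__()
        self.initial_lr = initial_lr
        self.total_steps = steps_per_epoch * decay_epochs   # steps_per_epoch is a placeholder assumption

    def __call__(self, step):
        # fraction of training remaining, clipped at zero
        remaining = 1.0 - tf.cast(step, tf.float32) / float(self.total_steps)
        return self.initial_lr * tf.maximum(remaining, 0.0)

# usage sketch: optimizer = tf.keras.optimizers.SGD(learning_rate=LinearDecay())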
Neural style transfer algorithms use Convolutional Neural Networks (CNNs) to perform content reconstructions and style reconstructions, the latter by computing correlations between the features in the different layers of the CNN.10 As 10 states, the VGG-19 network was used, but only its 16 convolutional and 5 pooling layers; none of the fully connected layers were used.10 Furthermore, the max pooling operations were replaced by average pooling operations, since the authors found that this improved the gradient flow, leading to better results.10 Shown in the figure below is the complete VGG-19 network:
Content loss
Style loss
Total loss
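Since the loss equations above are shown as figures, here is a minimal sketch of how the content loss, the Gram-matrix style loss, and the weighted total loss are commonly written in TensorFlow; the single-layer formulation and the alpha/beta values (matching the weighting discussed below) are simplifying assumptions:

# minimal sketch of the neural style transfer losses (content, Gram-matrix style, total);
# single-layer formulation and the alpha/beta weights are assumptions for illustration
import tensorflow as tf

def gram_matrix(features):
    # features: (batch, height, width, channels) -> channel-by-channel correlations
    gram = tf.einsum('bijc,bijd->bcd', features, features)
    num_locations = tf.cast(tf.shape(features)[1] * tf.shape(features)[2], tf.float32)
    return gram / num_locations

def content_loss(content_features, generated_features):
    # squared error between feature maps of the content image and the generated image
    return tf.reduce_mean(tf.square(generated_features - content_features))

def style_loss(style_features, generated_features):
    # squared error between the Gram matrices of the style image and the generated image
    return tf.reduce_mean(tf.square(gram_matrix(generated_features) - gram_matrix(style_features)))

def total_loss(c_feats, s_feats, g_c_feats, g_s_feats, alpha=1000.0, beta=1.0):
    # weighted sum: alpha * content + beta * style (weights follow the ratio discussed below)
    return alpha * content_loss(c_feats, g_c_feats) + beta * style_loss(s_feats, g_s_feats)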
After reading and implementing the TensorFlow tutorial on Neural style transfer, my implementation basically followed the tutorial and the original paper. Instead of using the VGG-19 model from the tutorial, I used the VGG-19 model from the paper; namely, instead of using max pooling, I used average pooling as stated in 10. I also used different weighting than the tutorial and followed the paper for the ratio amounts; namely, I used 1 for beta and 1000 for alpha, which gave a ratio of 1×10⁻³ (β/α). At first I tried to use L-BFGS as the optimization method, but it was a lot harder than expected, since both TensorFlow's implementation lbfgs_minimize and SciPy's implementation fmin_l_bfgs_b rely on the data being one-dimensional! So I just used Adam, a first-order method, rather than L-BFGS, a quasi-Newton second-order method, and trained for 10 epochs. This was the content image I used:
These were the style images I used: the first is by Pierre-Auguste Renoir, titled Portrait of Claude Monet, and the second is Ip Man from Tekken 7:
These were the resulting images generated by the model after training:
The Colab notebook for this example can be found here: NeuralStyleTransfer
Footnotes
- UNSUPERVISED REPRESENTATION LEARNING WITH DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS
- Advanced Deep Learning with Python: Design and implement advanced next-generation AI solutions using TensorFlow and PyTorch
- pix2pix: Image-to-image translation with a conditional GAN
- Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding. In: CVPR (2016)
- Pretty good results; shows that increasing the L1 loss term provides significant improvements, especially when it comes to identifying pedestrians
- A Neural Algorithm of Artistic Style