Need help with Keras neural doodle - preliminary results for improvements #3705

Closed
dolaameng opened this issue Sep 6, 2016 · 7 comments

Comments

@dolaameng
Contributor

I was trying to modify the neural_style_transfer example into a neural doodle implementation.

There are original Torch implementations here and here. My Keras implementation can be found here.

Some results on the Monet and Renoir examples can be found below.

monet

The top row shows the inputs to the algorithm, and the bottom row shows the doodle results from the original Torch implementation vs. the Keras implementation (60 iterations). There are some differences, e.g., in the surface of the water. This is probably because I used a very heavy weight for total_variation_loss (5000!); otherwise there seems to be some salt-and-pepper noise in the generated image.

renoir

Slight differences between the original Torch and Keras implementations are also observed in the Renoir example, where a content image is used to help generation, in addition to the style image and doodle (target_mask). But both doodle results are better than the results from the neural_style_transfer example, e.g., in drawing the clean sky.

My implementation uses VGG19 for images and a series of AveragePooling layers for masks. The image is generated by minimizing a combination of content_loss, style_loss, and total_variation_loss. It's very similar to the Torch versions, but with some differences based on my understanding. I am still struggling with reading the Torch code...
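To make the mask handling concrete, here is a minimal NumPy sketch of one way to read the description above: the doodle mask is average-pooled down to the resolution of a feature map, and the Gram matrix is then computed only over the masked region. This is not the code from the linked implementation. The function names, the 2x2 pooling window, and the normalisation by region area are all assumptions made for illustration.

```python
import numpy as np

def avg_pool_2x2(mask):
    """Downsample a (H, W) mask by 2x2 average pooling (H and W assumed even)."""
    h, w = mask.shape
    return mask.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def masked_gram(features, mask):
    """Gram matrix of (C, H, W) features restricted to one mask region.

    Dividing by the region area is an assumption, not taken from the thread.
    """
    c = features.shape[0]
    masked = features * mask[None, :, :]      # zero out pixels outside the region
    flat = masked.reshape(c, -1)              # (C, H*W)
    return flat @ flat.T / max(mask.sum(), 1e-8)

# Toy example: a 4x4 mask whose left half belongs to the region.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
small_mask = avg_pool_2x2(mask)               # shape (2, 2)
features = np.ones((3, 2, 2))                 # stand-in for a 3-channel feature map
gram = masked_gram(features, small_mask)      # shape (3, 3)
```

In a real pipeline the pooling would be repeated once per pooling stage of the network so that each mask matches the feature map it is applied to.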

I would appreciate it if someone could help read the code, explain the differences from the Torch versions, and suggest potential improvements. Thanks!

@fchollet
Collaborator

fchollet commented Sep 6, 2016

Looks very cool. I don't have much in the way of practical suggestions, but maybe modulating the contributions of the 3 losses with a weighted average could help achieve different results?
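As a rough illustration of that suggestion, the three loss terms could be combined as a weighted sum, optionally divided by the sum of the weights to turn it into a weighted average. The function name and default weights below are hypothetical (0.025 is the content weight mentioned later in this thread); nothing here is taken from the actual example script.

```python
def combine_losses(content_loss, style_loss, tv_loss,
                   content_weight=0.025, style_weight=1.0, tv_weight=1.0,
                   normalize=False):
    """Weighted sum of the three terms; a weighted average if normalize=True."""
    weights = (content_weight, style_weight, tv_weight)
    terms = (content_loss, style_loss, tv_loss)
    total = sum(w * t for w, t in zip(weights, terms))
    if normalize:
        total /= sum(weights)  # assumes the weights do not sum to zero
    return total

# Same terms, with and without normalisation by the sum of the weights.
weighted_sum = combine_losses(1.0, 2.0, 3.0, 1.0, 1.0, 2.0)
weighted_avg = combine_losses(1.0, 2.0, 3.0, 1.0, 1.0, 2.0, normalize=True)
```

Normalising only rescales the objective, so it changes the effective step size of the optimiser but not which image minimises the loss.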

As for the differences with the Torch implementation: results look pretty similar to me. Maybe simply changing the range or distribution of the random inputs you start with would fix it.

@titu1994
Contributor

titu1994 commented Sep 6, 2016

@dolaameng Really cool implementation!

I made a few modifications to your code (a few improvements from the paper Improving the Neural Algorithm of Artistic Style), and made a few changes here and there. The results seem slightly better, although more iterations will definitely produce better results.

A few of the improvements are:

  • Using all layers of VGG 16 for style inference (I used VGG 16 since I have the weights downloaded; the same can be done with VGG 19)
  • Shifting the activations before computing the Gram matrix
  • Geometric weight scaling of style layers
  • Using Conv5_2 layer for content inference (block5_conv2)
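A minimal NumPy sketch of two of these items, the activation shift and the geometric layer weights, might look like the following. The shift value of -1, the ratio of 2, and the choice to give deeper layers smaller style weights are assumptions for illustration, not values confirmed in this thread, and the function names are hypothetical.

```python
import numpy as np

def shifted_gram(features, shift=-1.0):
    """Gram matrix of (C, H, W) features after adding a constant shift."""
    flat = features.reshape(features.shape[0], -1) + shift
    return flat @ flat.T

def geometric_layer_weights(num_layers, ratio=2.0):
    """Style weights that shrink by `ratio` per layer, normalised to sum to 1."""
    raw = np.array([ratio ** (-i) for i in range(num_layers)])
    return raw / raw.sum()

features = np.ones((2, 1, 1))
plain = shifted_gram(features, shift=0.0)    # ordinary Gram matrix
shifted = shifted_gram(features)             # activations moved by -1 first
weights = geometric_layer_weights(3)         # proportional to 1, 1/2, 1/4
```

The per-layer style losses would then be multiplied by these weights before being summed into the total style loss.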

The generated images are a tad sharper. Also, using such a large TV regularization weight is fine, since you can apply a sharpening filter with imfilter afterwards if needed. The image below was produced with 100 epochs and no sharpening filter applied.
monet

When using guided style transfer on the Renoir, initialize the image with the content image itself rather than with noise. The output is far sharper, although the content weight must be adjusted for this initialization (I went with 0.1 for the content weight). Note that this image is after 60 iterations, and the loss was still improving by roughly 3-4% every epoch. A larger number of iterations would produce better images.
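The two initialisation strategies can be sketched as follows. This is only an illustration: `initial_image`, its parameters, and the default noise range are hypothetical and are not taken from either script.

```python
import numpy as np

def initial_image(content, mode="content", noise_low=0.0, noise_high=255.0, seed=0):
    """Return the starting point for optimisation.

    mode="content": start from a copy of the content image.
    mode="noise":   start from uniform noise in [noise_low, noise_high).
    """
    if mode == "content":
        return content.astype(np.float64).copy()
    if mode == "noise":
        rng = np.random.default_rng(seed)
        return rng.uniform(noise_low, noise_high, size=content.shape)
    raise ValueError("mode must be 'content' or 'noise'")

content = np.zeros((4, 4, 3))
from_content = initial_image(content)
from_noise = initial_image(content, mode="noise", noise_low=-1.0, noise_high=1.0)
```

Exposing the noise range as a parameter also makes it easy to try different ranges for the random starting image.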

I found that using a high TV regularization weight when a content image is provided is detrimental to the final output. Going in the opposite direction and using a TV weight of 8.5e-5 produced the best result, both in terms of how visually pleasing the output image is and in the absolute loss value. This TV weight was found via cross validation on the original neural style script, but it works here as well. Link to the discussion
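For reference, a simplified total variation term can be written in NumPy as the sum of squared differences between neighbouring pixels. This is a sketch under that assumption and may differ from the exact formula either script uses; only the 8.5e-5 weight comes from the comment above.

```python
import numpy as np

def total_variation(image):
    """Sum of squared differences between neighbouring pixels of an (H, W, C) image."""
    dh = image[1:, :, :] - image[:-1, :, :]   # vertical neighbours
    dw = image[:, 1:, :] - image[:, :-1, :]   # horizontal neighbours
    return float((dh ** 2).sum() + (dw ** 2).sum())

rng = np.random.default_rng(0)
smooth = np.full((8, 8, 3), 0.5)                        # perfectly flat image
noisy = smooth + 0.1 * rng.standard_normal((8, 8, 3))   # same image with noise

tv_weight = 8.5e-5
tv_smooth = tv_weight * total_variation(smooth)
tv_noisy = tv_weight * total_variation(noisy)
```

A flat image has zero total variation while a noisy one does not, which is why a larger weight suppresses salt-and-pepper noise at the cost of sharpness.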

renoit

@dolaameng
Contributor Author

dolaameng commented Sep 7, 2016

@titu1994: Thanks for your comments! (But I couldn't find the link to your modified code. I would appreciate it if you could re-share it.) I have tried some of your suggestions:

  1. Using more layers for the style loss is better at creating details, such as the "boundaries" of the tree and rivers, as shown in your generated images. In fact, results generated using more style layers are more similar to the original results by Dmitry Ulyanov. The downside is the increased running time (up to 1.5x in my experiments).
  2. Using a very small TV weight is also consistent with the comment in the Torch implementation. I am not sure the cross validation discussed here is valid, since using a smaller TV weight will always bring down the total loss, right? In my observations, the actual scale of the TV loss is more related to "sharpness vs. noise". And as you said, the more significant factor is probably the number of iterations.
  3. I have also tried using 'block5_conv2' for content. I didn't see significant differences, probably because I was using a small content weight (0.025).
  4. The puzzling part to me is how the style loss is defined in the original Torch implementation. Sometimes it seems to use an l1 norm, sometimes an l2 norm. I don't have the knowledge to make a judgement. Any comments?

@fchollet: I don't think I have more ideas to try for this example for now, unless more comments come in later. Do you think we can create a PR based on this version? I will tidy up the code and documentation.

@titu1994
Contributor

titu1994 commented Sep 7, 2016

@dolaameng As to point 4, I believe that most implementations of style loss use MSE (l2 norm). All of the papers I have read on style transfer use the MSE between the Gram matrix of the style features and the Gram matrix of the output at the jth layer. So I feel that MSE should be used.
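To make the l1 vs. l2 question concrete, here is a small NumPy sketch of a per-layer style loss computed both ways. MSE is the mean of the squared differences (the squared l2 distance up to a scale factor), while the l1 variant takes the mean of the absolute differences. The function names and the lack of any further normalisation are assumptions for illustration, not the definition used by either implementation.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (C, H, W) feature map."""
    flat = features.reshape(features.shape[0], -1)
    return flat @ flat.T

def style_loss(style_feats, generated_feats, norm="l2"):
    """Mean squared (l2) or mean absolute (l1) difference between Gram matrices."""
    diff = gram(style_feats) - gram(generated_feats)
    if norm == "l2":
        return float(np.mean(diff ** 2))
    if norm == "l1":
        return float(np.mean(np.abs(diff)))
    raise ValueError("norm must be 'l1' or 'l2'")

style_feats = np.ones((2, 2, 2))
generated_feats = np.zeros((2, 2, 2))
loss_l2 = style_loss(style_feats, generated_feats, norm="l2")
loss_l1 = style_loss(style_feats, generated_feats, norm="l1")
```

The squared version penalises large Gram-matrix mismatches much more heavily than the absolute version, so the two can lead the optimiser to different trade-offs.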

As to points 1 and 3, using more layers always comes at the cost of additional time, sadly. I too saw a 1.4x increase in running time. As you noticed, using conv5_2 is not useful without a stronger content weight; that's why I went with 4x the normal content weight (0.1).

As to the TV loss comment, I meant that when I tried other TV weights (in the range 1e-3 to 1e-8), I found that 8.5e-5 produced the smallest absolute loss value after 100 epochs, similar to the discussion. This was, however, done on the original neural style transfer script, and then later reproduced on my modified version of that script. I detailed some of these test results in my guide for the script here (see the "Tips for Total Variation Regularization" section). The results were tested on various images using a large variety of styles, so I think they should still apply to this script.
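A sweep like this could be organised roughly as below, recording the content-plus-style loss separately from the TV term so that weights are not compared on a total that shrinks simply because the TV weight does. Everything here is a hypothetical sketch: `sweep_tv_weights` is not part of any script, and `toy_run` is a placeholder with made-up numbers standing in for a full optimisation run.

```python
import numpy as np

def sweep_tv_weights(run_fn, low_exp=-8, high_exp=-3, num=6):
    """Call run_fn(tv_weight) for log-spaced weights and collect the loss terms.

    run_fn must return a dict with "content", "style" and "tv" (unweighted) losses.
    Returns a list of (tv_weight, content + style, tv) tuples.
    """
    results = []
    for w in np.logspace(low_exp, high_exp, num):
        losses = run_fn(float(w))
        results.append((float(w), losses["content"] + losses["style"], losses["tv"]))
    return results

def toy_run(tv_weight):
    # Placeholder for a real style-transfer run; these numbers are made up.
    return {"content": 1.0, "style": 1.0 + tv_weight, "tv": 10.0}

results = sweep_tv_weights(toy_run)
best = min(results, key=lambda r: r[1])   # compare on content + style only
```

With a real `run_fn`, each entry would come from a complete run at a fixed number of iterations, so the sweep is expensive and a coarse grid is a sensible starting point.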

@dolaameng
Contributor Author

Thanks @titu1994 will read it up!

@dolaameng
Contributor Author

Made the code compatible with both 'th' and 'tf' image_dim_ordering. PR #3724 created.

@bmaltais

bmaltais commented Sep 9, 2016

@titu1994 @dolaameng Regarding my original TVloss observations, I can say that reducing the TVloss value to something really small (like 0.00001) will usually increase the style loss compared with 0.000085. So reducing TVloss will not always result in less style loss... but this is not always the case.

I noticed that when attempting "super resolution" (using the content image as the style image, applied over the same content image at twice the resolution), a TVloss of 0 produced better results than the typical 0.000085.

Food for thought.
