A Neural Algorithm of Artistic Style
E4040.2016Fall.NAAH.report
Francis Marcogliese fam2148, Richard Godden rg3047, Yogesh Garg yg2482
Columbia University
Abstract
Several neural algorithms have been implemented to extract style information from one image and content information from another to generate a new image with both elements. This report aims to recreate one of these networks to produce eye-catching results, as it is a novel use of CNNs to separate elements such as content and style from images. Implementing the loss function for this project under the given hardware limitations required specific effort to process the CNN's outputs at various layers, which was solved with tools from the Theano library. Several pictures combining style from paintings with photo content were generated, with visually appealing results.
1. Introduction
Deep convolutional neural networks are especially adept at feature extraction from image content [2]. Being biologically inspired, some networks even approach human accuracy on certain tasks, such as facial recognition [3]. One such network, VGG, was among the top-performing networks in the ImageNet challenge 2014 [4]. VGG-19, one of the networks used by the VGG team, is therefore a useful tool for feature extraction from many types of images. By examining the outputs of the VGG-19 network at various layers, different aspects of the input can be discovered - for example, certain layers carry information that is more tightly associated with style, and others more tightly associated with content [5].
As shown by Gatys et al. [5], this information can be used within a custom loss function, starting from a white noise image, to combine the content of one image with the style of another. As there is no strict definition of what the human brain distinguishes as content or style in an image, the main challenge is deciding which convolutional layers should be chosen for the "style" and which for the "content". Similarly, when combining these data in the loss function, it is difficult to choose how to weight each layer, as the quality of the result is highly dependent on the choice of inputs. From an implementation perspective, it was challenging to correctly define the gradient of a more complex loss function, linking the whole computational graph together through different classes. This was solved through careful use of the built-in Theano functions and logic. Additionally, training VGG-19 proved non-trivial due to memory constraints on AWS.
2. Summary of the Original Paper
2.1 Methodology of the Original Paper
The following section describes how the data in the original paper was processed to transfer style from one image to the content of another.
Figure 1: Graph depicting the structure of the network that combines style and content into a new image. Adapted from [7]
Figure 1 summarizes succinctly how the original paper [5] processes the input images to produce a new result. The authors identify five principal layers that carry the style information and one layer that provides the content information. The style image and the new image, which can start off as white noise, are fed forward through VGG-19 and five activations are kept. The Gram matrix of each activation is calculated for each layer, and the weighted mean squared error between the style image's and the new image's Gram matrices gives the style loss. Similarly, the squared difference between the feature activations at the content layer gives the content loss. The weighted sum of the two is the total loss, which is minimized to give a visually interesting generated image. The authors also used average pooling instead of max pooling, claiming that it offered visually superior results when doing the reconstructions. The Caffe framework was used to obtain the results described in the original paper, and the VGG-19 network was initialized with the weights released by its developers rather than being retrained on ImageNet.
2.2 Key Results of the Original Paper
The first key result of this paper is that style and content can be extracted separately from different layers of a deep convolutional neural network. Higher layers (deeper into the network) capture high-level content [5], such as broad features or the big swirls of The Starry Night, whereas the more detailed brush strokes are found in lower layers. This is why a higher layer (conv4_2) is used to extract content information. This is shown in Figure 2, taken from the original paper [5].
Figure 2: This shows how content and style can be found in the outputs of different layers of the network. Deeper layers show more content. From [5]
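To make the content representation concrete, the following is a minimal NumPy sketch of the content loss at one layer, following [5]; the names F and P are illustrative and stand for the feature maps of the generated image and of the content image at conv4_2.

```python
import numpy as np

def content_loss(F, P):
    """Squared-error content loss between the feature maps of the
    generated image (F) and of the content image (P) at one layer."""
    return 0.5 * np.sum((F - P) ** 2)

# Toy feature maps standing in for real conv4_2 activations
F = np.random.rand(512, 32, 32)
P = np.random.rand(512, 32, 32)
print(content_loss(F, P))
```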
For style representations, the authors describe a method to capture correlations between different style elements in the input image, for example the proximity between certain brush strokes, lines, or colours [5][6]. This is done by taking the Gram matrix of the activation at a given layer, as sketched below.
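As a rough sketch (assuming activations shaped channels x height x width; the variable names are ours, not the project's), the Gram matrix can be computed as follows:

```python
import numpy as np

def gram_matrix(activation):
    """Gram matrix of one layer's activation: entry (i, j) is the
    correlation between the responses of filters i and j over all
    spatial positions."""
    channels = activation.shape[0]
    F = activation.reshape(channels, -1)  # flatten each feature map
    return F.dot(F.T)                     # (channels, channels)

# Example with a toy activation volume
A = np.random.rand(64, 56, 56)
G = gram_matrix(A)
print(G.shape)  # (64, 64)
```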
Computing these correlations has a certain consistency with biological systems, as complex cells are thought to compute similar correlations between stylistic features, distinct from content - which is possibly what the human primary visual cortex does.
Additionally, colours are sourced most heavily from the style image, as are the local structures, and these features are organized in a way that is consistent with the content of the content input image.
3. Methodology
This section presents the methodology used to design the architecture for the code presented in this paper [1]. The differences between this implementation and the one in the original paper are also presented.
In an effort to recreate the results presented in the original paper by Gatys et al., a very similar methodology was used: the same gradient structure and loss function as described in Figure 1. An important difference is that the Theano library was used, as a project requirement, in place of the Caffe framework of the original work. VGG-19 was implemented from scratch so that its computational graph could supply the activations needed for the style transfer.
In addition, this version offers a choice of three optimizers - gradient descent, Adam, and L-BFGS - to compare the varying results. The original paper uses solely L-BFGS.
The original paper used normalized weights in VGG-19 so that the mean activation at the desired layers would equal 1. This paper does not do this, due to coding complexity and time limitations.
Finally, the authors slightly modified VGG-19 by changing the five pooling layers from max pooling to average pooling. This significantly slowed down computation and, despite the authors' claim that it improved results, was excluded here for efficiency reasons; an illustrative one-line comparison is shown below.
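For reference, the swap between max and average pooling is a one-argument change in Theano. A hedged sketch follows (conv_out is a placeholder symbolic tensor, and the window-size keyword is ds in older Theano releases and ws in newer ones):

```python
import theano.tensor as T
from theano.tensor.signal.pool import pool_2d

conv_out = T.tensor4('conv_out')  # placeholder for a conv layer's output

# Max pooling, kept in this project for speed
pooled_max = pool_2d(conv_out, ws=(2, 2), ignore_border=True, mode='max')

# Average pooling, as used by Gatys et al. [5]
pooled_avg = pool_2d(conv_out, ws=(2, 2), ignore_border=True,
                     mode='average_exc_pad')
```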
3.1. Objectives and Technical Challenges
The principal objective of this project is to recreate the style transfer network shown in the original paper and produce visually appealing images with style from one input and content from another.
One of the greatest challenges in this project was handling the numerous memory errors thrown when doing style transfers on high-resolution images (more than 1000x1000 pixels). The AWS instances used (g2.2xlarge and g2.8xlarge) were largely insufficient in terms of memory, raising errors whenever more than 4 GB of memory was in use at any given time. This was solved by instantiating the minimum number of VGG-19 objects, minimizing the memory held by them.
Processing power was also a limitation, as processing high-resolution images could take several hours when using average pooling. This constrained some of the methodology choices so that each iteration would complete in an acceptable time.
Another important challenge was finding the parameters that yield visually pleasing images; since no concrete metric of style and content currently exists, visual appeal was the measure of success for a style transfer. Three parameters direct the image reconstructions: alpha, the weight of the content loss in the total loss; beta, the weight of the style loss in the total loss; and w_l, the individual weight of each style activation in the style loss. These parameters balance how closely the generated image matches the content or the style, and fine-tuning them for general cases proved to be a trial-and-error process, limiting the number of combinations that could be tested efficiently.
Finally, when recreating a personal version of VGG-19, a number of attempts were made to train it from scratch in order to perform image reconstructions with our own weights. However, training VGG-19 requires a dataset that is not freely accessible to the general public, as well as a great deal of computation to approach the quality of the weights provided by the network's creators. Those pretrained weights were therefore used.
3.2. Problem Formulation and Design
As the main objective is producing images with style from one input and content from another, the problem is how to extract the relevant data from the inputs and how to use these data to obtain a new, generated image.
The tool used to extract the interesting features from images is a deep convolutional neural network, VGG-19. The numerous convolutions and pooling steps extract different features at different layers of the network.
These features are combined according to the following loss function, from Gatys et al. [5]:

L_total(p, a, x) = alpha * L_content(p, x) + beta * L_style(a, x)

The content loss at the content layer l is

L_content(p, x) = (1/2) * sum_{i,j} (F^l_{ij} - P^l_{ij})^2

where F^l and P^l are the feature activations of the generated image x and of the content image p at layer l. The style loss is

L_style(a, x) = sum_l w_l * E_l, with E_l = 1 / (4 * N_l^2 * M_l^2) * sum_{i,j} (G^l_{ij} - A^l_{ij})^2

where G^l and A^l are the Gram matrices of the activations of the generated image and of the style image a at layer l, N_l is the number of feature maps at that layer, and M_l is their spatial size.
To minimize the loss as a function of the new image, Theano's grad tool is used to obtain the gradient of the total loss with respect to the image fed into the convolutional neural network. This is what drives the generation of a new image from the losses computed through VGG-19.
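A minimal, self-contained sketch of this pattern follows; a toy quadratic objective stands in for the real VGG-19 style and content losses, and the variable names are illustrative rather than the exact ones in the project code.

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic generated image, shape (1, 3, H, W)
generated = T.tensor4('generated')
target = T.tensor4('target')  # stand-in for the real objective

# In the project, total_loss is the weighted sum of the content and
# style losses built from the VGG-19 activations of `generated`
total_loss = T.sum((generated - target) ** 2)

# Theano differentiates the whole computational graph symbolically
grad = T.grad(total_loss, wrt=generated)

# Compiled callables, later wrapped for the optimizer
f_loss = theano.function([generated, target], total_loss)
f_grad = theano.function([generated, target], grad)

x = np.random.rand(1, 3, 64, 64).astype(theano.config.floatX)
y = np.random.rand(1, 3, 64, 64).astype(theano.config.floatX)
print(f_loss(x, y), f_grad(x, y).shape)
```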
4. Implementation
This section describes the neural network used to extract features, how those features were used to compute a loss function, and how the loss function was used to produce new images.
4.1. Deep Learning Network
VGG-19 consists of the following layers:
- conv1_1 - conv1_2 - pool1
- conv2_1 - conv2_2 - pool2
- conv3_1 - conv3_2 - conv3_3 - conv3_4 - pool3
- conv4_1 - conv4_2 - conv4_3 - conv4_4 - pool4
- conv5_1 - conv5_2 - conv5_3 - conv5_4 - pool5
- fc6 - drop6 - fc7 - drop7 - fc8 - prob
Table 1: Layers in VGG-19
To create these layers, we implemented a set of layer classes based on the tutorials and homework assignments from class.
The outputs were connected using the standard pooling and dropout functions provided by the Theano library. The final computation graph and the functions to train it are made available as a VGG_19 class, allowing modularity and reuse while experimenting in the following sections. Special care was taken to ensure that all models and layers can be parameterized for different image sizes and activation functions. The VGG_19 class can also take in weights read from .mat files and pass them to the individual layers.
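As an illustration of this parameterization (the class and variable names here are simplified stand-ins, not the exact project classes), a convolutional layer that accepts externally supplied weights might look like this:

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

class ConvLayer(object):
    """Convolutional layer with ReLU, taking pretrained weights W and b."""

    def __init__(self, input, W, b):
        # W: (n_filters, n_channels, 3, 3), b: (n_filters,)
        self.W = theano.shared(W.astype(theano.config.floatX))
        self.b = theano.shared(b.astype(theano.config.floatX))
        conv_out = conv2d(input, self.W, border_mode='half')  # 'same' padding
        self.output = T.nnet.relu(conv_out + self.b.dimshuffle('x', 0, 'x', 'x'))

# Wiring a single layer, with random weights standing in for the
# pretrained VGG-19 weights read from the .mat file
x = T.tensor4('x')
layer = ConvLayer(x, np.random.randn(64, 3, 3, 3), np.zeros(64))
get_activation = theano.function([x], layer.output)
img = np.random.rand(1, 3, 224, 224).astype(theano.config.floatX)
print(get_activation(img).shape)  # (1, 64, 224, 224)
```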
The training algorithm used a SciPy function that performs minimization with L-BFGS, together with two Theano functions, wrapped in Python, that compute the loss and the gradient, starting from an initial guess of white noise. The input data were the test images from the paper for content and style.
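A self-contained sketch of that driver uses SciPy's fmin_l_bfgs_b; a toy quadratic objective again stands in for the compiled Theano loss and gradient functions, and the image size is illustrative.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

img_shape = (1, 3, 256, 256)            # illustrative size
target = np.random.rand(*img_shape)     # stand-in for the real objective

def loss_fn(x_flat):
    x = x_flat.reshape(img_shape)
    return float(np.sum((x - target) ** 2))

def grad_fn(x_flat):
    x = x_flat.reshape(img_shape)
    return (2.0 * (x - target)).ravel().astype('float64')

# Initial guess: white noise image, flattened for the optimizer
x0 = np.random.uniform(0.0, 1.0, size=img_shape).ravel()

x_opt, final_loss, info = fmin_l_bfgs_b(loss_fn, x0, fprime=grad_fn, maxfun=40)
generated_image = x_opt.reshape(img_shape)
print(final_loss)
```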
4.2. Software Design
The following figure shows the process the data go through to produce the new image:
Neural network: feed the input forward through the neural network and output the activations at the chosen layers.
Loss: compute the loss for every style and content layer and output a scalar loss.
Minimization: using the loss and gradient functions, minimize the loss as a function of the input, then feed the input back through the CNN.
5. Results
5.1. Project Results
There are two principal categories of results produced by this project: the results of training VGG-19 and the style transfer results with the pretrained weights.
VGG-19 Training
Figure 3: VGG-19 Training Initialized with random weights
The graph in Figure 3 shows that the testing accuracy of VGG-19 initialized with random weights does not improve appreciably after 100 epochs, training on 31 mini-batches containing 840 images in total across 10 categories. This justifies the use of the pretrained weights.
Style Transfer
Figure 4: Generated Image with Tubingen content and The Starry Night (Van Gogh) style
Figure 4 clearly shows how style from the Van Gogh painting is applied to the photo of the Neckarfront - the outlines of the buildings are clearly visible, as are their reflections in the water. The brush strokes are reminiscent of The Starry Night, as is the colour palette.
Figure 5: Progressive style transfer with L-BFGS: top at iteration 0, middle at iteration 5, bottom at iteration 10
In Figure 5, it can be observed that the generated image does get closer to the content of the content input and the style of the style input as the loss is minimized. Clear building shapes appear after 4 iterations, along with the brush strokes and colour palette of the style image. A more recognizable image is shown after 5 more iterations.
Figure 6: Comparison of Adam and L-BFGS after 10 iterations with Tubingen and The Shipwreck of the Minotaur by Turner
Figure 6 displays the superiority of L-BFGS minimization over the Adam optimizer. After 10 iterations, Adam still produces an image similar to white noise, and it takes roughly 50 more iterations before Adam produces an image comparable to the one given by 10 iterations of L-BFGS.
Figure 7: Reconstructions at Content Layer conv4_2 of Tubingen
Figure 7 displays the reconstruction from the output of the conv4_2 layer of VGG-19 with the Neckarfront input image. The edges of the buildings and other objects are clearly visible, but the colours are dulled, indicating that this layer's output is especially content-focused rather than style-focused.
Figure 8: Comparison of alpha/beta ratio - top: alpha = 0.2, beta = 1e-6; bottom: alpha = 0.2, beta = 1e-5
Figure 9: Tubingen content with Femme Nue Assise by Picasso
Figure 9 displays how profoundly the colour of the style input affects the output. Here the content is hard to perceive because of the low contrast in the input style image.
Figure 10: Tubingen content with Composition VII by Kandinsky
Figure 11: Progression of loss when training for style loss and content loss with L-BFGS
Figure 11 shows how the loss progresses during minimization. The style loss decreases much more quickly than the content loss, which only decreases by a small amount.
5.2. Comparison of Results
These are the main images resulting from the style transfer in the paper.
Figure 12: Results of the style transfer from the original paper
The images shown in the paper are much nicer than the ones produced by this project. However, similar features can be observed: both sets of results show the transfer of colour and texture elements, such as brush strokes, onto the content image. The results from the paper retain a much greater level of content detail, with smaller elements such as windows remaining clearly visible. The difference with the style from Femme nue assise by Picasso is especially striking, as the image from the project lost almost all contrast. These differences can be attributed to longer training times, better alpha and beta values, weight normalization, and the use of average pooling.
5.3. Discussion of Insights Gained
This paper clearly displays the power of CNNs, especially deep CNNs, to mimic biological activity and extract important features from visual content. The different layers of a CNN expose different features, similar to complex cells in the visual cortex, and these features can be used for all sorts of purposes, from image generation to classification.
It is also clear that not all applications of CNNs require long computation times - with good optimization, new content can be generated in minutes.
Finally, this project also shows the power of Theano, especially its ability to compute the gradient of complex computational graphs, allowing loss minimization in an elegant manner.
6. Conclusion
In conclusion, this project shows the ability of CNNs to separate style and content in visual representations. Powerful feature extractors make it possible to identify style elements and content elements, and novel loss functions combine the two to create new content sourced from both. While working around hardware limitations, many optimizations to the architecture were made to obtain results quickly when combining style and content. An interesting extension would be a loss function for three images separated into foreground and background, taking the background style of image 1 and the foreground style of image 2 and applying them to the respective components of image 3.
7. Acknowledgements
Thank you to Mehmet Turkcan and James Lloyd for actively explaining some sticky points in the paper.
Thank you to fchollet (https://github.com/fchollet/keras/blob/master/examples/neural_style_transfer.py) for the Keras implementation, to jcjohnson (https://github.com/jcjohnson/neural-style) for the Torch implementation, and to webeng for the L-BFGS with Theano example (https://github.com/Lasagne/Recipes/blob/master/examples/styletransfer/Art%20Style%20Transfer.ipynb), all available on GitHub.
8. References
[1] https://bitbucket.org/e_4040_ta/e4040_project_naah
[2] Krizhevsky, A., Sutskever, I. & Hinton, G. E. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, 1097–1105 (2012). URL http://papers.nips.cc/paper/4824-imagenet.
[3] Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. "DeepFace: Closing the gap to human-level performance in face verification." In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 1701–1708 (IEEE, 2014). URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6909616
[4] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[5] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015).
[6] Lloyd, James Robert, “Neural Art”, http://jamesrobertlloyd.com/blog-2015-09-01-neural-art.html. Consulted on December 18th 2016
[7] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "Image style transfer using convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
9. Appendix
9.1 Individual student contributions - table
UNI | fam2148 | yg2482 | rg3047
Last Name | Marcogliese | Garg | Godden
Fraction of (useful) total contribution | 1/3 | 1/3 | 1/3
What I did 1 | wrote report | wrote report | VGG-19 perfection
What I did 2 | L-BFGS implementation | VGG training | Loss functions
What I did 3 | finalized code for style transfer | initial NN setup | alpha/beta optimization