Neural Style Transfer Using Convolutional Neural Networks in Art Generation

Neural Style Transfer is one of the most exciting applications of convolutional neural networks (ConvNets). Several mobile applications use it to apply a chosen style to a captured photo, and many people also use it for fun, non-commercial projects.

Let's say you take an image, like the one on the left, and want to recreate it in the style of the image on the right. Neural Style Transfer allows you to generate a new image like the one below.

Example image with a car

So, we have two images here: the content image and the style image. The content image is the original image that we want to transform. The style image carries the new style that we want to transfer to the original image. To implement Neural Style Transfer, you need to look at the features extracted by a ConvNet at various layers, both shallow and deep. But let's first ask what all these layers of a ConvNet are really computing. What are deep ConvNets really learning?


Let's say you've trained a ConvNet and you want to visualize what the hidden units in different layers are computing. Let's start with a hidden unit in layer 1, and suppose you scan through your training set to find the images, or image patches, that maximize that unit's activation.

In other words, pass your training set through your neural network and figure out which images maximize that particular unit's activation. Notice that a hidden unit in layer 1 sees only a relatively small portion of the input image, so if you plot what caused this unit's activation, you will see just small image patches. If you pick one hidden unit and find the input images that maximize that unit's activation, you might find nine image patches like those on the left.

Example Image Style transfer

You can see that the regions of the image that this particular hidden unit sees are corners and edges: it is looking for an edge or a line like the ones shown. So those are the image patches that maximally activate one hidden unit.
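The search for maximally activating patches can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "hidden unit" is a hypothetical hand-written vertical-edge filter followed by ReLU, not a unit from a trained ConvNet, and the function name `top_activating_patches` and the toy images are made up for the example.

```python
import numpy as np

def top_activating_patches(images, kernel, k=9):
    """Return the k (activation, patch) pairs that most strongly activate one unit.

    The unit is a stand-in: a single linear filter followed by ReLU,
    applied to every kernel-sized patch of every image.
    """
    kh, kw = kernel.shape
    scored = []
    for img in images:
        H, W = img.shape
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                patch = img[i:i + kh, j:j + kw]
                activation = max(0.0, float(np.sum(patch * kernel)))  # ReLU
                scored.append((activation, patch))
    # Sort by activation only, strongest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical unit: responds to dark-on-the-left, bright-on-the-right edges.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)

rng = np.random.default_rng(0)
images = [rng.random((8, 8)) for _ in range(5)]
# Plant a clean vertical edge in the first image.
images[0][:, :4] = 0.0
images[0][:, 4:] = 1.0

top = top_activating_patches(images, vertical_edge)
best_activation, best_patch = top[0]
```

The nine returned patches play the role of the nine image patches in the figure: here they all come from the planted edge, because nothing in the random images excites the filter as strongly.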

Example 01

Now, you can pick a different hidden unit in layer 1 and do the same thing. Judging by its image patches, this second unit seems to be looking for a line in a particular portion of its input region, in this case around the eyes and eyebrows.

Example 02

So these are different representative neurons and, for each of them, the image patches that maximally activate them. This suggests that the hidden units in the first layer often look for relatively simple features such as edges or shades of a certain color.

Here is another network, again with the activations for the first layer:

Example Layer

But what if you do this for some of the hidden units in the deeper layers of the neural network? What is the neural network learning at those deeper layers?

In the deeper layers, a hidden unit usually sees a larger portion of the image. In the extreme, each pixel could affect the output of the later layers of the neural network. So, later units actually see larger image patches.
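To make "deeper units see larger patches" concrete, here is a small sketch that computes the receptive field along one axis after each layer. The layer stacks are hypothetical examples, not the architecture of any network shown in this article, and the function name `receptive_field_sizes` is made up for illustration.

```python
def receptive_field_sizes(layers):
    """Receptive field (in input pixels, along one axis) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    size = 1   # a single input pixel sees only itself
    jump = 1   # distance in input pixels between neighbouring units
    sizes = []
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

# Hypothetical stack: four 3x3 convolutions with stride 1.
plain_stack = receptive_field_sizes([(3, 1)] * 4)        # [3, 5, 7, 9]

# Hypothetical stack: 3x3 conv, 2x2 pool (stride 2), 3x3 conv, 2x2 pool.
pooled_stack = receptive_field_sizes([(3, 1), (2, 2), (3, 1), (2, 2)])  # [3, 4, 8, 10]
```

In both cases the patch of the input that a unit can see grows with depth, and strided layers such as pooling make it grow faster.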

So, this is a visualization of what maximally activates different hidden units in layer 2. These are patches of the image that cause activation of a certain hidden unit.

Example style transfer 03

The interesting thing is that the second layer looks like it's detecting more complex shapes and patterns, such as textures made up of many vertical lines.

Image shapes

So, what about the third layer?

It looks like there is a hidden unit that seems to respond to textures like honeycomb shapes, or square shapes...

How about the next layer?

Well, the fourth layer seems to detect even more complex shapes. You can see that it is nearly detecting dogs, not by species or breed, of course, but by general shape. Remember, we are only 4 layers deep.

To build a Neural Style Transfer system, we have to choose a network, define a cost function to minimize, and then run the minimization.

Let's define a cost function J that measures the quality of a generated image. We'll use gradient descent to minimize J in order to generate this image.


How good is a particular image?

We will define two parts to this cost function. The first is called the content cost, which is a function of the content image and of the generated image. It measures how similar the content of the generated image is to the content of the content image. We then add to it a style cost, which measures how similar the style of the generated image is to the style of the style image.

Finally, we weight the two terms with hyperparameters alpha and beta, which specify the relative weighting between the content cost and the style cost. The original style transfer paper uses these two hyperparameters.

So, the algorithm runs as follows. We initialize the generated image randomly, for example as a 100x100x3 image, or whatever dimensions you want it to be. After that, we define the cost function J.

We then use ordinary gradient descent to minimize this function. Denoting the generated image as G, each step updates G to G minus the derivative of the cost function J with respect to G (scaled by a learning rate). In other words, you are actually updating the pixel values of this 100x100x3 image G.

The cost function of the neural style transfer algorithm has a content cost component and a style cost component.
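The update on the pixels of G can be sketched as follows. This is a toy illustration, not the real style transfer cost: the cost here is a stand-in (squared distance to a hypothetical target image) chosen so that its gradient can be written by hand, and the learning rate and step count are arbitrary assumptions.

```python
import numpy as np

def cost(G, target):
    # Stand-in cost J(G): sum of squared pixel differences from a target image.
    return float(np.sum((G - target) ** 2))

def cost_gradient(G, target):
    # Analytic gradient dJ/dG of the stand-in cost, one value per pixel.
    return 2.0 * (G - target)

rng = np.random.default_rng(0)
target = rng.random((100, 100, 3))   # hypothetical image the cost pulls G towards
G = rng.random((100, 100, 3))        # generated image, initialised randomly

learning_rate = 0.25                 # arbitrary choice for this toy cost
initial_cost = cost(G, target)
for step in range(50):
    # Gradient descent acts on the pixels themselves: G := G - lr * dJ/dG
    G = G - learning_rate * cost_gradient(G, target)
final_cost = cost(G, target)
```

Note that no network weights are updated anywhere in the loop. The only thing that changes from step to step is the image G, which is the key difference from ordinary network training.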

So, J = alpha * J_content + beta * J_style


Content cost

So, let us define the content cost component. Let us say that you use hidden layer L to compute the content cost. If L is a very small number, like the first layer, it will force your generated image to have pixel values very similar to your content image. Conversely, if you use a very deep layer, the cost only asks that the generated image contain the same high-level learned features as the content image.

In practice, layer L is chosen somewhere in between, neither too shallow nor too deep in the neural network.


We will then use a pre-trained ConvNet to measure how similar a content image and a generated image are. If we compare the activations of layer L on these two images and find them to be similar, that would seem to imply that both images have similar content.


import tensorflow as tf

def content_cost(a_C, a_G):
    # a_C: hidden layer activations representing content of the Content image
    # a_G: hidden layer activations representing content of the Generated image
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    # Sum of squared activation differences, normalised by the layer size
    J_content = tf.reduce_sum(tf.square(tf.subtract(a_C, a_G))) / (4 * n_H * n_W * n_C)
    return J_content


Style cost

Next, let's move on to the style cost function. What does the style of an image mean?

Let's say you have an input image and you've chosen some layer L to define the measure of the style of an image. The style is defined as the correlation between activations across the different channels of layer L. So, given an image, you can compute something called a style matrix, which measures all those correlations between the channels of that layer's activations. But what does it mean for two channels to be highly correlated?

Well, if they're highly correlated, it means that whatever part of the image has a certain texture (the feature detected by one channel), that part of the image will probably also have the feature detected by the other channel, such as a certain color. If they're uncorrelated, it means that wherever there is this vertical texture, the image probably won't have that color.

So the correlation tells you which of these high-level texture components tend to occur or not occur together in a part of an image. The degree of correlation gives you one way of measuring how often these different high-level features occur together in different parts of an image.
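A tiny worked example may help here. The sketch below builds a toy activation volume with three hypothetical channels (the names `vertical_texture`, `orange_tint` and `blue_tint` are invented for illustration) and computes its style matrix as a Gram matrix of unnormalized channel products, which is the usual choice and an assumption about what "correlation" means in this article.

```python
import numpy as np

def style_matrix(activations):
    """Gram matrix of one layer's activations, shape (n_C, n_C).

    `activations` has shape (n_H, n_W, n_C). Entry [k, k2] sums the product
    of channel k and channel k2 over every spatial position.
    """
    n_H, n_W, n_C = activations.shape
    A = activations.reshape(n_H * n_W, n_C)   # one row per spatial position
    return A.T @ A

# Toy 4x4 activation maps for three hypothetical channels.
vertical_texture = np.zeros((4, 4))
vertical_texture[:, :2] = 1.0            # fires on the left half of the image
orange_tint = vertical_texture.copy()    # fires in exactly the same places
blue_tint = 1.0 - vertical_texture       # fires only where the texture is absent

acts = np.stack([vertical_texture, orange_tint, blue_tint], axis=-1)
gram = style_matrix(acts)

# gram[0, 1] is large: the texture and the orange tint occur together.
# gram[0, 2] is zero: the texture and the blue tint never occur together.
```

The matrix is symmetric, and its diagonal entries simply measure how strongly each channel fires on its own.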


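def gram_matrix(A):
    # Assumed helper (not defined in the original snippet): Gram matrix A * A^T
    return tf.matmul(A, tf.transpose(A))

def layer_style_cost(a_S, a_G):
    # a_S: hidden layer activations representing style of the Style image
    # a_G: hidden layer activations representing style of the Generated image
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    # Unroll each activation volume into a (n_C, n_H*n_W) matrix
    a_S = tf.transpose(tf.reshape(a_S, (n_H * n_W, n_C)))
    a_G = tf.transpose(tf.reshape(a_G, (n_H * n_W, n_C)))
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Squared difference between the two style matrices, normalised by layer size
    J_style_layer = tf.reduce_sum(tf.square(tf.subtract(GS, GG))) / (4 * n_C**2 * (n_H * n_W)**2)
    return J_style_layer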


In general, calculating the cost function will be like the following:


def total_cost(J_content, J_style, alpha=10, beta=40):
    # alpha: hyperparameter weighting the importance of the content cost
    # beta: hyperparameter weighting the importance of the style cost
    J = alpha * J_content + beta * J_style
    return J


So, to sum up, neural style transfer applies the style of a style image to a content image. It does this by computing a content cost and a style cost for the generated image, combining them into a single loss, and minimizing that loss so that the learned features of both images are transferred to the generated image. You can find style transfer behind many applications today, such as Prisma.





  • A Neural Algorithm of Artistic Style, Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
  • Visualizing and Understanding Convolutional Networks, Matthew D Zeiler, Rob Fergus

