All you need for Photorealistic Style Transfer in PyTorch

Kushajveer Singh
7 min read · May 6, 2019

This post follows the paper High-Resolution Network for Photorealistic Style Transfer. I discuss the details of the paper and the PyTorch code. My implementation can be found in this repo. The official code release for the paper can be found here.

Use this model as your de-facto model for style transfer.

Index

  1. What is Style Transfer?
  2. So why another paper?
  3. Gram Matrix
  4. High-Resolution Models
  5. Style Transfer Details
  6. Hi-Res Generation Network
  7. Loss Functions
  8. Difficult Part
  9. Conclusion

What is Style Transfer?

We have two images as input: one is the content image and the other is the style image.

Fig1: Left: content image and Right: style image

Our aim is to transfer the style from the style image to the content image. The result looks something like this.

Fig2: Output Image

So why another paper?

Earlier work on style transfer, although successful, was not able to maintain the structure of the content image. For instance, compare Fig2 with the original content image in Fig1: the curves and structure of the content image are not distorted, and the output image has the same structure as the content image.

Fig3: The results from this paper are shown in (e) and (j) and a comparison is done with other methods as well. Figure is taken from Figure1 in the original paper.

Gram Matrix

The main idea behind the paper is using the Gram matrix for style transfer. The following two papers showed that the Gram matrix of the feature maps of a convolutional neural network (CNN) can represent the style of an image, and they proposed the neural style transfer algorithm for image stylization.

  1. Texture Synthesis Using Convolutional Neural Networks by Gatys et al. 2015
  2. Image Style Transfer Using Convolutional Neural Networks by Gatys et al. 2016

Details about the Gram matrix can be found on Wikipedia. Mathematically, given a set of vectors v_1, …, v_n stacked as the rows of a matrix V, the Gram matrix is computed as G = V Vᵀ, i.e. G_ij = v_i · v_j. For a CNN feature map, each channel (flattened over the spatial dimensions) plays the role of one such vector.
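As a minimal sketch, the Gram matrix of a feature map can be computed in PyTorch like this (normalizing by the number of elements is a common convention; the repo may normalize differently):

import torch

def gram_matrix(feat):
    # feat: feature map of shape (batch, channels, height, width)
    b, c, h, w = feat.size()
    # Flatten the spatial dimensions so each channel becomes one vector
    feat = feat.view(b, c, h * w)
    # Inner products between all pairs of channel vectors
    gram = torch.bmm(feat, feat.transpose(1, 2))
    # Normalize by the number of elements in the feature map
    return gram / (c * h * w)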

High-Resolution Models

High-resolution networks come from a recent research paper accepted at CVPR 2019. Generally, what happens in CNNs is that we first decrease the image size while increasing the number of filters, and then increase the size of the image back to the original resolution.

This forces our model to generate output images from a very small resolution, which results in a loss of finer details and structure. To counter this, the High-Res model was introduced.

The high-resolution network is designed to maintain high-resolution representations through the whole process while continuously receiving information from the low-resolution branches. So we train our models on the original resolution.

An example of this model is covered below. You can refer to the original papers for more details. I will cover this topic in detail in next week's blog post.

Style Transfer Details

The general architecture of modern deep learning style transfer algorithms looks something like this.

Fig4: Model architecture for style transfer in the deep learning era. Fig taken from Figure3 of original paper.

There are three things that a style transfer model needs:

  1. Generating model:- This generates the output images. In Fig4 this is the ‘Hi-Res Generation Network’.
  2. Loss functions:- The correct choice of loss functions is very important if you want to achieve good results.
  3. Loss network:- You need a pretrained CNN model that can extract good features from the images. In our case, it is VGG19 pretrained on ImageNet.

So we load the VGG model. The complete code is available at my GitHub repo.
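A minimal sketch of this step, assuming torchvision's pretrained VGG19 (only the convolutional part is needed, and its weights stay frozen):

import torch
from torchvision import models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Convolutional part of VGG19 pretrained on ImageNet, used only as a loss network
vgg = models.vgg19(pretrained=True).features.to(device).eval()

# We never update the loss network, so freeze its parameters
for param in vgg.parameters():
    param.requires_grad_(False)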

Next, we load our images from disk.

My images are stored as src/imgs/content.png and src/imgs/style.png.

Detail:- When we load our images, what sizes should we use? The content image size should be divisible by 4, as our model downsamples the image twice (each time by a factor of 2). Do not resize the style image; use its original resolution.

For the images I am using, the size of the content image is (500x500x3) and the size of the style image is (800x800x3).
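Here is a sketch of loading and preprocessing the two images. The paths match the ones above; the ImageNet normalization is my assumption (VGG19 expects ImageNet statistics), so check the repo for the exact preprocessing:

import torch
from PIL import Image
from torchvision import transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def load_image(path, size=None):
    image = Image.open(path).convert('RGB')
    tfms = []
    if size is not None:
        tfms.append(transforms.Resize(size))  # resize the content image only
    tfms += [
        transforms.ToTensor(),
        # ImageNet statistics, because VGG19 was pretrained on ImageNet
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ]
    return transforms.Compose(tfms)(image).unsqueeze(0).to(device)

content = load_image('src/imgs/content.png', size=(500, 500))  # 500x500x3
style = load_image('src/imgs/style.png')                       # keep original resolution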

Hi-Res Generation Network

Fig5: The structure of the high-resolution generation network. When we fuse feature maps with different resolutions, we directly concatenate these feature maps like the inception module; for example, feature map 4 is the concatenation of feature map 2 and feature map 3. We use bottleneck residuals to ensure that our network can be trained well and to speed up training while preserving good visual effects. Fig taken from Figure2 of original paper.

The model is quite simple: we start with 500x500x3 images and maintain this resolution through the complete model. We also downsample to 250x250 and 125x125, and then fuse these feature maps back together with the 500x500 ones.

Details:-

  1. No pooling is used (as pooling causes loss of information). Instead, strided convolutions (i.e. stride=2) are used.
  2. No dropout is used. But if you need regularization you can use weight decay.
  3. 3x3 conv kernels are used everywhere with padding=1.
  4. Only zero padding is used. Reflection padding was tested but the results were not good.
  5. For upsampling, ‘bilinear’ mode is used.
  6. For downsampling, conv layers are used.
  7. InstanceNorm is used.

Implementation code
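Below is a minimal sketch of the building blocks described above: a 3x3 conv + InstanceNorm + ReLU block (which doubles as the downsampling layer when stride=2) and a bilinear upsampling helper used before fusing branches. The class and function names are my own and may not match the repo.

import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """3x3 conv -> InstanceNorm -> ReLU. Use stride=2 for the downsampling blocks."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.norm = nn.InstanceNorm2d(out_ch, affine=True)

    def forward(self, x):
        return F.relu(self.norm(self.conv(x)))

def upsample_to(x, ref):
    """Bilinearly upsample x to the spatial size of ref (used before fusing branches)."""
    return F.interpolate(x, size=ref.shape[2:], mode='bilinear', align_corners=False)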

Residual connections are used between every block. We use the bottleneck layer from the ResNet architecture (in Fig5, all the horizontal arrows are bottleneck layers).

Refresher on bottleneck layer.

Fig6: Architecture of BottleneckModule from the ResNet paper.
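A sketch of such a bottleneck residual block, adapted here with InstanceNorm to match detail 7 above; the reduction factor and channel widths are assumptions, not necessarily what the repo uses.

import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with a residual (skip) connection."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1)
        self.norm1 = nn.InstanceNorm2d(mid, affine=True)
        self.norm2 = nn.InstanceNorm2d(mid, affine=True)
        self.norm3 = nn.InstanceNorm2d(channels, affine=True)

    def forward(self, x):
        out = F.relu(self.norm1(self.conv1(x)))
        out = F.relu(self.norm2(self.conv2(out)))
        out = self.norm3(self.conv3(out))
        return F.relu(out + x)  # residual connection around the whole block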

Now we are ready to implement our style transfer model, which we call HRNet (based on the paper). Use Fig5 as a reference.
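As a rough orientation only, here is a heavily simplified sketch of the fusion idea from Fig5, reusing ConvBlock, Bottleneck and upsample_to from above. The real HRNet in the paper and in the repo has more branches, fusion stages and channels; treat this as a reading aid, not the actual architecture.

import torch
import torch.nn as nn

class HRNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_full = ConvBlock(3, 16)                 # full-resolution branch (500x500)
        self.enc_half = ConvBlock(16, 32, stride=2)      # 250x250 branch
        self.enc_quarter = ConvBlock(32, 64, stride=2)   # 125x125 branch
        self.fuse = ConvBlock(16 + 32 + 64, 32)          # fuse the concatenated branches
        self.res = Bottleneck(32)
        self.out = nn.Conv2d(32, 3, kernel_size=3, padding=1)

    def forward(self, x):
        full = self.enc_full(x)
        half = self.enc_half(full)
        quarter = self.enc_quarter(half)
        # Bring every branch back to full resolution and concatenate, as in Fig5
        fused = torch.cat([full,
                           upsample_to(half, full),
                           upsample_to(quarter, full)], dim=1)
        return self.out(self.res(self.fuse(fused)))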

Loss Functions

In style transfer we use feature extraction to calculate the losses. Feature extraction, put in simple terms, means you take a pretrained ImageNet model, pass your images through it, and store the intermediate layer outputs. Generally, a VGG model is used for such tasks.

Fig7: Model architecture for VGG network

So you take the outputs from the conv layers. For the above figure, for example, you could take the output from the second 3x3 conv, 64 layer and then a 3x3 conv, 128 layer.

To extract features from VGG we use the following code.

We use 5 layers in total for feature extraction. Only conv4_2 is used as the layer for the content loss.
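Here is a sketch of such a feature extractor, assuming torchvision's VGG19 layer indexing. Which five layers exactly are used is defined in the repo; the mapping below is the common Gatys-style choice, with conv4_2 kept for the content loss.

def get_features(image, vgg, layers=None):
    """Pass image through vgg (a vgg19.features module) and collect named activations."""
    if layers is None:
        # torchvision VGG19 indices -> conventional layer names (assumed set)
        layers = {'0': 'conv1_1', '5': 'conv2_1', '10': 'conv3_1',
                  '21': 'conv4_2', '28': 'conv5_1'}
    features = {}
    x = image
    for name, layer in vgg._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features

content_features = get_features(content, vgg)
style_features = get_features(style, vgg)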

Referring to Fig4, we pass the output image from HRNet and the original content and style images through VGG.

There are two losses

  1. Content Loss
  2. Style Loss

Content Loss: The content image and the output image should have similar feature representations as computed by the loss network VGG, because we are only changing the style without changing the structure of the image.

For the content loss, we use the squared Euclidean distance between feature representations, normalized by the size of the feature map:

l_content(output, content) = || phi_j(output) - phi_j(content) ||^2 / (C_j * H_j * W_j)

Here phi_j refers to the activations of the j-th layer of the loss network, and C_j x H_j x W_j is the shape of that feature map. In code it looks like this.
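A minimal sketch, reusing get_features, gram_matrix and the tensors from the earlier snippets (the repo's implementation may normalize or name things differently):

import torch.nn.functional as F

# 'output' is the image currently produced by the HRNet generator
target_features = get_features(output, vgg)

# Mean squared error between the conv4_2 activations of the output and content images
content_loss = F.mse_loss(target_features['conv4_2'],
                          content_features['conv4_2'])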

Style Loss: We use the Gram matrix for this. The style of an image is given by its Gram matrix, and our aim is to make the styles of the two images close. So we compute the difference between the Gram matrices of the style image and the output image and then take its Frobenius norm.
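A matching sketch for the style loss, again reusing gram_matrix and the feature dictionaries from above; the exact set of style layers is an assumption.

# Sum the (averaged) squared Frobenius norm of the Gram-matrix differences over
# the style layers (every extracted layer except conv4_2, kept for the content loss)
style_loss = 0.0
for layer in ['conv1_1', 'conv2_1', 'conv3_1', 'conv5_1']:
    target_gram = gram_matrix(target_features[layer])
    style_gram = gram_matrix(style_features[layer])
    style_loss = style_loss + F.mse_loss(target_gram, style_gram)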

Difficult Part

To compute the final loss, we multiply the individual losses by some weights.

content_loss = content_weight * content_loss
style_loss = style_weight * style_loss

The difficulty comes in setting these values. To get a particular output, you will have to test different values before you get the result you want.

To build your own intuition, you can choose two images and try different ranges of values. I am working on providing a summary of this; it will be available in my repo README.

The paper recommends content_weight in [50, 100] and style_weight in [1, 10].
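To tie everything together, here is a sketch of one optimization loop with the weighted losses, reusing the pieces from the earlier snippets. The number of steps, learning rate and the particular weights below are placeholders to experiment with, not values from the paper.

import torch
import torch.nn.functional as F

model = HRNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

content_weight, style_weight = 100, 1   # starting points inside the recommended ranges

for step in range(500):
    optimizer.zero_grad()
    output = model(content)

    target_features = get_features(output, vgg)
    content_loss = content_weight * F.mse_loss(target_features['conv4_2'],
                                               content_features['conv4_2'])

    style_loss = 0.0
    for layer in ['conv1_1', 'conv2_1', 'conv3_1', 'conv5_1']:
        style_loss = style_loss + F.mse_loss(gram_matrix(target_features[layer]),
                                             gram_matrix(style_features[layer]))
    style_loss = style_weight * style_loss

    total_loss = content_loss + style_loss
    total_loss.backward()
    optimizer.step()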

Conclusion

Well, congratulations, you made it to the end. You can now implement style transfer. Read the paper for more details.

Check out my repo README; it contains complete instructions on how to use the code in the repo, along with the steps to train your own model. I will be adding video support in the coming week, so you can transfer a style to all frames of a video. I am also experimenting with cyclic learning for style transfer and will add support for fastai as well.

My earlier posts:

  1. SPADE: State of the art in Image-to-Image Translation by Nvidia
  2. Weight Standardization: A new normalization in town
  3. Training AlexNet with tips and checks on how to train CNNs

Connect with me on LinkedIn and follow me on Medium to get the latest posts directly in your feed. I am also on GitHub and Twitter.
