# Perceptual Losses for Real-Time Style Transfer and Super-Resolution

ECCV 2016，实时风格迁移与超分辨率化的感知损失，这篇论文是在cs231n里面看到的，正好最近在研究风格迁移。一作是Justin Johnson，2017春的cs231n的主讲之一。这篇论文的主要内容是对Gatys等人的风格迁移在优化过程中进行了优化，大幅提升了性能。

# Image Transformation Network

## Architecture

We do not use any pooling layers, instead using strided and fractionally strided convolutions for in-network downsampling and upsampling. Our network body consists of five residual blocks using the architecture of http://torch. ch/blog/2016/02/04/resnets.html. All non-residual convolutional layers are followed by spatial batch normalization and ReLU nonlinearities with the exception of the output layer, which instead uses a scaled tanh to ensure that the output image has pixels in the range [0, 255]. Other than the first and last layers with use $9 \times 9$ kernels, all convolutional layers use $3 \times 3$ kernels. The exact architectures of all our networks can be found in the supplementary material.

## Inputs and Outputs

For style transfer the input and output are both color images of shape #3 \times 256 \times 256#.
For super-resolution with an upsampling factor of $f$, the output is a high-resolution patch of shape $3 \times 288/f \times 288/f$. Since the image transformation networks are fully-convolutional, at test-time they can be applied to images of any resolution.

## Downsampling and Upsampling

For super-resolution with an upsampling factor of $f$, we use several residual blocks followed by $log_2f$ convolutional layers with stride $1/2$. This is different from [1] who use bicubic interpolation to upsample the low-resolution input before passing it to the network.

Our style transfer networks use the architecture shown in Table 1 and our super-resolution networks use the architecture shown in Table 2. In these tables “$C \times H \times W$ conv” denotes a convolutional layer with $C$ filters size $H \times W$ which is immeidately followed by spatial batch normalization [1] and a ReLU nonlinearity.
Our residual blocks each contain two $3 \times 3$ convolutional layers with the same number of filters on both layer. We use the residual block design of Gross and Wilber [2] (shown in Figure 1), which differs from that of He et al [3] in that the ReLU nonlinearity following the addition is removed; this modified design was found in [2] to perform slightly better for image classification.
For style transfer, we found that standard zero-padded convolutions resulted in severe artifacts around the borders of the generated image. We therefore remove padding from the convolutions in residual blocks. A $3 \times 3$ convolution with no padding reduces the size of a feature map by 1 pixel on each side, so in this case the identity connection of the residual block performs a center crop on the input feature map. We also add spatial reflection padding to the beginning of the network so that the input and output of the network have the same size.

# Perceptual Loss Functions

We define two perceptual loss functions that measure high-level perceptual and semantic differences between images. They make use of a loss network $\phi$ pretrained for image classification, meaning that these perceptual loss functions are themselves deep convolutional neural networks. In all our experiments $\phi$ is the 16-layer VGG network pretrained on the ImageNet dataset.

## Featue Reconstruction Loss

Rather than encouraging the pixels of the output image $\hat{y} = f_W(x)$ to exactly match the pixels of the target image $y$, we instead encourage them to have similar feature representations as computed by the loss network $\phi$. Let $\phi_j(x)$ be the activations of the jth layer of the network $\phi$ when processing the image $x$; if $j$ is a convolutional layer then $\phi_j(x)$ will be a feature map of shape $C_j \times H_j \times W_j$. The feature reconstruction loss is the (squared, normalized) Euclidean distance between feature representations:

As demonstrated in [6] and reproduced in Figure 3, finding an image $\hat{y}$ that minimizes the feature reconstruction loss for early layers tends to produce images that are visually indistinguishable from $y$.

## Style Reconstruction Loss

The feature reconstruction loss penalizes the output image $\hat{y}$ when it deviates in content from the target $y$. We also wish to penalize differences in style: colors, textures, common patterns, etc. To achieve this effect, Gatys et al propose the following style reconstruction loss.
As above, let $\phi_j(x)$ be the activations at the $j$th layer of the network $\phi$ for the input $x$, which is a feature map of shape $C_j \times H_j \times W_j$. Define the Gram matrix $G^\phi_j(x)$ to be the $C_j \times C_j$ matrix whose elements are given by

# Experiments

## Single-Image Super_Resolution

This is an inherently ill-posed problem, since for each low-resolution image there exist multiple high-resolution images that could have generated it. The ambiguity becomes more extreme as the super-resolution factor grows; for larger factors ($\times 4$, $\times 8$), fine details of the high-resolution image may have little or no evidence in its low-resolution version.
To overcome this problem, we train super-resolution networks not with the per-pixel loss typically used [1] but instead with a feature reconstruction loss to allow transer of semantic knowledge from the pretrained loss network to the super-resolution network. We focus on $\times 4$ and $\times 8$ super-resolution since larger factors require more semantic reasoning about the input.
The traditional metrics used to evaluate super-resolution are PSNR and SSIM, both of which have been found to correlate poorly with human assessment of visual quality. PSNR and SSIM rely only on low-level differences between pixels and operate under the assumption of additive Gasussian noise, which may be invalid for super-resolution. In addition, PSNR is equivalent to the per-pixel loss $\mathcal{l_{pixle}}$, so as measured by PSNR a model trained to minimize feature reconstruction loss should always outperform a model trained to minimize feature reconstruction loss. We therefore emphasize that the goal of these experiments is not to achieve state-of-the art PSNR or SSIM results, but instead to showcase the qualitative difference between models trained with per-pixel and feature reconstruction losses.