Image Super-Resolution Using Deep Convolutional Networks

PAMI 2016. The main idea: blur every sample in the training set, feed the result through a three-layer convolutional neural network, compute a loss between the network output and the original image, and train the model. Original paper: [Image Super-Resolution Using Deep Convolutional Networks](https://arxiv.org/abs/1501.00092)

First, the notion of an ill-posed problem. The French mathematician Hadamard introduced the concept in the early 20th century: a mathematical problem is called well-posed if a solution exists, is unique, and depends stably on the data; if one or more of these criteria fails, the problem is ill-posed. Single-image super-resolution is ill-posed because uniqueness fails: a given low-resolution image is consistent with many plausible high-resolution reconstructions.

Convolutional Neural Networks For Super-Resolution

Formulation

We first upscale a single low-resolution image to the desired size using bicubic interpolation. Let us denote the interpolated image as $Y$. Our goal is to recover from $Y$ an image $F(Y)$ that is as similar as possible to the ground-truth high-resolution image $X$. For ease of presentation, we still call $Y$ a “low-resolution” image, although it has the same size as $X$. We wish to learn a mapping $F$, which conceptually consists of three operations (a minimal code sketch follows the list):

  1. Patch extraction and representation. This operation extracts (overlapping) patches from the low-resolution image $Y$ and represents each patch as a high-dimensional vector. These vectors comprise a set of feature maps, whose number equals the dimensionality of the vectors.
  2. Non-linear mapping. This operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector. Each mapped vector is conceptually the representation of a high-resolution patch. These vectors comprise another set of feature maps.
  3. Reconstruction. This operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image. This image is expected to be similar to the ground truth $X$.
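Here is a minimal PyTorch sketch of the three operations, assuming the paper's basic 9-1-5 setting ($f_1=9$, $f_2=1$, $f_3=5$, $n_1=64$, $n_2=32$); the class and attribute names are mine, not the authors':

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Sketch of the three-operation pipeline described above.

    Padding is omitted, matching the training setup described later,
    so the output is slightly smaller than the input.
    """
    def __init__(self, c=3, n1=64, n2=32, f1=9, f2=1, f3=5):
        super().__init__()
        self.patch_extraction = nn.Conv2d(c, n1, kernel_size=f1)    # operation 1
        self.nonlinear_mapping = nn.Conv2d(n1, n2, kernel_size=f2)  # operation 2
        self.reconstruction = nn.Conv2d(n2, c, kernel_size=f3)      # operation 3
        self.relu = nn.ReLU()

    def forward(self, y):
        h = self.relu(self.patch_extraction(y))   # F1(Y)
        h = self.relu(self.nonlinear_mapping(h))  # F2(Y)
        return self.reconstruction(h)             # F(Y); no ReLU on the last layer
```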

Patch Extraction and Representation

$$F_1(Y) = \max(0, W_1 * Y + B_1)$$

where $W_1$ and $B_1$ represent the filters and biases, respectively, and $*$ denotes the convolution operation. Here, $W_1$ corresponds to $n_1$ filters of support $c \times f_1 \times f_1$, where $c$ is the number of channels in the input image and $f_1$ is the spatial size of a filter. Intuitively, $W_1$ applies $n_1$ convolutions to the image, and each convolution has a kernel of size $c \times f_1 \times f_1$. The output is composed of $n_1$ feature maps. $B_1$ is an $n_1$-dimensional vector, each element of which is associated with one filter. We apply the ReLU ($\max(0,x)$) to the filter responses.
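To make the dimensions concrete, a small shape check for this first layer (values assume the 9-1-5 setting above; the 33-pixel input matches the sub-image size used later in Training):

```python
import torch
import torch.nn as nn

c, n1, f1 = 3, 64, 9
conv1 = nn.Conv2d(c, n1, kernel_size=f1)  # W1: n1 filters of support c x f1 x f1, plus B1
Y = torch.randn(1, c, 33, 33)             # a 33 x 33 interpolated "low-resolution" input
F1 = torch.relu(conv1(Y))                 # F1(Y) = max(0, W1 * Y + B1)
print(F1.shape)                           # torch.Size([1, 64, 25, 25]): n1 feature maps
```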

Non-Linear Mapping

$$F_2(Y) = \max(0, W_2 * F_1(Y) + B_2)$$

Here $W_2$ contains $n_2$ filters of size $n_1 \times f_2 \times f_2$, and $B_2$ is $n_2$-dimensional. Each of the output $n_2$-dimensional vectors is conceptually a representation of a high-resolution patch that will be used for reconstruction.

Reconstruction

$$F(Y) = W_3 * F_2(Y) + B_3$$

Here $W_3$ corresponds to $c$ filters of size $n_2 \times f_3 \times f_3$, and $B_3$ is a $c$-dimensional vector.

Training

Learning the mapping $F$ requires estimating the parameters $\Theta = \{W_1, W_2, W_3, B_1, B_2, B_3\}$. Given $n$ training pairs of ground-truth images $X_i$ and their interpolated low-resolution counterparts $Y_i$, we minimize the mean squared error:

$$L(\Theta) = \frac{1}{n}\sum^n_{i=1}\Vert F(Y_i;\Theta)-X_i\Vert^2$$

Using MSE as the loss function favors a high PSNR, the standard evaluation metric:

$$PSNR = 10 \times \log_{10}\left(\frac{(2^n-1)^2}{MSE}\right)$$

where $n$ here denotes the bit depth of a pixel (so $2^n - 1 = 255$ for 8-bit images). The loss is minimized using stochastic gradient descent with momentum:

$$\Delta_{i+1}=0.9 \cdot \Delta_i - \eta \cdot \frac{\partial{L}}{\partial{W^\ell_i}}, \qquad W^\ell_{i+1}=W^\ell_{i}+\Delta_{i+1}$$
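As a quick worked example of the PSNR formula, a small helper for integer-valued images (the function name and interface are mine):

```python
import numpy as np

def psnr(x, f_y, bit_depth=8):
    """PSNR between ground truth x and reconstruction f_y (integer arrays)."""
    mse = np.mean((x.astype(np.float64) - f_y.astype(np.float64)) ** 2)
    peak = (2 ** bit_depth - 1) ** 2  # (2^n - 1)^2 = 255^2 for 8-bit pixels
    return 10 * np.log10(peak / mse)
```

A higher PSNR corresponds directly to a lower MSE, which is why training on the MSE loss favors this metric.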

where $\ell \in \{1,2,3\}$ and $i$ are the indices of layers and iterations, $\eta$ is the learning rate, and $\frac{\partial{L}}{\partial{W^\ell_i}}$ is the derivative. The filter weights of each layer are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.0001 (and 0 for biases). The learning rate is $10^{-4}$ for the first two layers and $10^{-5}$ for the last layer. We empirically find that a smaller learning rate in the last layer is important for the network to converge (similar to the denoising case).

In the training phase, the ground-truth images $\{X_i\}$ are prepared as $f_{sub} \times f_{sub} \times c$-pixel sub-images randomly cropped from the training images. By “sub-images” we mean these samples are treated as small “images” rather than “patches”, in the sense that “patches” overlap and require some averaging as post-processing, whereas “sub-images” do not. To synthesize the low-resolution samples $\{Y_i\}$, we blur a sub-image with a Gaussian kernel, sub-sample it by the upscaling factor, and upscale it by the same factor via bicubic interpolation.

To avoid border effects during training, all the convolutional layers have no padding, and the network produces a smaller output of size $(f_{sub}-f_1-f_2-f_3+3)^2 \times c$. The MSE loss is therefore evaluated only on the difference between the central pixels of $X_i$ and the network output. Although we use a fixed image size in training, the convolutional neural network can be applied to images of arbitrary size during testing.
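The synthesis and cropping pipeline can be sketched as follows. This is an illustration, not the authors' code: `make_training_pair`, the blur radius, and the use of nearest-neighbor resizing to approximate decimation are my assumptions; $f_{sub}=33$ and an output size of $21$ match the $33 - 9 - 1 - 5 + 3 = 21$ arithmetic for the 9-1-5 setting.

```python
import numpy as np
from PIL import Image, ImageFilter

def make_training_pair(img, scale=3, f_sub=33, out=21, rng=np.random):
    """Synthesize one (Y_i, X_i) training pair as described above."""
    w, h = img.size
    left = rng.randint(0, w - f_sub + 1)
    top = rng.randint(0, h - f_sub + 1)
    x = img.crop((left, top, left + f_sub, top + f_sub))  # ground-truth sub-image X_i

    # Blur with a Gaussian kernel (radius chosen arbitrarily here), sub-sample
    # by the upscaling factor, then upscale back via bicubic interpolation.
    y = x.filter(ImageFilter.GaussianBlur(radius=1))
    y = y.resize((f_sub // scale, f_sub // scale), Image.NEAREST)  # decimation
    y = y.resize((f_sub, f_sub), Image.BICUBIC)                    # Y_i

    # The loss compares only the central out x out pixels of X_i with the output.
    margin = (f_sub - out) // 2
    x_center = x.crop((margin, margin, margin + out, margin + out))
    return y, x_center
```

The per-layer learning rates map naturally onto optimizer parameter groups; a sketch using the `SRCNN` class from the Formulation section:

```python
import torch.optim as optim

model = SRCNN()
optimizer = optim.SGD(
    [
        {"params": model.patch_extraction.parameters()},
        {"params": model.nonlinear_mapping.parameters()},
        {"params": model.reconstruction.parameters(), "lr": 1e-5},  # smaller lr for the last layer
    ],
    lr=1e-4,       # default for the first two layers
    momentum=0.9,  # the 0.9 * Delta_i term in the update rule
)
```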
