# Identity Mappings in Deep Residual Networks

ECCV 2016, ResNet v2, 原文链接：Identity Mappings in Deep Residual Networks

# Identity Mappings in Deep Residual Networks

## Introduction

Deep residual network (ResNets) consist of many stacked “Residual Units”. Each unit (Fig. 1(a)) can be expressed in a general form:

where $x_l$ and $x_{l+1}$ are input and output of the $l$-th unit, and $\mathcal{F}$ is a residual function.$h(x_l)=x_l$ is an identity mapping and $f$ is a ReLU function.
The central idea of ResNets is to learn the additive residual function $\mathcal{F}$ with respect to $h(x_l)$, with a key choice of using an identity mapping $h(x_l)=x_l$. This is realized by attaching an identity skip connection (“shortcut”).
In this paper, we analyze deep residual networks by focusing on creating a “direct” path for propagating information — not only within a residual unit, but through the entire network. Our derivations reveal that if both h(x_l) and f(y_l) are identity mappings, the signal could be directly propagated from one unit to any other units, in both forward and backward passes.
To understand the role of skip connections, we analyse and compare various types of $h(x_l)$. We find that the identity mapping $h(x_l) = x_l$ chosen in achieves the fastest error reduction and lowest training loss among all variants we investigated, whereas skip connections of scaling, gating, and $1 \times 1$ convolutions all lead to higher training loss and error. These experiments suggest that keeping a “clean” information path (indicated by the grey arrows in Fig. 1,2, and 4) is helpful for easing optimization.

Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey arrows indicate the easiest paths for the information to propagate, corresponding to the additive term “x_l” in Eqn.(4) (forward propagation) and the additive term “1” in Eqn.(5) (backward propagation). Right: training curves on CIFAR-10 of 1001-layer ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train.

To construct an identity mapping $f(y_l)=y_l$, we view the activation functions (ReLU and BN) as “pre-activation” of the weight layers, in constrast to conventional wisdom of “post-activation”. This point of view leads to a new residual unit design, shown in (Fig. 1(b)). Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier to train and generalizes better than the original ResNet in [1]. We further report improved results on ImageNet using a 200-layer ResNet, for which the counterpart of [1] starts to overfit. These results suggest that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning.

## Analysis of Deep Residual Networks

The ResNets developed in [1] are modularized architectures that stack building blocks of the same connecting shape. In this paper we call these blocks “Residual Units”. The original Residual Unit in [1] performs the following computation:

Here $x_l$ is the input feature to the $l$-th Residual Unit. $\mathcal{W_l}=\lbrace W_{l,k} \mid 1 \le k \le K\rbrace$ is a set of weights (and biases) associated with the $l$-th Residual Unit, and $K$ is the number of layers in a Residual Unit ($K$ is 2 or 3 in [1]). $\mathcal{F}$ denotes the residual function, e.g., a stack of two $3 \times 3$ convolutional layers in [1]. The function $f$ is the operation after element-wise addition, and in [1] $f$ is ReLU. The function $h$ is set as an identity mapping: $h(x_l)=x_l$.
If $f$ is also an identity mapping: $x_{l+1} \equiv y_l$, we can put Eqn.(2) into Eqn.(1) and obtain:

Recursively $(x_{l+2}=x_{l+1} + \mathcal{F}(x_{l+1}, \mathcal{W_{l+1}}) = x_l + \mathcal{F}(x_l, \mathcal{W_l}) + \mathcal{F}(x_{l+1},\mathcal{W_{l+1}}), etc.)$ we will have: