1. Neural Network Structure


1.1. Multi-Layer Neural Network

A 3-layer network consists of an input layer, a hidden layer, and an output layer. Except for the input units, each unit has a bias.

1.2. Forward Computation

Hidden layer:

$$ net_j=\sum_{i=1}^{d}x_i w_{ji}+w_{j0}=\sum_{i=0}^{d}x_i w_{ji}=w_j^t x $$

Specifically, a signal $x_i$ at the input of synapse $i$ connected to neuron $j$ is multiplied by the synaptic weight $w_{ji}$. Here $i$ indexes the input layer and $j$ the hidden layer; $w_{j0}$ is the bias, with $x_0=+1$.

  • Each neuron is represented by a set of linear synaptic links, an externally applied bias, and a possibly nonlinear activation link. The bias is represented by a synaptic link connected to an input fixed at $+1$.
  • The synaptic links of a neuron weight their respective input signals.
  • The weighted sum of the input signals defines the induced local field of the neuron in question.
  • The activation link squashes the induced local field of the neuron to produce an output.

Hidden-layer output: $y_j=f(net_j)$, where $f(\cdot)$ is the *activation function*. It defines the output of a neuron in terms of the induced local field $net$.

[Diagram: inputs $x_0=+1$, $x_1$, $x_2$ weighted by $w_{j0}$, $w_{j1}$, $w_{j2}$ are summed into $net_j$, which passes through $f(\cdot)$ to give $y_j$.]
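Below is a minimal sketch of a single unit's computation (the induced local field followed by the activation). The sigmoid choice for $f$ and the concrete numbers are assumptions for illustration only:

```python
import numpy as np

def sigmoid(v):
    # logistic activation: squashes the induced local field into (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(x, w_j):
    # x includes the fixed bias input x_0 = +1; w_j[0] is the bias weight w_j0
    net_j = w_j @ x            # induced local field: net_j = w_j^T x
    return sigmoid(net_j)      # y_j = f(net_j)

x = np.array([1.0, 0.5, -1.2])     # [x_0 = +1, x_1, x_2]
w_j = np.array([0.1, 0.4, -0.3])   # [w_j0, w_j1, w_j2]
print(neuron_output(x, w_j))
```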

Similarly, for the output layer:

$$ net_k=\sum_{j=1}^{n_H}y_j w_{kj}+w_{k0}=\sum_{j=0}^{n_H}y_j w_{kj}=w_k^t y $$

$n_H$ is the number of hidden units.

So:

$$ g_k(x)=f\left(\sum_{j=1}^{n_H}w_{kj}\,f\left(\sum_{i=1}^{d}x_i w_{ji}+w_{j0}\right)+w_{k0}\right) $$

The activation function of the output layer can differ from that of the hidden layer, and in general each unit can have its own activation function.
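To make the composed discriminant $g_k(x)$ concrete, here is a minimal NumPy sketch of the full forward pass, assuming for simplicity that both layers use the same sigmoid activation; the layer sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W_hidden, W_output):
    """Forward pass of a d -> n_H -> c network.
    W_hidden: (n_H, d+1), column 0 holds the biases w_j0.
    W_output: (c, n_H+1), column 0 holds the biases w_k0."""
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = +1
    y = sigmoid(W_hidden @ x_aug)        # hidden outputs y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))   # prepend y_0 = +1 for the output bias
    return sigmoid(W_output @ y_aug)     # g_k(x), k = 1..c

rng = np.random.default_rng(0)
d, n_H, c = 3, 4, 2
W_hidden = rng.normal(scale=0.1, size=(n_H, d + 1))
W_output = rng.normal(scale=0.1, size=(c, n_H + 1))
print(forward(rng.normal(size=d), W_hidden, W_output))
```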

1.3. BP Algorithm

The popularity of on-line learning for the supervised training of multilayer perceptrons has been further enhanced by the development of the back-propagation algorithm. Backpropagation, an abbreviation for "backward propagation of errors", is one of the simplest ways of carrying out supervised training. It requires the output activations of each hidden layer to be computed. The partial derivative $\partial J/\partial w_{ji}$ represents a sensitivity factor, determining the direction of search in weight space for the synaptic weight $w_{ji}$.

Learning:

$$ \begin{aligned} \mathcal T &=\{ x(n),d(n)\}_{n=1}^{N}\\ e_j(n)&=d_j(n)-y_j(n) \end{aligned} $$

The total instantaneous error energy, summed over all output neurons, is defined by

$$ J(w)=\frac 12 \sum_{k=1}^{c}(e_k)^{2}=\frac 12\|t-\delta\|^{2} $$

In the batch method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed *after* the presentation of all the $N$ examples in the training sample $\mathcal T$ that constitute one *epoch* of training. In other words, the cost function for batch learning is defined by the average error energy $J(w)$.
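As a sketch of the distinction between the instantaneous and the batch cost, the batch cost is simply the average of the instantaneous error energies over the $N$ examples of one epoch (variable names here are illustrative):

```python
import numpy as np

def instantaneous_error_energy(t, z):
    # J(w) for one example: half the squared error between target t and output z
    return 0.5 * np.sum((t - z) ** 2)

def batch_error_energy(targets, outputs):
    # average error energy over all N examples of one epoch
    return np.mean([instantaneous_error_energy(t, z)
                    for t, z in zip(targets, outputs)])
```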

  • First, define the gradient-descent update for the output-layer weights:

$$ \begin{aligned} \Delta w&=-\eta\frac {\partial J(w)}{\partial w} \\ w(m+1)&=w(m)+\Delta w(m) \end{aligned} $$

$$ \begin{aligned} \frac {\partial J}{\partial w_{kj}}&=\frac {\partial J}{\partial net_k}\frac {\partial net_k}{\partial w_{kj}} \\ \frac {\partial J}{\partial net_k}&=\frac {\partial J}{\partial \delta_k}\frac {\partial \delta_k}{\partial net_k}=-(t_k-\delta_k)f'(net_k) \\ \Delta w_{kj}&=-\eta \frac {\partial J}{\partial w_{kj}}=\eta (t_k-\delta_k)f'(net_k)\,y_j \end{aligned} $$

  • The input→hidden weights are updated analogously by propagating the error back through the hidden units; the full derivation is given in the backpropagation section below. A minimal code sketch of the output-layer update follows.
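A minimal sketch of the output-layer update derived above, $\Delta w_{kj}=\eta\,(t_k-\delta_k)\,f'(net_k)\,y_j$, assuming a sigmoid $f$ so that $f'(net_k)=\delta_k(1-\delta_k)$; shapes and names are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def output_layer_update(W_output, y_aug, t, eta=0.1):
    """One gradient-descent step on the hidden->output weights.
    y_aug: hidden outputs with y_0 = +1 prepended; t: target vector."""
    net_k = W_output @ y_aug
    delta = sigmoid(net_k)                        # network outputs delta_k
    f_prime = delta * (1.0 - delta)               # f'(net_k) for the sigmoid
    # Delta w_kj = eta * (t_k - delta_k) * f'(net_k) * y_j
    dW = eta * np.outer((t - delta) * f_prime, y_aug)
    return W_output + dW                          # w(m+1) = w(m) + Delta w(m)
```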

1.4. Affine Layer

A fully connected layer in a neural network. "Affine" means that every neuron in the previous layer is connected to every neuron in the current layer. In many ways this is the "standard" layer of a neural network. Affine layers are often placed on top of the outputs of convolutional or recurrent neural networks before the final prediction is made. The general form of an affine layer is y = f(Wx + b), where x is the layer input, W the parameters, b a bias vector, and f a nonlinear activation function.
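A minimal sketch of such a layer, y = f(Wx + b); the ReLU choice for f is just one possible nonlinearity:

```python
import numpy as np

def affine_forward(x, W, b, f=lambda v: np.maximum(v, 0.0)):
    # fully connected layer: every input unit connects to every output unit
    return f(W @ x + b)        # y = f(Wx + b)

x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.2, -0.1, 0.4],
              [0.7,  0.3, -0.5]])
b = np.array([0.1, -0.2])
print(affine_forward(x, W, b))
```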

1.5. Multilayer Perceptron (MLP)

A multilayer perceptron is a feedforward neural network with multiple fully connected layers that use nonlinear activation functions to deal with data that are not linearly separable. An MLP is the most basic form of a multilayer neural network, or of a deep neural network when it has more than two layers.

1.6. Backpropagation

Backpropagation is an algorithm for efficiently computing gradients in a neural network, or more generally in a feedforward computational graph. It boils down to applying the chain rule of differentiation starting from the network output and propagating the gradients backward. The first applications of backpropagation go back to Vapnik and others in the 1960s, but the paper Learning representations by back-propagating errors is usually cited as the origin.

Blog post: Calculus on Computational Graphs: Backpropagation (http://colah.github.io/posts/2015-08-Backprop/)

1.7. Backpropagation Through Time (BPTT)

Backpropagation Through Time is the backpropagation algorithm applied to recurrent neural networks (RNNs). BPTT can be seen as the standard backpropagation algorithm applied to an RNN in which every time step represents a layer of computation and the parameters are shared across layers. Because an RNN shares the same parameters at all time steps, the error at one time step must be propagated "back through time" to all previous time steps, hence the name. When dealing with long sequences (hundreds of inputs), a truncated version of BPTT is often used to reduce the computational cost. Truncated BPTT stops backpropagating the error after a fixed number of steps.

Paper: Backpropagation Through Time: What It Does and How to Do It
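As a hedged illustration (not a reference implementation), the sketch below runs truncated BPTT on a minimal vanilla RNN whose output is simply its hidden state, with a per-step squared error; the tanh cell, the loss, and the window-based truncation are assumptions made only for this example. Note how the shared parameters W_xh and W_hh accumulate gradient contributions from every time step inside the window, while no error flows back across the window boundary:

```python
import numpy as np

def bptt_window(xs, ys, h_prev, W_xh, W_hh, eta=0.1):
    """Truncated BPTT over one window of K = len(xs) time steps.
    The final hidden state is carried to the next window, but gradients
    are not propagated across the window boundary (the truncation)."""
    # forward pass, storing the states needed for the backward pass
    hs = [h_prev]
    for x in xs:
        hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
    # backward pass through time within the window
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros_like(h_prev)
    for t in reversed(range(len(xs))):
        dh = (hs[t + 1] - ys[t]) + dh_next     # local loss grad + grad from step t+1
        dnet = dh * (1.0 - hs[t + 1] ** 2)     # back through tanh
        dW_xh += np.outer(dnet, xs[t])         # shared-parameter gradients accumulate
        dW_hh += np.outer(dnet, hs[t])
        dh_next = W_hh.T @ dnet                # error flowing back "through time"
    W_xh -= eta * dW_xh
    W_hh -= eta * dW_hh
    return hs[-1], W_xh, W_hh
```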

Given a training set $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$ with $x_i \in \mathbb R^d$ and $y_i \in \mathbb R^c$, i.e. each input instance is described by $d$ attributes and the output is a $c$-dimensional real-valued vector.

The figure below shows a multilayer feedforward network with $d$ input neurons, $m$ hidden neurons, and $c$ output neurons:

[Figure: ann.png]

  • The threshold of the $h$-th hidden neuron is denoted by $\gamma_h$;
  • the threshold of the $j$-th output neuron is denoted by $\theta_j$.
  • The connection weight between the $i$-th input neuron and the $h$-th hidden neuron is $v_{ih}$;
  • the connection weight between the $h$-th hidden neuron and the $j$-th output neuron is $w_{hj}$.
  • The input received by the $h$-th hidden neuron is $\alpha_h=\sum_{i=1}^d v_{ih}x_i$;
  • the input received by the $j$-th output neuron is $\beta_j=\sum_{h=1}^m w_{hj}b_h$, where $b_h$ is the output of the $h$-th hidden neuron.
  • Assume that both the hidden and output neurons use the logistic function: $P(t)=\frac 1{1+e^{-t}}$

Taking sequential (online) learning as an example, for a training sample $(x_k,y_k)$, suppose the network output is $\hat y_k=(\hat y_1^k,\hat y_2^k,\ldots,\hat y_c^k)$, i.e.

$$ \hat y_j^k=f(\beta_j-\theta_j), $$

then the mean squared error (as in the LMS algorithm) of the network on $(x_k,y_k)$ is

$$ E_k=\frac 12\sum_{j=1}^c(\hat y_j^k-y_j^k)^2. $$

Following the generalized perceptron learning rule, any parameter $v$ is updated as

$$ v \leftarrow v+\Delta v $$

The BP algorithm uses a gradient-descent strategy: for the error $E_k$ and a given learning rate $\eta$,

$$ \Delta w_{hj}=-\eta \frac {\partial E_k}{\partial w_{hj}} $$

Since $w_{hj}$ first affects the input $\beta_j$ of the $j$-th output neuron, then its output $\hat y_j^k$, and finally $E_k$, we have

$$ \frac {\partial E_k}{\partial w_{hj}}=\frac {\partial E_k}{\partial \hat y_j^k}\frac {\partial \hat y_j^k}{\partial \beta_j}\frac {\partial \beta_j}{\partial w_{hj}} $$

From the definition of $\beta_j$, clearly

$$ \frac {\partial \beta_j}{\partial w_{hj}}=b_h $$

The logistic function has the property

$$ f'(x)=f(x)(1-f(x)) $$

so

$$ g_j=-\frac {\partial E_k}{\partial \hat y_j^k}\frac {\partial \hat y_j^k}{\partial \beta_j}=-(\hat y_j^k-y_j^k)f'(\beta_j-\theta_j)=\hat y_j^k(1-\hat y_j^k)(y_j^k-\hat y_j^k). $$

This yields the BP update rule for $w_{hj}$:

$$ \Delta w_{hj}=\eta g_j b_h $$

From the derivation above,

$$ \begin{aligned} \Delta \theta_j&=-\eta\frac {\partial E_k}{\partial \hat y_j^k}\cdot\frac {\partial \hat y_j^k}{\partial \theta_j}\\ &=\eta(\hat y_j^k-y_j^k)f'(\beta_j-\theta_j)\\ &=-\eta g_j \end{aligned} $$

$b_h$ is the output of the $h$-th hidden neuron, i.e.

$$ b_h=f(\alpha_h-\gamma_h) $$

Since $v_{ih}$ first affects the input $\alpha_h$ of the $h$-th hidden neuron, then its output $b_h$, and finally $E_k$, we have

$$ \frac {\partial E_k}{\partial v_{ih}}=\frac {\partial E_k}{\partial b_h}\cdot\frac {\partial b_h}{\partial \alpha_h}\cdot\frac {\partial \alpha_h}{\partial v_{ih}} $$

Analogously to $g_j$ above,

$$ \begin{aligned} e_h&=-\frac {\partial E_k}{\partial b_h}\cdot\frac {\partial b_h}{\partial \alpha_h}\\ &=-\sum_{j=1}^c\frac {\partial E_k}{\partial \beta_j}\cdot \frac {\partial \beta_j}{\partial b_h}f'(\alpha_h-\gamma_h)\\ &=\sum_{j=1}^c w_{hj}g_j f'(\alpha_h-\gamma_h)\\ &=b_h(1-b_h)\sum_{j=1}^c w_{hj}g_j \end{aligned} $$

Therefore

$$ \Delta v_{ih}=\eta e_h x_i $$

and, analogously to $\Delta \theta_j$,

$$ \Delta \gamma_h=-\eta e_h $$
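A minimal sketch of one sequential-mode training step that applies the update rules just derived, $\Delta w_{hj}=\eta g_j b_h$, $\Delta\theta_j=-\eta g_j$, $\Delta v_{ih}=\eta e_h x_i$, $\Delta\gamma_h=-\eta e_h$; the array shapes and names are illustrative:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def bp_step(x, y, V, gamma, W, theta, eta=0.1):
    """One BP update on a single example (x, y).
    V: (m, d) input->hidden weights v_ih, gamma: (m,) hidden thresholds,
    W: (c, m) hidden->output weights w_hj, theta: (c,) output thresholds."""
    # forward pass
    alpha = V @ x                      # hidden inputs alpha_h
    b = logistic(alpha - gamma)        # hidden outputs b_h
    beta = W @ b                       # output inputs beta_j
    y_hat = logistic(beta - theta)     # network outputs \hat y_j
    # error terms
    g = y_hat * (1.0 - y_hat) * (y - y_hat)   # output-layer g_j
    e = b * (1.0 - b) * (W.T @ g)             # hidden-layer e_h
    # parameter updates
    W += eta * np.outer(g, b)          # Delta w_hj = eta * g_j * b_h
    theta -= eta * g                   # Delta theta_j = -eta * g_j
    V += eta * np.outer(e, x)          # Delta v_ih = eta * e_h * x_i
    gamma -= eta * e                   # Delta gamma_h = -eta * e_h
    return V, gamma, W, theta
```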

Signal-flow graph: [Figure: signal-flow diagram of the BP algorithm]

1.8. Siamese Networks

One Shot Learning with Siamese Networks in PyTorch
