# 1. 优化算法

• Choosing a suitable learning rate is difficult.
• The same learning rate is applied to every parameter. With sparse data or features, we may want larger updates for rarely occurring features and smaller updates for frequent ones, which plain SGD cannot provide.
• SGD tends to converge to local optima and is easily trapped at saddle points.

### 1.1.1. Momentum

• It damps oscillation in directions of high curvature by combining gradients with opposite signs.
• It builds up speed in directions with a gentle but consistent gradient.

$\mathrm v(t)=\alpha \mathrm v(t-1)-\varepsilon \frac {\partial E}{\partial \mathrm w}(t)$

• The gradient increments the previous velocity.
• The previous velocity decays by the momentum $\alpha$ ($\alpha<1$).

$\Delta \mathrm w(t)=\mathrm v(t)=\alpha \Delta \mathrm w(t-1)-\varepsilon \frac {\partial E}{\partial \mathrm w}(t)$

terminal velocity (reached when the gradient stays constant): $\frac 1{1-\alpha}\left(-\varepsilon \frac{\partial E}{\partial \mathrm w}\right)$

$\alpha=0.9$ corresponds to multiplying the maximum speed by 10 relative to plain gradient descent.

It is less important to adapt $\alpha$ over time than to shrink $\varepsilon$ over time.
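As a minimal sketch (not from the lecture), the velocity update above applied to a toy quadratic $E(w)=\tfrac12 w^2$, whose gradient is simply $w$:

```python
alpha = 0.9      # momentum: decays the previous velocity
eps = 0.1        # learning rate
w, v = 5.0, 0.0  # initial weight and velocity (toy values)

for _ in range(200):
    grad = w                    # dE/dw for E(w) = 0.5 * w**2
    v = alpha * v - eps * grad  # v(t) = alpha * v(t-1) - eps * dE/dw(t)
    w = w + v                   # delta_w(t) = v(t)
```

The weight spirals into the minimum at $w=0$; the velocity lets it pick up speed while the gradient keeps pointing the same way.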

#### Nesterov Momentum

First make a jump

Then measure the gradient, make a correction.

"It turns out, if you're going to gamble, it's much better to gamble and then make a correction than to make a correction and then gamble."
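A minimal sketch of this jump-then-correct order on the same kind of toy quadratic (names and constants are illustrative):

```python
alpha, eps = 0.9, 0.1  # momentum and learning rate
w, v = 5.0, 0.0        # initial weight and velocity

def grad(w):
    return w  # dE/dw for E(w) = 0.5 * w**2

for _ in range(200):
    lookahead = w + alpha * v              # first make the jump along the old velocity
    v = alpha * v - eps * grad(lookahead)  # then measure the gradient there and correct
    w = w + v
```

The only difference from standard momentum is where the gradient is evaluated: at the look-ahead point rather than the current weight.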

To determine the individual learning rates, keep a local gain for each weight:

• $+\delta$ : Increase the local gain additively if the gradient for that weight does not change sign.
• $\times(1-\delta)$ : Otherwise decrease it multiplicatively. Small additive increases and multiplicative decreases ensure that large gains decay rapidly once oscillations start.

Tricks:

• Limit the gains to lie in some reasonable range.
• Use full batch learning or very big mini-batches.
• Adaptive learning rates can be combined with momentum.
• Adaptive learning rates only deal with axis-aligned effects.
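The gain rule and the range trick above can be sketched per weight as follows ($\delta=0.05$ and the bounds 0.1 and 10 are illustrative assumptions):

```python
DELTA = 0.05  # illustrative value for delta

def update_gain(gain, grad, prev_grad):
    """Adapt one weight's local gain from two successive gradients."""
    if grad * prev_grad > 0:
        gain += DELTA        # gradient kept its sign: small additive increase
    else:
        gain *= 1 - DELTA    # sign flipped: multiplicative decrease
    # Trick: limit the gains to lie in some reasonable range.
    return min(max(gain, 0.1), 10.0)

g_up = update_gain(1.0, 0.3, 0.2)     # consistent sign -> 1.05
g_down = update_gain(1.0, -0.3, 0.2)  # sign flip -> 0.95
```

The effective step for that weight is then its gain times the global learning rate.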

### 1.1.5. RMSProp

• The magnitude of the gradient can be very different for different weights and can change during learning.
• This makes it hard to choose a single global learning rate.
• For full batch learning, we can deal with this variation by only using the sign of the gradient.

rprop: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.

RMSProp: A mini-batch version of rprop

• Keep a moving average of the squared gradient for each weight

$MeanSquare(w,t)=0.9 MeanSquare(w,t-1)+0.1(\frac {\partial E}{\partial w}(t))^2$

• Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better.
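A minimal sketch of this update on a toy quadratic (the small constant `1e-8` guarding against division by zero is an assumption, not part of the slide):

```python
import math

eps = 0.01       # global learning rate
w = 5.0          # weight to optimize, E(w) = 0.5 * w**2
mean_square = 0.0

for t in range(1000):
    grad = w  # dE/dw
    # MeanSquare(w,t) = 0.9 * MeanSquare(w,t-1) + 0.1 * grad**2
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    # Divide the gradient by the root mean square before stepping.
    w -= eps * grad / (math.sqrt(mean_square) + 1e-8)
```

Because the gradient is divided by its own running magnitude, every step is roughly the same size regardless of how large or small the raw gradient is.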

Slides: Neural Networks for Machine Learning, Lecture 6a

## 1.2. Problems

### 1.2.2. Hard Example Mining

```python
import torch as th

class NLL_OHEM(th.nn.NLLLoss):
    """Online hard example mining.
    Needs input from nn.LogSoftmax()."""

    def __init__(self, ratio):
        super(NLL_OHEM, self).__init__(None, True)
        self.ratio = ratio

    def forward(self, x, y, ratio=None):
        if ratio is not None:
            self.ratio = ratio
        num_inst = x.size(0)
        num_hns = int(self.ratio * num_inst)  # number of hard examples to keep
        x_ = x.clone()
        # Per-instance loss: the negative log-probability of the true class.
        inst_losses = -x_[th.arange(num_inst), y]
        # Keep only the num_hns examples with the largest loss ("hardest").
        _, idxs = inst_losses.topk(num_hns)
        x_hn = x.index_select(0, idxs)
        y_hn = y.index_select(0, idxs)
        return th.nn.functional.nll_loss(x_hn, y_hn)
```
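The same hard-example selection can be exercised with the functional API alone (a standalone sketch on toy data; the 50% keep ratio is illustrative):

```python
import torch as th
import torch.nn.functional as F

th.manual_seed(0)
logits = th.randn(8, 5)             # 8 instances, 5 classes (toy data)
y = th.randint(0, 5, (8,))
log_probs = F.log_softmax(logits, dim=1)

# Per-instance negative log-likelihood, unreduced.
losses = F.nll_loss(log_probs, y, reduction="none")
# Keep the hardest half of the batch and average only those.
_, idxs = losses.topk(4)
hard_loss = F.nll_loss(log_probs[idxs], y[idxs])
```

Averaging only the top-k losses means easy, already-correct examples contribute no gradient, which is the point of online hard example mining.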