# 1. Optimization Algorithms

• Choosing a suitable learning rate is difficult.
• SGD uses the same learning rate for every parameter. With sparse data or features, we may want larger updates for rarely occurring features and smaller updates for frequent ones, which plain SGD cannot provide.
• SGD tends to converge to local optima and is easily trapped at saddle points.

### 1.1.1. Momentum

• It damps oscillation in directions of high curvature by combining gradients with opposite signs.
• It builds up speed in directions with a gentle but consistent gradient.

• The velocity is incremented by the current (scaled, negative) gradient.
• The previous velocity decays by the momentum coefficient at each step.
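The update above can be sketched for a single weight. The names `alpha` (momentum coefficient) and `eps` (learning rate), as well as the toy quadratic objective, are illustrative choices, not values from the slides:

```python
def momentum_step(w, v, grad, alpha=0.9, eps=0.01):
    """One step of gradient descent with momentum.

    The velocity accumulates gradients, decaying by the momentum
    coefficient `alpha` at each step; the weight moves by the velocity.
    """
    v = alpha * v - eps * grad  # decay previous velocity, add scaled gradient
    w = w + v                   # move by the current velocity
    return w, v


# Minimizing E(w) = w^2 (gradient 2w), starting from w = 1.0:
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2.0 * w)
```

Because the velocity averages successive gradients, components that flip sign (high curvature) cancel, while consistent components accumulate.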

Terminal velocity: if the gradient stays constant at $\frac{\partial E}{\partial w}$, the velocity settles at

$$v(\infty) = -\frac{\epsilon}{1-\alpha}\frac{\partial E}{\partial w}$$

With a momentum coefficient of $\alpha = 0.9$, this corresponds to multiplying the maximum speed by 10 relative to plain gradient descent.

It is less important to adapt over time than to shrink over time.

#### Nesterov Momentum

First make a jump

Then measure the gradient, make a correction.

"It turns out, if you're going to gamble, it's much better to gamble and then make a correction, than to make a correction and then gamble."
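The jump-then-correct update can be sketched for a single weight (the constants and the toy objective are illustrative, as before):

```python
def nesterov_step(w, v, grad_fn, alpha=0.9, eps=0.01):
    """One step of Nesterov momentum.

    First jump in the direction of the accumulated velocity, then
    measure the gradient at the lookahead point and make a correction.
    """
    lookahead = w + alpha * v                  # first make a jump
    v = alpha * v - eps * grad_fn(lookahead)   # measure there, correct
    w = w + v
    return w, v


# Minimizing E(w) = w^2 (gradient 2w), starting from w = 1.0:
w, v = 1.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, lambda x: 2.0 * x)
```

The only difference from standard momentum is where the gradient is evaluated: at the lookahead point rather than the current weights.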

To determine the individual learning rates:

• Increase the local gain if the gradient for that weight does not change sign; decrease it when the sign flips.
• Use small additive increases and multiplicative decreases (e.g. add 0.05 on sign agreement, multiply by 0.95 on a sign flip), so that large gains decay rapidly once oscillation starts.

Tricks:

• Limit the gains to lie in some reasonable range.
• Use full batch learning or very big mini-batches.
• Adaptive learning rates can be combined with momentum.
• Adaptive learning rates only deal with axis-aligned effects.
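A per-weight sketch of the additive-increase / multiplicative-decrease gain rule together with the range-limiting trick. The constants 0.05 / 0.95 and the range [0.1, 10] are illustrative assumptions, not prescribed values:

```python
def update_gain(gain, grad, prev_grad, inc=0.05, dec=0.95,
                lo=0.1, hi=10.0):
    """Update one weight's local gain.

    Additive increase when the gradient keeps its sign, multiplicative
    decrease when it flips; the gain multiplies a global learning rate.
    """
    if grad * prev_grad > 0:     # gradient kept its sign: grow slowly
        gain = gain + inc
    else:                        # sign flipped (oscillation): shrink fast
        gain = gain * dec
    # Trick: limit the gains to lie in some reasonable range.
    return min(max(gain, lo), hi)
```

The effective step for the weight is then `global_lr * gain * grad`, so gains only rescale axis-aligned step sizes; combining them with momentum handles non-axis-aligned structure.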

### 1.1.5. RMSProp

• The magnitude of the gradient can be very different for different weights and can change during learning.
• This makes it hard to choose a single global learning rate.
• For full batch learning, we can deal with this variation by only using the sign of the gradient.

rprop: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
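A minimal per-weight sketch of that idea. The growth/shrink factors 1.2 / 0.5 and the step bounds are common defaults assumed here; the full algorithm has further details (e.g. reverting the previous update on a sign change) that are omitted:

```python
def rprop_step(w, step, grad, prev_grad, up=1.2, down=0.5,
               min_step=1e-6, max_step=50.0):
    """One rprop update for a single weight.

    Only the sign of the gradient is used to move the weight; the step
    size itself is adapted multiplicatively, separately per weight.
    """
    if grad * prev_grad > 0:            # same sign: grow the step
        step = min(step * up, max_step)
    elif grad * prev_grad < 0:          # sign flip: shrink the step
        step = max(step * down, min_step)
    # Move by the step size, opposite to the gradient's sign.
    if grad > 0:
        w -= step
    elif grad < 0:
        w += step
    return w, step
```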

RMSProp: A mini-batch version of rprop

• Keep a moving average of the squared gradient for each weight:

$$\text{MeanSquare}(w,t) = 0.9\,\text{MeanSquare}(w,t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

• Dividing the gradient by the square root of that moving average makes the learning work much better.
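Putting the two bullets together, a per-weight sketch (the decay 0.9, the learning rate, and the small `delta` added to avoid division by zero are assumed illustrative values):

```python
def rmsprop_step(w, mean_square, grad, eps=0.001, decay=0.9, delta=1e-8):
    """One RMSProp update for a single weight.

    Keeps a moving average of the squared gradient and divides the
    gradient by its square root before applying the learning rate.
    """
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - eps * grad / (mean_square ** 0.5 + delta)
    return w, mean_square
```

Because the gradient is divided by its own root-mean-square magnitude, each weight effectively takes steps of roughly the same size regardless of how large its raw gradient is, which is what makes a single global learning rate workable.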

Slides: Neural Networks for Machine Learning, Lecture 6a

## 1.2. Issues

### 1.2.2. Hard Example Mining

```python
import torch as th


class NLL_OHEM(th.nn.NLLLoss):
    """Online hard example mining.

    Expects log-probabilities as input, e.g. the output of nn.LogSoftmax().
    """

    def __init__(self, ratio):
        super(NLL_OHEM, self).__init__(None, True)
        self.ratio = ratio

    def forward(self, x, y, ratio=None):
        if ratio is not None:
            self.ratio = ratio
        num_inst = x.size(0)
        num_hns = int(self.ratio * num_inst)
        # Per-instance loss: the negative log-probability of the true class.
        inst_losses = th.zeros(num_inst)
        for idx, label in enumerate(y.data):
            inst_losses[idx] = -x.data[idx, label]
        # Keep only the num_hns hardest (largest-loss) examples.
        _, idxs = inst_losses.topk(num_hns)
        x_hn = x.index_select(0, idxs)
        y_hn = y.index_select(0, idxs)
        return th.nn.functional.nll_loss(x_hn, y_hn)
```