Optimization Procedure
A good choice for the criterion is maximum likelihood regularized with dropout, possibly also with weight decay.
A good choice for the optimization algorithm for a feed-forward network is usually stochastic gradient descent with momentum.
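As a concrete sketch of this recipe in PyTorch (the architecture, dropout rate, learning rate, momentum, and weight-decay values below are illustrative placeholders, not recommendations):

```python
import torch
import torch.nn as nn

# Hypothetical feed-forward network; sizes and dropout rate are illustrative only.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10))

# SGD with momentum; weight_decay adds the optional L2 (weight decay) penalty.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

criterion = nn.CrossEntropyLoss()  # maximum-likelihood criterion for classification

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))  # dummy minibatch
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```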
Loss Function and Conditional Log-Likelihood
In the 1980s and 1990s the most commonly used loss function was the squared error $L(f_{\theta}(x), y) = \|f_{\theta}(x) - y\|^{2}$. If $f$ is unrestricted (non-parametric), minimizing the expected squared error yields an estimator of the conditional expectation $E[y \mid x]$. Replacing the squared error by an absolute value makes the neural network estimate not the conditional expectation but the conditional median.
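Stated compactly, for an unrestricted $f$ and expectations taken over the data distribution:

$$
f^{*}(x) = E[y \mid x] \quad \text{for } f^{*} = \arg\min_{f} E\bigl[\|y - f(x)\|^{2}\bigr],
\qquad
f^{*}(x) = \operatorname{median}(y \mid x) \quad \text{for } f^{*} = \arg\min_{f} E\bigl[\,|y - f(x)|\,\bigr].
$$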
Cross-entropy objective: when $y$ is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood have been found to be more appropriate than the squared error ($y \in \{0,1\}$). To keep $f_{\theta}(x)$ strictly between 0 and 1, use the sigmoid as the non-linearity of the output layer, which matches well with the binomial negative log-likelihood cost function.
The squared error is often halved (multiplied by $\frac{1}{2}$) as a convenience for gradient computation, since the factor of 2 produced by differentiating the square cancels the $\frac{1}{2}$.
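For a single example, the derivative shows the cancellation explicitly:

$$
\frac{\partial}{\partial \theta}\,\frac{1}{2}\bigl(f_{\theta}(x) - y\bigr)^{2}
= \bigl(f_{\theta}(x) - y\bigr)\,\frac{\partial f_{\theta}(x)}{\partial \theta}.
$$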
Learning a Conditional Probability Model
Negative Log-Likelihood (NLL)
More generally, the loss function can be defined as a conditional log-likelihood, i.e., the negative log-likelihood (NLL) cost function $L_{NLL}(f_{\theta}(x), y) = -\log P(y \mid x; \theta)$.
Example: if $y$ is a continuous random variable and we assume that, given $x$, it has a Gaussian distribution with mean $f_{\theta}(x)$ and variance $\sigma^{2}$, then the NLL is

$$
-\log P(y \mid x; \theta) = \frac{1}{2\sigma^{2}}\bigl(f_{\theta}(x) - y\bigr)^{2} + \log\bigl(\sigma\sqrt{2\pi}\bigr).
$$

Minimizing this negative log-likelihood is therefore equivalent to minimizing the squared error loss.
For discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution (also known as cross entropy) with probability $p = f_{\theta}(x)$ of generating $y = 1$ given $x$:

$$
L_{NLL}(f_{\theta}(x), y) = -y \log f_{\theta}(x) - (1 - y)\log\bigl(1 - f_{\theta}(x)\bigr).
$$
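A minimal PyTorch sketch of this binomial NLL with a sigmoid output (shapes and values are illustrative; BCEWithLogitsLoss fuses the sigmoid with the log for numerical stability):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 1, requires_grad=True)   # raw outputs of the last layer
y = torch.randint(0, 2, (8, 1)).float()          # binary targets in {0, 1}

# Two equivalent ways to compute L = -y*log(p) - (1-y)*log(1-p) with p = sigmoid(logits)
p = torch.sigmoid(logits)
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
fused = nn.BCEWithLogitsLoss()(logits, y)        # numerically safer version
```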
Categorical Cross-Entropy Loss
Categorical cross-entropy loss is also known as the negative log-likelihood. It is a popular loss function for classification problems, measuring the similarity between two probability distributions (typically the true labels and the predicted labels). It can be written as $L = -\sum y \log(y_{\text{prediction}})$, where $y$ is the probability distribution of the true labels (usually a one-hot vector) and $y_{\text{prediction}}$ is the predicted probability distribution, usually the output of a softmax.
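A minimal PyTorch sketch of the formula above, comparing the explicit one-hot sum with the built-in cross_entropy, which takes integer class indices rather than one-hot vectors (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)                          # 4 examples, 5 classes
targets = torch.randint(0, 5, (4,))                 # true class indices
one_hot = F.one_hot(targets, num_classes=5).float()

log_probs = F.log_softmax(logits, dim=1)            # log of the softmax predictions
manual = -(one_hot * log_probs).sum(dim=1).mean()   # L = -sum(y * log(y_prediction))
builtin = F.cross_entropy(logits, targets)          # same value, computed directly
```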
Tukey's Loss
See the paper "Robust Optimization for Deep Regression", which uses Tukey's biweight function to bound the influence of outliers on the regression loss.
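As a rough illustration, a minimal sketch of Tukey's biweight function applied element-wise to the residuals; the paper additionally scales residuals by a robust estimate of their spread (the median absolute deviation), which is omitted here, and c = 4.685 is the conventional tuning constant:

```python
import torch

def tukey_biweight_loss(pred, target, c=4.685):
    # rho(r) = (c^2/6) * (1 - (1 - (r/c)^2)^3) for |r| <= c, and c^2/6 otherwise,
    # so large residuals (outliers) contribute only a constant, bounded penalty.
    r = pred - target
    inlier = (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    outlier = torch.full_like(r, c ** 2 / 6.0)
    return torch.where(r.abs() <= c, inlier, outlier).mean()
```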
Dice Loss
Commonly used for image segmentation tasks; a PyTorch implementation is sketched below.
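Since the linked implementation is not reproduced here, the following is a minimal sketch of a soft Dice loss for binary masks, assuming pred holds probabilities in [0, 1] (e.g. after a sigmoid) and eps is a small smoothing constant to avoid division by zero:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice: 1 - 2*|P ∩ T| / (|P| + |T|), computed per example and averaged.
    pred = pred.reshape(pred.size(0), -1)
    target = target.reshape(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    return 1.0 - dice.mean()
```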
Perceptual Loss
Perceptual loss: features extracted by a convolutional neural network can be used as part of the objective function. By comparing the CNN features of the image being generated with the CNN features of the target image, the generated image is pushed to be more semantically similar to the target, as opposed to similar only at the pixel level.
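A minimal sketch of this idea, assuming torchvision's pretrained VGG16 as the feature extractor; the cut-off layer is an illustrative choice, and inputs are assumed to be normalized the way the pretrained network expects:

```python
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    # Compare two images in the feature space of a frozen, pretrained CNN
    # instead of (or in addition to) comparing them pixel by pixel.
    def __init__(self, cutoff=9):  # first 9 layers of vgg16.features (illustrative)
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:cutoff]
        for p in vgg.parameters():
            p.requires_grad = False   # the feature extractor is not trained
        self.features = vgg.eval()
        self.criterion = nn.MSELoss()

    def forward(self, generated, target):
        return self.criterion(self.features(generated), self.features(target))
```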
Focal Loss
See the ICCV paper on focal loss, "Focal Loss for Dense Object Detection".
The PyTorch port of Detectron does not implement it yet; the official Detectron is built on Caffe2. A PyTorch implementation is given below.
import torch
import torch.nn.functional as F

def focal_loss(inputs, targets, gamma=2):
    # inputs: raw logits of shape (N, C); targets: integer class labels of shape (N,)
    N, C = inputs.size(0), inputs.size(1)
    P = F.softmax(inputs, dim=1)                   # class probabilities, softmax(x)
    class_mask = inputs.new_zeros(N, C)            # one-hot encoding of the targets
    ids = targets.view(-1, 1)
    class_mask.scatter_(1, ids, 1.)
    probs = (P * class_mask).sum(1).view(-1, 1)    # probability of the true class, softmax(x)_class
    log_p = probs.log()
    # Down-weight well-classified examples by the modulating factor (1 - p)^gamma
    batch_loss = -(torch.pow((1 - probs), gamma)) * log_p
    loss = batch_loss.mean()
    return loss
Parameters typically exposed by a full FocalLoss criterion (the function above omits the α weighting and always averages):
- alpha (1D Tensor): the scalar weighting factor for this criterion.
- gamma (float, double): $\gamma > 0$ reduces the relative loss for well-classified examples ($p > 0.5$), putting more focus on hard, misclassified examples.
- size_average (bool): by default, the losses are averaged over the observations in each minibatch; if size_average is set to False, the losses are instead summed for each minibatch.
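A quick usage sketch for the focal_loss function above (shapes and values are illustrative):

```python
import torch

logits = torch.randn(4, 10, requires_grad=True)   # 4 examples, 10 classes
targets = torch.randint(0, 10, (4,))              # integer class labels
loss = focal_loss(logits, targets, gamma=2)
loss.backward()
```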
Huber Loss