Optimization Procedure¶
A good choice for the criterion is maximum likelihood regularized with dropout, possibly also with weight decay.
A good choice for the optimization algorithm for a feed-forward network is usually stochastic gradient descent with momentum.
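A minimal PyTorch sketch of this recipe (layer sizes and hyperparameters are illustrative placeholders, not recommendations from the text):

```python
import torch
import torch.nn as nn

# Hypothetical feed-forward classifier; layer sizes are placeholders
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),   # dropout regularization
    nn.Linear(256, 10),
)

criterion = nn.CrossEntropyLoss()   # maximum-likelihood criterion for classification
# SGD with momentum; weight_decay adds the optional L2 (weight decay) penalty
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```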
Loss Function and Conditional Log-Likelihood¶
In the 1980s and 1990s, the most commonly used loss function was the squared error
$$ L(f_{\theta}(x), y) = \| f_{\theta}(x) - y \|^{2} $$
If f is unrestricted (non-parametric), the minimizer of the expected squared error is the conditional expectation:
$$ f(x) = E[y \mid x] $$
Replacing the squared error with an absolute value makes the neural network estimate not the conditional expectation but the conditional median.
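As a quick sanity check of these two claims, a small NumPy sketch with made-up targets shows that the constant prediction minimizing the squared error is the sample mean, while the one minimizing the absolute error is the sample median:

```python
import numpy as np

# Made-up 1-D targets with one outlier
y = np.array([0.9, 1.1, 1.0, 1.2, 5.0])

# Candidate constant predictions on a fine grid
c = np.linspace(0.0, 6.0, 6001)

squared_err = ((y[None, :] - c[:, None]) ** 2).mean(axis=1)
absolute_err = np.abs(y[None, :] - c[:, None]).mean(axis=1)

print(c[squared_err.argmin()], y.mean())        # ~1.84: minimizer of squared error is the mean
print(c[absolute_err.argmin()], np.median(y))   # ~1.10: minimizer of absolute error is the median
```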
Cross-entropy objective function: when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood have been found to be more appropriate than the squared error ($y \in \{0, 1\}$).
To constrain $f_{\theta}(x)$ to be strictly between 0 and 1, use the sigmoid as the non-linearity for the output layer; this matches well with the binomial negative log-likelihood cost function.
The mean is halved ($\frac{1}{2}$) as a convenience for the computation of gradient descent, since the derivative of the squared term cancels out the $\frac{1}{2}$ factor.
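Concretely, for a single example the halved squared error and its gradient with respect to a parameter are:

$$ J(\theta) = \frac{1}{2}\left(f_{\theta}(x) - y\right)^{2}, \qquad \frac{\partial J}{\partial \theta} = \left(f_{\theta}(x) - y\right)\frac{\partial f_{\theta}(x)}{\partial \theta} $$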
Learning a Conditional Probability Model¶
Negative log-likelihood (NLL): more generally, we can interpret the loss function as corresponding to a conditional log-likelihood, i.e., use the negative log-likelihood (NLL) cost function
$$ L_{NLL}(f_{\theta}(x), y) = -\log P(y \mid x; \theta) $$
Example: if y is a continuous random variable and we assume that, given x, it has a Gaussian distribution with mean $f_{\theta}(x)$ and variance $\sigma^{2}$, then
$$ -\log P(y \mid x; \theta) = \frac{1}{2\sigma^{2}} \| f_{\theta}(x) - y \|^{2} + \log\left(\sigma\sqrt{2\pi}\right) $$
Minimizing this negative log-likelihood is therefore equivalent to minimizing the squared error loss.
For discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution (also known as cross-entropy), with probability $p = f_{\theta}(x)$ of generating $y = 1$ given $x$:
$$ \begin{aligned} L_{NLL} &= -\log P(y \mid x; \theta) = -\mathbb{1}_{y=1} \log p - \mathbb{1}_{y=0} \log(1-p) \\ &= -y \log f_{\theta}(x) - (1-y) \log\left(1 - f_{\theta}(x)\right) \end{aligned} $$
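A quick PyTorch check of this Bernoulli NLL against the built-in (the logit and target values here are arbitrary):

```python
import torch
import torch.nn.functional as F

logit = torch.tensor([0.3])   # raw output of the last layer (before the sigmoid)
y = torch.tensor([1.0])       # binary target

p = torch.sigmoid(logit)      # p = f_theta(x), strictly between 0 and 1
nll_manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))

# Numerically stable built-in that fuses the sigmoid and the Bernoulli NLL
nll_builtin = F.binary_cross_entropy_with_logits(logit, y)

print(nll_manual.item(), nll_builtin.item())   # both ~0.5544
```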
Categorical Cross-Entropy Loss¶
Categorical cross-entropy loss is also known as negative log-likelihood. It is a popular loss function for classification problems, measuring the similarity between two probability distributions (typically the true labels and the predicted labels). It can be written as $L = -\sum(y * \log(y_{prediction}))$, where $y$ is the probability distribution of the true labels (usually a one-hot vector) and $y_{prediction}$ is the probability distribution of the predicted labels, typically coming from a softmax.
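A small check of this formula for three hypothetical classes against PyTorch's built-in cross-entropy, which fuses the softmax and the NLL (the logits and target are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw scores for 3 classes
target = torch.tensor([0])                   # index of the true class

y_prediction = F.softmax(logits, dim=1)              # predicted distribution
y_true = F.one_hot(target, num_classes=3).float()    # one-hot true distribution

loss_manual = -(y_true * torch.log(y_prediction)).sum(dim=1).mean()
loss_builtin = F.cross_entropy(logits, target)       # softmax + NLL in one call

print(loss_manual.item(), loss_builtin.item())       # both ~0.2413
```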
Tukey's Loss¶
Robust Optimization for Deep Regression
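That paper applies Tukey's biweight function to residuals scaled by a robust estimate such as the MAD; a rough sketch of the biweight loss with the usual tuning constant c = 4.685 (the function name is illustrative, and the MAD scaling step is omitted):

```python
import torch

def tukey_biweight_loss(residual, c=4.685):
    """Tukey's biweight: roughly quadratic near zero, constant beyond |r| > c,
    so large (outlier) residuals stop contributing gradient."""
    r = residual.abs()
    inlier = (c ** 2 / 6) * (1 - (1 - (r / c) ** 2) ** 3)
    outlier = torch.full_like(r, c ** 2 / 6)
    return torch.where(r <= c, inlier, outlier).mean()
```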
Dice Loss¶
Commonly used in image segmentation tasks. PyTorch implementation.
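A minimal sketch of a soft Dice loss for binary segmentation, assuming the predictions have already been passed through a sigmoid (the function name and smoothing constant are illustrative choices):

```python
import torch

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - 2|X∩Y| / (|X| + |Y|), computed on per-pixel probabilities."""
    pred = pred.reshape(pred.size(0), -1)        # (N, H*W), values in [0, 1]
    target = target.reshape(target.size(0), -1)  # (N, H*W), binary masks
    intersection = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    dice = (2 * intersection + smooth) / (union + smooth)
    return 1 - dice.mean()
```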
Perceptual Loss¶
Perceptual loss: the features extracted by a convolutional neural network can be used as part of the objective function. By comparing the CNN features of the image being generated with the CNN features of the target image, the generated image is made more semantically similar to the target image (as opposed to pixel-level loss functions).
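A minimal sketch of such a feature-reconstruction loss, assuming a recent torchvision with a pretrained VGG16 as the feature extractor; the class name and the cut-off layer index are arbitrary illustrative choices, not the setup of any particular paper:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """Compare CNN features of generated and target images instead of raw pixels."""
    def __init__(self, layer_index=16):
        super().__init__()
        # Frozen VGG16 feature extractor, cut at an (arbitrarily chosen) intermediate layer
        vgg = models.vgg16(weights="IMAGENET1K_V1").features
        self.extractor = nn.Sequential(*list(vgg.children())[:layer_index]).eval()
        for p in self.extractor.parameters():
            p.requires_grad = False

    def forward(self, generated, target):
        return F.mse_loss(self.extractor(generated), self.extractor(target))
```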
Focal Loss¶
See the ICCV focal loss paper, "Focal Loss for Dense Object Detection".
However, the PyTorch version of Detectron has not implemented it yet; the official Detectron is integrated into Caffe2. See the PyTorch implementation for reference.
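A rough sketch of a FocalLoss module consistent with the parameters described below (\alpha, \gamma, size_average), following the standard formulation $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$; this is illustrative, not the Detectron implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha, gamma=2.0, size_average=True):
        super().__init__()
        self.alpha = alpha              # per-class weights, shape (num_classes,)
        self.gamma = gamma
        self.size_average = size_average

    def forward(self, logits, target):
        log_p = F.log_softmax(logits, dim=1)                       # (N, C)
        log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log prob of the true class
        pt = log_pt.exp()
        alpha_t = self.alpha.to(logits.device)[target]             # class-dependent weight
        loss = -alpha_t * (1 - pt) ** self.gamma * log_pt          # down-weight easy examples
        return loss.mean() if self.size_average else loss.sum()
```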
- \alpha (1D Tensor, Variable): the scalar factor for this criterion
- \gamma (float, double): \gamma > 0; reduces the relative loss for well-classified examples (p > 0.5), putting more focus on hard, misclassified examples
- size_average (bool): by default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch.
Huber Loss¶
$$ \begin{aligned} \text{loss}(x, y) &= \frac{1}{n} \sum_{i} z_{i} \\ z_{i} &= \begin{cases} 0.5 (x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise} \end{cases} \end{aligned} $$
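This piecewise form matches PyTorch's nn.SmoothL1Loss (Huber loss with the threshold at 1); a quick check of the formula against the built-in, on arbitrary values:

```python
import torch
import torch.nn as nn

x = torch.tensor([0.2, 1.5, -3.0])
y = torch.tensor([0.0, 0.0,  0.0])

diff = (x - y).abs()
z = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)   # piecewise definition above
manual = z.mean()

builtin = nn.SmoothL1Loss()(x, y)    # averaged over elements by default
print(manual.item(), builtin.item())  # both ~1.1733
```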
機器/深度學習: 損失函數(loss function)- Huber Loss和 Focal loss
Inbox¶
- Efficient Optimization for Rank-based Loss
- FID: essentially computes similarity from the distribution of the last-layer output embeddings of a trained feature extractor, using the 2-Wasserstein distance (a small sketch follows after this list)
- Wasserstein distance / EMD: a distance metric defined between probability distributions. It measures an "earth-moving" cost computed via calculus/optimization, and originates from optimal transport planning of goods and materials (perhaps from the Soviet planned-economy era?)
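A sketch of the FID computation between two sets of embeddings, using the closed-form 2-Wasserstein distance between the fitted Gaussians; the random arrays stand in for real feature-extractor (Inception) embeddings:

```python
import numpy as np
from scipy import linalg

def fid(emb_a, emb_b):
    """Frechet Inception Distance between two (N, D) embedding sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real   # drop tiny imaginary parts from sqrtm
    return ((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean)

# Hypothetical embeddings; in practice these come from a pretrained feature extractor
a = np.random.randn(1000, 64)
b = np.random.randn(1000, 64) + 0.5
print(fid(a, b))
```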