# 1. Optimization Procedure

A good choice for the criterion is maximum likelihood regularized with dropout, possibly also with weight decay.

A good choice for the optimization algorithm for a feed-forward network is usually stochastic gradient descent with momentum.
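As a minimal sketch (the model, data, and hyperparameter values here are illustrative assumptions), this is what SGD with momentum plus weight decay looks like in PyTorch:

```python
import torch

# Toy linear model; architecture and hyperparameters are illustrative only.
model = torch.nn.Linear(4, 1)

# SGD with momentum, plus weight decay (L2 regularization) as mentioned above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # update parameters using the momentum buffer
```

The `momentum` and `weight_decay` arguments are built into `torch.optim.SGD`, so no extra code is needed for either.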

# 2. Loss Function and Conditional Log-Likelihood

In the 1980s and 1990s the most commonly used loss function was the squared error

$L(f_{\theta}(x), y) = \|f_{\theta}(x) - y\|^{2}$

If $f$ is unrestricted (non-parametric), the minimizer of the expected squared error is the conditional expectation:

$f^{*}(x) = E[y \mid x = x]$

Replacing the squared error by an absolute value makes the neural network estimate not the conditional expectation but the conditional median.
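The claims about the mean and the median can be checked numerically with a best-constant-predictor experiment (the sample and grid below are illustrative assumptions):

```python
import numpy as np

# With f unrestricted, the best constant prediction for a sample illustrates
# the claims above: the mean minimizes squared error, the median absolute error.
y = np.array([1.0, 2.0, 3.0, 4.0, 20.0])      # skewed sample with an outlier
candidates = np.linspace(0.0, 22.0, 2201)     # candidate predictions f(x) = c

sq_err = ((candidates[:, None] - y) ** 2).mean(axis=1)
abs_err = np.abs(candidates[:, None] - y).mean(axis=1)

best_sq = candidates[sq_err.argmin()]    # close to y.mean() == 6.0
best_abs = candidates[abs_err.argmin()]  # close to np.median(y) == 3.0
```

Note how the outlier at 20 drags the squared-error minimizer (the mean) upward, while the absolute-error minimizer (the median) is unaffected.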

For a binary target $y \in \{0, 1\}$, the cross-entropy loss is

$L(f_{\theta}(x), y) = -y \log f_{\theta}(x) - (1 - y) \log(1 - f_{\theta}(x))$

This requires $f_{\theta}(x)$ to be strictly between 0 and 1: use the sigmoid as the non-linearity for the output layer (it matches well with the binomial negative log-likelihood cost function).

The squared error is often halved ($\frac{1}{2}$) as a convenience for computing the gradient, since the derivative of the square function cancels out the $\frac{1}{2}$ term.
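The cancellation is immediate from the chain rule:

$\frac{\partial}{\partial \theta} \frac{1}{2} (f_{\theta}(x) - y)^{2} = (f_{\theta}(x) - y) \frac{\partial f_{\theta}(x)}{\partial \theta}$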

# 3. Learning a Conditional Probability Model

More generally, we can interpret the loss function as a conditional log-likelihood, i.e., the negative log-likelihood (NLL) cost function

$L_{NLL}(f_{\theta}(x), y) = -\log P(y = y \mid x = x; \theta)$

Example: if y is a continuous random variable and we assume that, given $x$, it has a Gaussian distribution with mean ${f}_{\theta}(x)$ and variance ${\sigma}^{2}$

$-\log P(y \mid x; \theta) = \frac{1}{2\sigma^{2}} (f_{\theta}(x) - y)^{2} + \frac{1}{2} \log(2\pi\sigma^{2})$

Minimizing this negative log-likelihood is therefore equivalent to minimizing the squared error loss, up to terms that do not depend on $\theta$.
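A small numeric check of this equivalence (the value of $\sigma$, the target, and the prediction grid are illustrative assumptions): with $\sigma$ fixed, the Gaussian NLL is an affine function of the squared error, so both are minimized by the same prediction.

```python
import numpy as np

sigma = 1.5                              # assumed fixed variance
y = 2.0                                  # observed target
preds = np.linspace(-5.0, 5.0, 1001)     # candidate predictions f(x)

sq_err = (preds - y) ** 2
# Gaussian NLL: (1 / 2 sigma^2) * squared error + constant in the prediction
nll = 0.5 * sq_err / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

best_nll = preds[nll.argmin()]   # same minimizer as the squared error
best_sq = preds[sq_err.argmin()]
```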

For discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution (also known as cross entropy) with probability $p = {f}_{\theta}(x)$ of generating $y = 1$ given $x = x$

\begin{aligned} L_{NLL} = -\log P(y \mid x; \theta) &= -\mathbf{1}_{y=1} \log p - \mathbf{1}_{y=0} \log(1-p) \\ &= -y \log f_{\theta}(x) - (1-y) \log(1 - f_{\theta}(x)) \end{aligned}
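The equality between the indicator form and the $y$-weighted form can be verified numerically (the probability and label values below are arbitrary test points):

```python
import math

def nll_indicator(p, y):
    # -1_{y=1} log p - 1_{y=0} log(1 - p)
    return -math.log(p) if y == 1 else -math.log(1 - p)

def nll_weighted(p, y):
    # -y log p - (1 - y) log(1 - p)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# The two forms agree for every p in (0, 1) and y in {0, 1}
for p in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(nll_indicator(p, y) - nll_weighted(p, y)) < 1e-12
```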

## 3.2. Tukey's Loss

Tukey's biweight loss was proposed for robust deep regression in "Robust Optimization for Deep Regression". It is insensitive to outliers because residuals beyond a threshold all contribute the same constant penalty.
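A standard form of Tukey's biweight function, written with residual $r$ and tuning constant $c$ (the common choice $c \approx 4.685$ is an assumption here; in practice the residuals are first scaled by a robust estimate of their spread):

$\rho(r) = \begin{cases} \frac{c^{2}}{6} \left[ 1 - \left( 1 - (r/c)^{2} \right)^{3} \right], & \text{if } |r| \le c \\ \frac{c^{2}}{6}, & \text{otherwise} \end{cases}$

Residuals larger than $c$ all receive the same penalty $\frac{c^{2}}{6}$, so outliers cannot dominate the gradient.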

# 5. Focal Loss

$Loss(x, class) = - \alpha (1-softmax(x)_{[class]})^\gamma \log(softmax(x)_{[class]})$

```python
import torch
import torch.nn.functional as F

def focal_loss(inputs, targets, alpha=1.0, gamma=2.0):
    # inputs: (N, C) raw scores (logits); targets: (N,) integer class labels
    P = F.softmax(inputs, dim=1)                 # softmax(x)
    # select softmax(x)_[class] for each example
    probs = P.gather(1, targets.view(-1, 1))
    log_p = probs.log()
    # (1 - p)^gamma down-weights well-classified examples
    batch_loss = -alpha * torch.pow(1 - probs, gamma) * log_p
    return batch_loss.mean()
```

• $\alpha$ (1D Tensor, Variable): the scalar factor for this criterion
• $\gamma$ (float, double): $\gamma > 0$; reduces the relative loss for well-classified examples (p > .5), putting more focus on hard, misclassified examples
• size_average (bool): by default, the losses are averaged over observations for each minibatch. However, if the field size_average is set to False, the losses are instead summed for each minibatch.

## 5.1. Huber Loss

\begin{aligned} \text{loss}(x, y) = \frac{1}{n} \sum_{i} z_{i} \\ z_{i} = \begin{cases} 0.5 (x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise } \end{cases} \end{aligned}
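A minimal NumPy sketch of this piecewise loss (the sample values are illustrative; this matches the formula above with threshold 1, as in PyTorch's SmoothL1Loss):

```python
import numpy as np

def smooth_l1(x, y):
    # Element-wise Huber / smooth-L1 with threshold 1, as in the formula above:
    # quadratic for small residuals, linear for large ones.
    d = np.abs(x - y)
    z = np.where(d < 1, 0.5 * d**2, d - 0.5)
    return z.mean()

x = np.array([0.0, 0.0, 0.0])
y = np.array([0.5, 1.0, 3.0])   # small, boundary, and large residuals
# residual 0.5 -> 0.125 (quadratic); 1.0 -> 0.5; 3.0 -> 2.5 (linear)
loss = smooth_l1(x, y)          # (0.125 + 0.5 + 2.5) / 3
```

The linear tail is what makes the loss less sensitive to outliers than the squared error, while the quadratic region keeps it smooth near zero.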