# 1. Cost Functions

• A good choice for the criterion is maximum likelihood regularized with dropout, possibly also combined with weight decay.
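A minimal sketch of that combination in PyTorch (layer sizes, dropout rate, and weight-decay strength are illustrative assumptions, not values from the text): dropout is a module in the network, while weight decay is a parameter of the optimizer.

```python
import torch

# Maximum-likelihood training (cross-entropy) regularized with dropout
# and weight decay; all sizes and rates here are illustrative choices
model = torch.nn.Sequential(
    torch.nn.Linear(20, 64),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),        # dropout regularization
    torch.nn.Linear(64, 3),
)
criterion = torch.nn.CrossEntropyLoss()  # NLL of the softmax output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```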

## 1.2. Focal loss

```python
import torch
import torch.nn.functional as F

def focal_loss(inputs, targets, gamma=2):
    N = inputs.size(0)
    C = inputs.size(1)
    P = F.softmax(inputs, dim=1)  # softmax(x)

    # one-hot mask selecting each sample's target class
    ids = targets.view(-1, 1)
    class_mask = torch.zeros(N, C, device=inputs.device).scatter_(1, ids, 1.0)

    probs = (P * class_mask).sum(1).view(-1, 1)  # softmax(x)_class
    log_p = probs.log()

    # the (1 - p)^gamma factor down-weights easy, well-classified examples
    batch_loss = -torch.pow(1 - probs, gamma) * log_p

    loss = batch_loss.mean()
    return loss
```

• alpha (1D Tensor, Variable): the scalar weighting factor for this criterion
• gamma (float, double): the focusing parameter; reduces the relative loss for well-classified examples (p > .5), putting more focus on hard, misclassified examples
• size_average (bool): by default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch.

## 1.4. Tukey's Loss

Tukey's biweight function, used as a robust regression loss in the paper "Robust Optimization for Deep Regression".
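A minimal sketch of Tukey's biweight loss (the function name is mine; c = 4.685 is the commonly used tuning constant): residuals larger than c contribute a constant penalty, which caps the influence of outliers.

```python
import torch

def tukey_biweight_loss(residuals, c=4.685):
    # Tukey's biweight: rho(r) = c^2/6 * (1 - (1 - (r/c)^2)^3) for |r| <= c,
    # and a constant c^2/6 for |r| > c, so outliers have bounded influence
    r = residuals / c
    inside = (residuals.abs() <= c).float()
    rho = (c ** 2 / 6.0) * (1 - (1 - r ** 2) ** 3)
    capped = torch.full_like(residuals, c ** 2 / 6.0)
    return (inside * rho + (1 - inside) * capped).mean()
```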

# 2. Optimization Procedure

• A good choice of optimization algorithm for a feed-forward network is usually stochastic gradient descent with momentum.
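A sketch of one such training step (the network, learning rate, and momentum value are illustrative assumptions):

```python
import torch

# A small feed-forward net trained with SGD + momentum; hyperparameters
# here are common defaults, not prescribed by the text
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```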

# 3. Loss Function and Conditional Log-Likelihood

• In the 1980s and 1990s, the most commonly used loss function was the squared error.

• if f is unrestricted (non-parametric), the minimizer of the expected squared error is the conditional expectation, $f(x) = E[y \mid x]$

• Replacing the squared error by an absolute value makes the neural network estimate not the conditional expectation but the conditional median.
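A small numerical sketch of that claim (the data and learning rate are illustrative): minimizing the absolute error over a single constant prediction drives it toward the sample median, not the mean.

```python
import torch

# Skewed sample: median is 1.0, mean is 2.4
y = torch.tensor([0.0, 0.0, 1.0, 1.0, 10.0])

c = torch.tensor(0.0, requires_grad=True)  # constant prediction
opt = torch.optim.SGD([c], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = (y - c).abs().mean()  # L1 (absolute error) loss
    loss.backward()
    opt.step()
# c oscillates around the median (1.0), far from the mean (2.4)
```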

• Cross-entropy objective function: when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood have been found to be more appropriate than the squared error.

• To constrain the output strictly between 0 and 1, use the sigmoid as the non-linearity for the output layer (it matches well with the binomial negative log-likelihood cost function).

The mean is halved (a factor of 1/2) as a convenience for the computation of the gradient, as the derivative of the square function will cancel out the 1/2.
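Written out, the halved mean squared error and its gradient are:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(f_\theta(x^{(i)}) - y^{(i)}\big)^2,
\qquad
\frac{\partial J}{\partial \theta} = \frac{1}{m} \sum_{i=1}^{m} \big(f_\theta(x^{(i)}) - y^{(i)}\big)\,\frac{\partial f_\theta(x^{(i)})}{\partial \theta}
```

The factor of 2 from differentiating the square cancels the 1/2, leaving a clean 1/m average.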

##### Learning a Conditional Probability Model
• the loss function corresponds to a conditional log-likelihood, i.e., the negative log-likelihood (NLL) cost function

• example: if y is a continuous random variable and we assume that, given x, it has a Gaussian distribution with mean $f_\theta(x)$ and variance $\sigma^2$

• minimizing this negative log-likelihood is therefore equivalent to minimizing the squared error loss.
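Writing out the NLL under the Gaussian assumption of the example above makes this equivalence explicit:

```latex
-\log p(y \mid x) = \frac{\big(y - f_\theta(x)\big)^2}{2\sigma^2} + \log\sigma + \frac{1}{2}\log 2\pi
```

The terms involving $\sigma$ do not depend on $\theta$, so minimizing over $\theta$ reduces to minimizing the squared error.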

• for discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution (also known as cross entropy), with probability of generating y = 1 given x equal to $f_\theta(x)$
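As a sketch of this correspondence (the logits and targets below are illustrative, not from the text), the Bernoulli NLL written out by hand matches PyTorch's binary cross-entropy helper:

```python
import torch
import torch.nn.functional as F

# p plays the role of P(y = 1 | x), produced here by a sigmoid output layer
logits = torch.tensor([0.5, -1.2, 2.0])
targets = torch.tensor([1.0, 0.0, 1.0])

p = torch.sigmoid(logits)
# Bernoulli negative log-likelihood, averaged over the batch
nll = -(targets * p.log() + (1 - targets) * (1 - p).log()).mean()

# the same quantity via the library's cross-entropy helper
bce = F.binary_cross_entropy_with_logits(logits, targets)
```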