## Chapter 2: Unconstrained Optimization (Line-Search Framework)

### 1. Gradient Method (Steepest Descent)

1. Move along the negative gradient direction; "steepest" refers to the locally fastest decrease, not to fast overall convergence.

2. A one-dimensional line search finds $\alpha_k$ such that $f(x_k+\alpha_k d_k)=\min_{\alpha\geq 0} f(x_k+\alpha d_k)$, with $d_k=-\nabla f(x_k)$.

3. Steepest descent suits the early iterations of a search, or serves as an interleaved step; near the minimizer, switch to a faster-converging algorithm.

### 2. Conjugate Gradient Method

• Linear transformation
• Eigenvector: the direction the transformation preserves
• Eigenvalue: the corresponding scale factor

• Step 1: rotate (into the eigenbasis)
• Step 2: apply a diagonal matrix, i.e., scale along each axis
• Step 3: rotate back
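The rotate–scale–rotate picture can be checked numerically; a minimal NumPy sketch (the matrix `A` is an illustrative example, not from the notes):

```python
import numpy as np

# A symmetric matrix acts as: rotate -> scale -> rotate back.
# For symmetric A, the eigendecomposition A = Q @ diag(lam) @ Q.T
# makes the three steps explicit (Q orthogonal, lam the eigenvalues).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, Q = np.linalg.eigh(A)       # eigenvalues and orthonormal eigenvectors

x = np.array([1.0, 0.0])
# Step 1: rotate into the eigenbasis; Step 2: scale by the eigenvalues;
# Step 3: rotate back.
y = Q @ (lam * (Q.T @ x))
assert np.allclose(y, A @ x)     # same result as applying A directly
```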

#### What Is Conjugacy

• $G$: an $n \times n$ symmetric positive-definite matrix
• $\{p_0,p_1,\dots,p_k\}$: a set of nonzero vectors, called $G$-conjugate if $p_i^\top G p_j=0$ for all $i\neq j$

• A conjugate-direction method is any algorithm that uses conjugate vectors as the update directions.
• The conjugate gradient algorithm builds the conjugate direction set from the gradients produced during the iteration.
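These two bullets can be sketched for a quadratic objective $\frac12 x^\top G x-b^\top x$ (a minimal illustration; the function name and test data are assumptions, not from the notes):

```python
import numpy as np

def conjugate_gradient(G, b, x0, tol=1e-10):
    """Minimize 0.5*x^T G x - b^T x (G symmetric positive definite)
    by building G-conjugate directions from the gradients."""
    x = x0.astype(float)
    r = G @ x - b              # gradient of the quadratic
    p = -r                     # first direction: steepest descent
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (p @ G @ p)     # exact line search
        x = x + alpha * p
        r_new = r + alpha * (G @ p)
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves coefficient
        p = -r_new + beta * p             # new direction, G-conjugate to p
        r = r_new
    return x

G = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(G, b, np.zeros(2))
assert np.allclose(G @ x, b)   # the minimizer solves Gx = b
```

On an $n$-dimensional quadratic the loop terminates in at most $n$ steps, which is the finite-termination property referred to later in the notes.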

### 3. Newton Method

• On a genuinely quadratic surface it jumps to the minimum in one step.
• Unfortunately, with only a million weights, the curvature matrix has a trillion terms and it is totally infeasible to invert it.

#### Curvature Matrices: H(w)

• Each element in the curvature matrix specifies how the gradient in one direction changes as we move in some other direction.
• The off-diagonal terms correspond to twists in the error surface.
• The reason steepest descent goes wrong is that the gradient for one weight gets messed up by the simultaneous changes to all the other weights.
• The curvature matrix determines the sizes of these interactions.

The genuinely quadratic surface is the quadratic approximation to the true surface.

Here $g(x)$ is the derivative of the quadratic $f(x)$; finding the minimizer of $f(x)$ amounts to finding a zero of $g(x)$, which gives the Newton iteration $x_{k+1}=x_k-\frac{g(x_k)}{g'(x_k)}$.

#### Generalization to Higher Dimensions

Hessian matrix $\mathbf H=\nabla^2 f(\mathbf x)$; the Newton step becomes $\mathbf x_{k+1}=\mathbf x_k-\mathbf H^{-1}\nabla f(\mathbf x_k)$.

• Quadratic (second-order) convergence near the minimizer
• May also fail to converge (e.g., when started far from the minimizer)

#### Drawbacks

• Requires the Hessian matrix to be invertible
• Requires second derivatives and a matrix inversion, so computation and storage costs are high
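A minimal Newton-iteration sketch that sidesteps the explicit inverse by solving $\mathbf H d=-\nabla f$ instead (illustrative code; the quadratic test problem is hypothetical):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method: solve H d = -grad rather than forming H^{-1}."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)  # Newton direction; fails if H is singular
        x = x + d
    return x

# f(x) = 0.5 x^T A x - b^T x: gradient A x - b, constant Hessian A,
# so Newton reaches the exact minimum in one step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
assert np.allclose(A @ x, b)
```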

### 4. Quasi-Newton Method

In the Hessian-free method, we make an approximation to the curvature matrix and then, assuming that approximation is correct, we minimize the error using an efficient technique called conjugate gradient. Then we make another approximation to the curvature matrix and minimize again.

• For RNNs it's important to add a penalty for changing any of the hidden activities too much.

#### BFGS

(Broyden–Fletcher–Goldfarb–Shanno algorithm) improves the approximation to the Hessian at each step.
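The BFGS inverse-Hessian update can be sketched as follows (standard update formula; the variable names and test data are illustrative):

```python
import numpy as np

def bfgs_update(H_inv, s, y):
    """One BFGS update of the inverse-Hessian approximation H_inv,
    given step s = x_{k+1} - x_k and gradient change y = g_{k+1} - g_k."""
    rho = 1.0 / (y @ s)            # requires the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H_inv @ V.T + rho * np.outer(s, s)

# On a quadratic with Hessian A we have y = A s, and the updated
# approximation satisfies the secant condition H_inv @ y = s.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
H_inv = np.eye(2)
s = np.array([1.0, 0.0])
y = A @ s
H_inv = bfgs_update(H_inv, s, y)
assert np.allclose(H_inv @ y, s)
```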

L-BFGS:

• Use a sequence of steps each of which finds the minimum along one direction.
• Make sure that each direction is "conjugate" to the previous directions so you do not mess up the minimization you already did.

• Conjugate: when you go in the new direction, you do not change the gradients in the previous directions.
• After N steps, conjugate gradient is guaranteed to find the minimum of an N-dimensional quadratic surface.

• Conjugate gradient can be applied to a non-quadratic error surface and it usually works quite well.

• The HF optimizer uses CG for minimization on a genuinely quadratic surface where it excels.

### 5. Line Search

• Search direction $d_k$
• Step-size factor $\alpha_k$

#### Exact Line Search

• $A$: a positive semidefinite symmetric matrix
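For a quadratic objective the exact line search has a closed form; with $f(x)=\frac12 x^\top A x-b^\top x$ and $g_k=\nabla f(x_k)$, setting the directional derivative to zero gives (a standard derivation):

```latex
\varphi(\alpha)=f(x_k+\alpha d_k),\qquad
\varphi'(\alpha)=d_k^\top g_k+\alpha\, d_k^\top A d_k=0
\;\Longrightarrow\;
\alpha_k=-\frac{g_k^\top d_k}{d_k^\top A d_k}.
```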

## Chapter 3: Constrained Optimization

• Lagrangian: $L(x,\boldsymbol\lambda)=f(x)-\boldsymbol\lambda^\top \mathbf c(x)$
• $\mathbf c(x)=(c_1(x),\dots,c_n(x))^\top$
• $\boldsymbol\lambda=(\lambda_1,\dots,\lambda_n)^\top$
• $\nabla_xL(x,\boldsymbol\lambda)=\nabla f(x)-\nabla \mathbf c(x)^\top\boldsymbol\lambda=0$
• $\nabla_\lambda L(x,\boldsymbol\lambda)=-\mathbf c(x)=0$

$-1 < x_1 < 1$
$-1 < x_2 < 1$

### 1. Trust Region

• At the current point, approximate the objective with a quadratic model: $q_k(\cdot)\approx f(x)$
• Minimize $q_k$ and take its minimizer as the next trial point
• First fix the trust-region radius $\Delta_k$; different norms may be chosen

• Infinity norm: $\max_i{|x_i|}$
• Cauchy step: $s_k^c=-\tau_k \frac{\Delta_k}{\|g_k\|}g_k$

• $\tau_k=\begin{cases} 1, &g_k^\top B_kg_k\leq0\cr \min\left\{\frac{\|g_k\|^3}{\Delta_k\, g_k^\top B_kg_k},\,1\right\}, &\text{otherwise} \end{cases}$
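The Cauchy-point formulas above translate directly into code (a minimal sketch; `cauchy_point` and its test values are illustrative):

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy step s^c = -tau * (delta/||g||) * g for the trust-region
    subproblem  min g^T s + 0.5 s^T B s  subject to  ||s|| <= delta."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0   # model not convex along -g: go all the way to the boundary
    else:
        tau = min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

g = np.array([1.0, 0.0])
B = np.eye(2)
s = cauchy_point(g, B, delta=0.1)       # boundary-limited steepest-descent step
assert np.linalg.norm(s) <= 0.1 + 1e-12
```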

#### Dogleg Method

$\Delta_k=\frac 12$; at step $k$, the subproblem is solved with the double-dogleg method.

### 2. Active-Set Method

• $y\in \{y\mid\nabla c_i(x^*)^\top y=0,\forall i\}$ lies in the hyperplane $c(x)=0$; on it the problem is equivalent to the unconstrained case $y\in \mathbb R^n$.
• Use an index set to record which constraints have reached their boundary.
• Active set at $x^*$: $I(x^*)=\{i\mid c_i(x^*)=0\}$; there exist $\lambda_0,\dots,\lambda_m\geq0$ s.t. $\lambda_0\nabla f(x^*)-\sum_{i=1}^m \lambda_i \nabla c_i(x^*)=0$
• If the $\nabla c_i(x^*)\ (i\in I(x^*))$ are linearly independent, then there exists a vector $\boldsymbol\lambda^*$ such that $\nabla f(x^*)-\sum_{i=1}^m \lambda_i^* \nabla c_i(x^*)=0$, with $\lambda_i^*\geq0$ (dual variables) and $\lambda_i^*\, c_i(x^*)=0$ (complementary slackness)
• Constrained problem $\rightarrow$ solving a system of equations
• Set of descent directions of the objective: $F=\{d\mid\nabla f(x)^\top d<0\}$
• Feasible directions: $D=\{d\mid\nabla c_i(x)^\top d\geq0,\ i\in I(x)\}$

Gordan's theorem: $\mathbf A\mathbf x>0$ has a solution $\Longleftrightarrow$ there is no nonzero $\mathbf y\geq0$ such that $\mathbf A^\top \mathbf y=0$.

$d_1=\frac 12,\ d_2=\frac 94,\ \lambda_1=\frac 34>0$, i.e. $\boldsymbol\lambda=(\frac 34,0,0)^\top$, and $\mathbf x^{(2)}=(\frac 12,\frac 94)^\top$ is the optimal solution.

### 3. Direct Methods

1. $x_1\in \mathbb R^n$; $d_1,\dots,d_n$ linearly independent
2. Line search from $x_1$ along $d_1$ gives $x_2$; i.e., $x_2$ is the minimizer of $f(x)$ on $\{z\mid z=x_1+\alpha_1 d_1\}$
3. $\beta_1,\dots,\beta_n>0$, $\nu^{(1)}=x_2+\beta_2d_2$; let $d_2=\nu^{(2)}-x_2$, where $\nu^{(2)}$ is the minimizer of $f(x)$ on $\{z\mid z=\nu^{(1)}+\alpha_1 d_1\}$
• $\nabla f(\nu^{(1)})^\top d_1=0,\ \nabla f(x_2)^\top d_1=0$
• $d_2^\top Hd_1=0$ (conjugacy, with $d_2=\nu^{(2)}-x_2$)
4. Let $d_3=\nu^{(3)}-x_3$, where $\nu^{(3)}$ is the minimizer of $f(x)$ on $\{z\mid z=x_1+\alpha_1 d_1+\alpha_2 d_2\}$
5. $x_{k+1}$ is the minimizer of $f(x)$ on $\{z\mid z=x_1+\sum_{i=1}^k\alpha_i d_i\}$; let $d_{k+1}=\nu^{(k+1)}-x_{k+1}$
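The steps above can be sketched as a Powell-style direction-set loop (illustrative code only: the crude grid line search and the test function are assumptions, not from the notes):

```python
import numpy as np

def line_min(f, x, d, lo=-10.0, hi=10.0, n=2001):
    """Crude 'exact' line search: sample f(x + a*d) on a grid (illustrative)."""
    alphas = np.linspace(lo, hi, n)
    vals = [f(x + a * d) for a in alphas]
    return x + alphas[int(np.argmin(vals))] * d

def powell_sketch(f, x1, n_iter=3):
    """Direction-set method: line-search along each direction, then replace
    the oldest direction with the overall displacement, which becomes
    conjugate to the others on a quadratic."""
    n = len(x1)
    D = [np.eye(n)[i] for i in range(n)]   # d_1..d_n: linearly independent
    x = x1.astype(float)
    for _ in range(n_iter):
        x_start = x
        for i in range(n):
            x = line_min(f, x, D[i])
        new_d = x - x_start                # d_{k+1} = nu^{(k+1)} - x_{k+1}
        if np.linalg.norm(new_d) > 1e-12:
            D = D[1:] + [new_d / np.linalg.norm(new_d)]
            x = line_min(f, x, D[-1])
    return x

# Hypothetical quadratic test function with minimizer (10/7, -6/7).
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 0.5) ** 2 + x[0] * x[1]
x = powell_sketch(f, np.zeros(2))
```

Note that the method uses only function values, which is what makes it a "direct" method; the gradient conditions in the bullets above are only used in the analysis.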


## Miscellaneous

### KKT Conditions

In mathematical optimization, the Karush–Kuhn–Tucker (KKT) conditions (also known as the Kuhn–Tucker conditions) are first order necessary conditions for a solution in nonlinear programming to be optimal, provided that some regularity conditions are satisfied. Allowing inequality constraints, the KKT approach to nonlinear programming generalizes the method of Lagrange multipliers, which allows only equality constraints. The system of equations corresponding to the KKT conditions is usually not solved directly, except in the few special cases where a closed-form solution can be derived analytically. In general, many optimization algorithms can be interpreted as methods for numerically solving the KKT system of equations.
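For the inequality-constrained problem $\min f(x)$ s.t. $c_i(x)\geq0,\ i=1,\dots,m$ (the sign convention used in the active-set section above), the KKT conditions read:

```latex
\nabla f(x^*)-\sum_{i=1}^{m}\lambda_i^*\nabla c_i(x^*)=0
  \quad\text{(stationarity)}\\
c_i(x^*)\geq 0 \quad\text{(primal feasibility)}\\
\lambda_i^*\geq 0 \quad\text{(dual feasibility)}\\
\lambda_i^*\,c_i(x^*)=0 \quad\text{(complementary slackness)}
```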

## Programming

Anaconda

footnote 1. "2.7 Mathematical Optimization: Finding Optima of Functions"