# How to learn word2vec

### 1. NNLM

Bengio et al. \cite{nnlm} first proposed a Neural Network Language Model (NNLM) that simultaneously learns a word embedding and a language model.The language model utilizes several previous words to predict the distribution of the next word.For each sample in the corpus ,we maximize the log-likelihood of the probability of the last word given the previous words.This model uses a concatenation of the previous words’ embeddings as the input.The model structure is a feed-forward neural network with one hidden layer.

### 2. LBL

The Log-Bilinear Language Model(LBL) proposed by Mnih and Hinton combines Bengio’s Hierachical NNLM and Log Bi-Linear.It uses a log-bilinear energy function that is almost equal to that of the NNLM and removes the non-linear activation function tanh.

A previous study \cite{lbl} proposed a widely used model architecture for estimating neural network language model.

## Multi-Layer Neural Network

A. 3-layer network: Input Layer,Hidden Lyer,Output layer. Except input units,each unit has a bias.

### preassumption calculation

$\begin{equation} net_{j} = \sum_{i=1}^{d}x_{i}w_{ji}+w_{j0}=\sum_{i=0}^{d}x_{i}w_{ji}=w_{j}^{t}x \end{equation}$

Specifically, a signal $$x_{i}$$ at the input of synapse $$i$$ connected to nueron $$j$$ us multiplied by the synaptic weight $$w_{ji}$$. $$i$$ refers input layer,$$j$$ refers hidden layer.$$w_{j0}$$ is the bias.$$x_{0}=+1$$.

• Each neuron is represented by a set of linear synaptic links, an externally applied bias, and a possibly nonlinear activation link.The bias is represented by a synaptic link connected to an input fixed at $$+1$$.
• The synaptic links of a neuron weight their respective input signals.
• The weighted sum of the input signals defines the induced local field of the neuron in question.
• The activation link squashes the induced local field of the neuron to produce an output. Output layer:
$\begin{equation} y_{j}=f(net_{j}) \end{equation}$

$$f()$$ is the \emph{activation function}.It defines the output of a neuron in terms of the induced local field $$net$$ .

$\xymatrix { x_{0}=+1 \ar[ddr]|(0.6){w_{j0}} & &\ x_{1} \ar[r]|(0.6){w_{j1}} & B & C\ x_{2} \ar[r]^(0.6){w_{j2}} & net_{j} \ar[r]^(0.6){f()} & y_{j} \ x }$

For example: $$\begin{equation} net_{k}=\sum_{j=1}^{n_{H}}y_{i}w_{kj}+w_{k0}=\sum_{j=0}^{n_{H}}x_{i}w_{ji}=w_{k}^{t}y \end{equation}$$ $$n_{H}$$is the number of hidden layers.

So: $$\begin{equation} g_{k}(x)=f(\sum_{j=1}^{n_{H}}w_{kj}+f(\sum_{i=0}^{d}x_{i}w_{ji}+w_{j0})+w_{k0}) \end{equation}$$ The activate function of output layer can be different from hidden layer while each unit can have different activate function.

%% BP Algorithm %%

### BP Algorithm

The popularity of on-line learning for the supervised training of multilayer perceptrons has been further enhanced by the development of the back-propagation algorithm. Backpropagation, an abbreviation for “backward propagation of errors”,is the easiest way of supervised training.We need to generate output activations of each hidden layer. The partial derivative $\partial J /\partial w_{ji}$ represents a sensitivity factor, determining the direction of search in weight space for the synaptic weight $w_{ji}$. Learning: $$\begin{gather} \mathcal T =\{ x(n),d(n)\}_{n=1}^{N}\\ e_{j}(n)=d_{j}(n)-y_{j}(n) \end{gather}$$ the instantaneous error energy of neuron $$j$$ is defined by $$\begin{gather} J(w)=\frac 12 \sum_{k=1}^{c}(e_{k})^{2}=\frac 12||t-\delta||^{2} \\ \end{gather}$$ In the batch method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed \emph{after} the presentation of all the $$N$$ examples in the training sample $$\mathcal T$$ that constitute one \emph{epoch} of training. In other words, the cost function for batch learning is defined by the average error energy $$J(w)$$.

• firstly define the training bias of output layer: $$\begin{gather} \Delta w=-\eta\frac {\partial J(w)}{\partial w} \\ w(m+1)=w(m)+/Delta w(m) \end{gather}$$ $$\begin{gather} \frac {\partial J}{\partial w_{kj}}=\frac {\partial J}{\partial net_{k}}\frac {\partial net_{k}}{\partial w_{kj}} \\ \frac {\partial J}{\partial net_{k}}= \frac {\partial J}{\partial \delta _{k}}\frac {\partial \delta _{k}}{\partial J}=-(t_{k}-\delta _{k})f'(net_{k}) \\ \Delta w_{kj}=\eta \frac {\partial J}{\partial net_{k}}=\eta (t_{k}-\delta _{k}))f'(net_{k})y_{j} \end{gather}$$
• input->hidden