How to learn word2vec

Choose your language :

1. NNLM

这里看的是Bengio的论文

Bengio et al. \cite{nnlm} first proposed a Neural Network Language Model (NNLM) that simultaneously learns a word embedding and a language model.The language model utilizes several previous words to predict the distribution of the next word.For each sample in the corpus ,we maximize the log-likelihood of the probability of the last word given the previous words.This model uses a concatenation of the previous words’ embeddings as the input.The model structure is a feed-forward neural network with one hidden layer.

2. LBL

The Log-Bilinear Language Model(LBL) proposed by Mnih and Hinton combines Bengio’s Hierachical NNLM and Log Bi-Linear.It uses a log-bilinear energy function that is almost equal to that of the NNLM and removes the non-linear activation function tanh.

A previous study \cite{lbl} proposed a widely used model architecture for estimating neural network language model.

NER

Multi-Layer Neural Network

A. 3-layer network: Input Layer,Hidden Lyer,Output layer. Except input units,each unit has a bias.

preassumption calculation

\[\begin{equation} net_{j} = \sum_{i=1}^{d}x_{i}w_{ji}+w_{j0}=\sum_{i=0}^{d}x_{i}w_{ji}=w_{j}^{t}x \end{equation}\]

Specifically, a signal $x_{i}$ at the input of synapse $i$ connected to nueron $j$ us multiplied by the synaptic weight $w_{ji}$. $i$ refers input layer,$j$ refers hidden layer.$w_{j0}$ is the bias.$x_{0}=+1$.

Each neuron is represented by a set of linear synaptic links, an externally applied bias, and a possibly nonlinear activation link.The bias is represented by a synaptic link connected to an input fixed at $+1$.
The synaptic links of a neuron weight their respective input signals.
The weighted sum of the input signals defines the induced local field of the neuron in question.
The activation link squashes the induced local field of the neuron to produce an output. Output layer:

\[\begin{equation} y_{j}=f(net_{j}) \end{equation}\]

$f()$ is the \emph{activation function}.It defines the output of a neuron in terms of the induced local field $net$ .

\[\xymatrix { x_{0}=+1 \ar[ddr]|(0.6){w_{j0}} & &\ x_{1} \ar[r]|(0.6){w_{j1}} & B & C\ x_{2} \ar[r]^(0.6){w_{j2}} & net_{j} \ar[r]^(0.6){f()} & y_{j} \ x }\]

For example: $\begin{equation} net_{k}=\sum_{j=1}^{n_{H}}y_{i}w_{kj}+w_{k0}=\sum_{j=0}^{n_{H}}x_{i}w_{ji}=w_{k}^{t}y \end{equation}$ $n_{H}$is the number of hidden layers.

So: $\begin{equation} g_{k}(x)=f(\sum_{j=1}^{n_{H}}w_{kj}+f(\sum_{i=0}^{d}x_{i}w_{ji}+w_{j0})+w_{k0}) \end{equation}$ The activate function of output layer can be different from hidden layer while each unit can have different activate function.

%% BP Algorithm %%

BP Algorithm

The popularity of on-line learning for the supervised training of multilayer perceptrons has been further enhanced by the development of the back-propagation algorithm. Backpropagation, an abbreviation for “backward propagation of errors”,is the easiest way of supervised training.We need to generate output activations of each hidden layer. The partial derivative $\partial J /\partial w_{ji}$ represents a sensitivity factor, determining the direction of search in weight space for the synaptic weight $ w_{ji}$. Learning: $\begin{gather} \mathcal T =\{ x(n),d(n)\}_{n=1}^{N}\\ e_{j}(n)=d_{j}(n)-y_{j}(n) \end{gather}$ the instantaneous error energy of neuron $j$ is defined by $\begin{gather} J(w)=\frac 12 \sum_{k=1}^{c}(e_{k})^{2}=\frac 12||t-\delta||^{2} \\ \end{gather}$ In the batch method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed \emph{after} the presentation of all the $N$ examples in the training sample $\mathcal T$ that constitute one \emph{epoch} of training. In other words, the cost function for batch learning is defined by the average error energy $J(w)$.

firstly define the training bias of output layer: $\begin{gather} \Delta w=-\eta\frac {\partial J(w)}{\partial w} \\ w(m+1)=w(m)+/Delta w(m) \end{gather}$ $\begin{gather} \frac {\partial J}{\partial w_{kj}}=\frac {\partial J}{\partial net_{k}}\frac {\partial net_{k}}{\partial w_{kj}} \\ \frac {\partial J}{\partial net_{k}}= \frac {\partial J}{\partial \delta _{k}}\frac {\partial \delta _{k}}{\partial J}=-(t_{k}-\delta _{k})f'(net_{k}) \\ \Delta w_{kj}=\eta \frac {\partial J}{\partial net_{k}}=\eta (t_{k}-\delta _{k}))f'(net_{k})y_{j} \end{gather}$
input->hidden

本文由 Rowl1ng 创作，采用知识共享署名4.0 国际许可协议进行许可
本站文章除注明转载/出处外，均为本站原创或翻译，转载前请务必署名
本文会经常更新，最后编辑时间为:2015-12-25 11:12:00