How to learn word2vec

Bengio et al. \cite{nnlm} first proposed the Neural Network Language Model (NNLM), which simultaneously learns a word embedding and a language model. The language model uses the several preceding words to predict the distribution of the next word. For each sample in the corpus, we maximize the log-likelihood of the last word given the previous words. The model takes the concatenation of the previous words’ embeddings as its input, and its structure is a feed-forward neural network with one hidden layer.
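The forward computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `nnlm_next_word_probs`, the helper `random_matrix`, the toy sizes, and the random weights are all hypothetical, and any optional direct input-to-output connections are left out.

```python
import math
import random

def nnlm_next_word_probs(context_ids, C, H, b_h, U, b_o):
    """Forward pass of a one-hidden-layer feed-forward language model.

    context_ids: indices of the previous words
    C: embedding table, one row per vocabulary word
    H, b_h: hidden-layer weights and biases
    U, b_o: output-layer weights and biases
    """
    # Input: concatenation of the previous words' embeddings.
    x = [v for w in context_ids for v in C[w]]
    # Hidden layer with tanh activation.
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(H, b_h)]
    # One score per vocabulary word, then a numerically stable softmax.
    scores = [sum(wi * hi for wi, hi in zip(row, h)) + b
              for row, b in zip(U, b_o)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def random_matrix(rows, cols, rng):
    """Small random weights, used here only to make the sketch runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, emb_dim, n_ctx, n_hidden = 6, 3, 2, 4   # toy sizes, chosen arbitrarily
C = random_matrix(V, emb_dim, rng)
H = random_matrix(n_hidden, n_ctx * emb_dim, rng)
U = random_matrix(V, n_hidden, rng)
probs = nnlm_next_word_probs([1, 4], C, H, [0.0] * n_hidden, U, [0.0] * V)
# Training would maximize log(probs[target]) summed over all samples in the corpus.
log_likelihood = math.log(probs[2])
```

Here `probs` is a distribution over the toy vocabulary, and `log_likelihood` is the quantity that training would push upward for the sample whose true next word has index 2.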

2. LBL

The Log-Bilinear Language Model (LBL), proposed by Mnih and Hinton, combines ideas from Bengio’s hierarchical NNLM and the log-bilinear formulation. It uses a log-bilinear energy function that is nearly identical to that of the NNLM, except that it removes the non-linear tanh activation function.

A previous study \cite{lbl} proposed a widely used model architecture for estimating neural network language models.
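To make the role of the missing tanh concrete, the following is a minimal sketch of a log-bilinear prediction step, assuming the usual formulation in which the predicted embedding is a linear combination of the context embeddings and each word is scored by a dot product. The names `lbl_next_word_probs`, `R`, `C_pos`, and `b`, as well as the toy values, are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def lbl_next_word_probs(context_ids, R, C_pos, b):
    """R: word embeddings; C_pos: one square matrix per context position; b: per-word biases."""
    dim = len(R[0])
    # Predicted embedding: a purely linear combination of context embeddings (no tanh).
    r_hat = [0.0] * dim
    for pos, w in enumerate(context_ids):
        contrib = matvec(C_pos[pos], R[w])
        r_hat = [a + c for a, c in zip(r_hat, contrib)]
    # Score every word by its dot product with the prediction, then apply a softmax.
    scores = [sum(a * c for a, c in zip(r_hat, R[w])) + b[w] for w in range(len(R))]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # toy embeddings for a 3-word vocabulary
identity = [[1.0, 0.0], [0.0, 1.0]]
C_pos = [identity, identity]                   # toy position matrices
probs = lbl_next_word_probs([0, 1], R, C_pos, [0.0, 0.0, 0.0])
```

With these toy values the predicted embedding is the sum of the two context embeddings, so the third word, whose embedding points in the same direction, receives the highest probability.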


Multi-Layer Neural Network

A. 3-layer network: input layer, hidden layer, output layer. Every unit except the input units has a bias.

Net activation (pre-activation) calculation

\[\begin{equation} net_{j} = \sum_{i=1}^{d}x_{i}w_{ji}+w_{j0}=\sum_{i=0}^{d}x_{i}w_{ji}=w_{j}^{t}x \end{equation}\]

Specifically, a signal \(x_{i}\) at the input of synapse \(i\) connected to neuron \(j\) is multiplied by the synaptic weight \(w_{ji}\). Here \(i\) indexes units in the input layer and \(j\) indexes units in the hidden layer; \(w_{j0}\) is the bias, paired with the fixed input \(x_{0}=+1\).

\[\begin{equation} y_{j}=f(net_{j}) \end{equation}\]

\(f(\cdot)\) is the \emph{activation function}. It defines the output of a neuron in terms of its induced local field \(net_{j}\).

\[\xymatrix { x_{0}=+1 \ar[ddr]|(0.6){w_{j0}} & & \\ x_{1} \ar[r]|(0.6){w_{j1}} & B & C \\ x_{2} \ar[r]^(0.6){w_{j2}} & net_{j} \ar[r]^(0.6){f()} & y_{j} \\ x }\]

For example, for an output unit \(k\): \(\begin{equation} net_{k}=\sum_{j=1}^{n_{H}}y_{j}w_{kj}+w_{k0}=\sum_{j=0}^{n_{H}}y_{j}w_{kj}=w_{k}^{t}y \end{equation}\) where \(n_{H}\) is the number of hidden units and \(y_{0}=+1\).

So: \(\begin{equation} g_{k}(x)=f\left(\sum_{j=1}^{n_{H}}w_{kj}\,f\left(\sum_{i=1}^{d}x_{i}w_{ji}+w_{j0}\right)+w_{k0}\right) \end{equation}\) The activation function of the output layer can differ from that of the hidden layer, and in principle each unit can have its own activation function.
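The forward computation of the 3-layer network can be sketched as follows. This is a minimal illustration, assuming tanh for every unit and storing each bias in column 0 of its weight matrix (matching \(x_{0}=+1\)); the function name `forward` and the toy weights are hypothetical.

```python
import math

def forward(x, W_hidden, W_out, f=math.tanh):
    """Three-layer forward pass; column 0 of each weight matrix holds the bias."""
    x_aug = [1.0] + list(x)                      # x_0 = +1
    net_hidden = [sum(w * xi for w, xi in zip(row, x_aug)) for row in W_hidden]
    y = [f(n) for n in net_hidden]               # y_j = f(net_j)
    y_aug = [1.0] + y                            # y_0 = +1
    net_out = [sum(w * yj for w, yj in zip(row, y_aug)) for row in W_out]
    return [f(n) for n in net_out]               # g_k(x) = f(net_k)

# Two inputs, two hidden units, one output unit (toy weights).
W_hidden = [[0.0, 1.0, -1.0],
            [0.5, 0.0, 0.0]]
W_out = [[0.0, 1.0, 1.0]]
g = forward([2.0, 2.0], W_hidden, W_out)
```

For this input the first hidden unit has zero net activation and the second sees only its bias of 0.5, so the single output equals `tanh(tanh(0.5))`.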

BP Algorithm

The popularity of on-line learning for the supervised training of multilayer perceptrons has been further enhanced by the development of the back-propagation algorithm. Backpropagation, an abbreviation for “backward propagation of errors”, is a simple and widely used method for supervised training. A forward pass first generates the output activations of each layer; the errors at the output are then propagated backward to compute the gradient. The partial derivative \(\partial J /\partial w_{ji}\) represents a sensitivity factor, determining the direction of search in weight space for the synaptic weight \(w_{ji}\).

Learning: given the training sample \(\mathcal T\) and the error signal \(e_{j}(n)\) of output neuron \(j\) for the \(n\)-th example, \(\begin{gather} \mathcal T =\{ x(n),d(n)\}_{n=1}^{N}\\ e_{j}(n)=d_{j}(n)-y_{j}(n) \end{gather}\) the total instantaneous error energy, summed over the \(c\) output neurons, is defined by \(\begin{gather} J(w)=\frac 12 \sum_{k=1}^{c}(e_{k})^{2}=\frac 12\|d-y\|^{2} \end{gather}\)

In the batch method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed \emph{after} the presentation of all the \(N\) examples in the training sample \(\mathcal T\) that constitute one \emph{epoch} of training. In other words, the cost function for batch learning is the average of the error energy \(J(w)\) over the \(N\) examples.
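A minimal sketch of batch back-propagation for a small 3-layer network is given below. It assumes tanh units in both layers, biases stored in column 0 of each weight matrix, plain gradient descent with a fixed learning rate `eta`, and a toy OR-like data set with targets of \(\pm 0.8\); the function names, sizes, and hyperparameters are hypothetical choices made for illustration only.

```python
import math
import random

def forward(x, W1, W2):
    """Return the augmented input, augmented hidden outputs, and network outputs."""
    x_aug = [1.0] + list(x)
    y_aug = [1.0] + [math.tanh(sum(w * v for w, v in zip(row, x_aug))) for row in W1]
    z = [math.tanh(sum(w * v for w, v in zip(row, y_aug))) for row in W2]
    return x_aug, y_aug, z

def average_error(data, W1, W2):
    """Average over the training sample of J = 1/2 * sum_k e_k^2."""
    total = 0.0
    for x, d in data:
        _, _, z = forward(x, W1, W2)
        total += 0.5 * sum((dk - zk) ** 2 for dk, zk in zip(d, z))
    return total / len(data)

def batch_gradients(data, W1, W2):
    """Back-propagate the error of every example and average the gradients."""
    g1 = [[0.0] * len(row) for row in W1]
    g2 = [[0.0] * len(row) for row in W2]
    for x, d in data:
        x_aug, y_aug, z = forward(x, W1, W2)
        # Output layer: delta_k = e_k * f'(net_k), with f'(net) = 1 - tanh(net)^2.
        delta_out = [(dk - zk) * (1.0 - zk * zk) for dk, zk in zip(d, z)]
        # Hidden layer: errors flow backward through the output weights.
        delta_hid = [(1.0 - y_aug[j + 1] ** 2) *
                     sum(W2[k][j + 1] * delta_out[k] for k in range(len(W2)))
                     for j in range(len(W1))]
        for k, dk in enumerate(delta_out):
            for j, yj in enumerate(y_aug):
                g2[k][j] -= dk * yj / len(data)
        for j, dj in enumerate(delta_hid):
            for i, xi in enumerate(x_aug):
                g1[j][i] -= dj * xi / len(data)
    return g1, g2

def train(data, W1, W2, eta=0.2, epochs=500):
    """Batch learning: one weight update per epoch, after all N examples are seen."""
    for _ in range(epochs):
        g1, g2 = batch_gradients(data, W1, W2)
        for W, g in ((W1, g1), (W2, g2)):
            for r in range(len(W)):
                for c in range(len(W[r])):
                    W[r][c] -= eta * g[r][c]

rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]  # bias + 2 inputs -> 3 hidden
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(4)]]                    # bias + 3 hidden -> 1 output
data = [([0.0, 0.0], [-0.8]), ([0.0, 1.0], [0.8]),
        ([1.0, 0.0], [0.8]), ([1.0, 1.0], [0.8])]
error_before = average_error(data, W1, W2)
train(data, W1, W2)
error_after = average_error(data, W1, W2)
```

Because the gradients are accumulated over the whole sample before any weight changes, each pass of the loop in `train` corresponds to one epoch. An on-line variant would instead update the weights inside the loop over examples.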