| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Word2vec | 2/4 | https://en.wikipedia.org/wiki/Word2vec | reference | science, encyclopedia | 2026-05-05T10:14:12.458984+00:00 | kb-cron |
The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word $w_i$ in the corpus, the one-hot encodings of its neighboring words are used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of $w_i$ itself. The objective of training is to maximize
$$\sum_{i} \ln \Pr(w_{i} \mid w_{i+j} \colon j \in N)$$
where $N$ is a set of (non-zero) indices representing the relative locations of nearby words considered to be in $w_i$'s neighborhood. For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words, the set of relative indices of neighbor words is $N = \{-2, -1, +1, +2\}$, and the objective is to maximize
$$\sum_{i} \ln \Pr(w_{i} \mid w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2})$$
In standard bag-of-words, a word's context is represented by a word count (also known as a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood $N = \{-2, -1, +1, +2\}$.
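The histogram in this example can be reproduced with a short Python sketch. The function name `context_histogram` and the choice to skip offsets that fall outside the sentence are assumptions made for illustration; they are not part of the description above.

```python
from collections import Counter

def context_histogram(tokens, i, offsets=(-2, -1, 1, 2)):
    """Count the words found at positions i + j, for each relative offset j in N."""
    counts = Counter()
    for j in offsets:
        k = i + j
        if 0 <= k < len(tokens):  # assumption: ignore offsets outside the sentence
            counts[tokens[k]] += 1
    return counts

tokens = "the cat sat on the mat".split()
hist = context_histogram(tokens, tokens.index("sat"))
print(dict(hist))  # {'the': 2, 'cat': 1, 'on': 1}
```

The word "mat" sits at offset +3 from "sat", so it never enters the count.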
In continuous bag-of-words, the histogram is multiplied by a matrix $V$ to obtain a continuous representation of the word's context. The matrix $V$ is also called a dictionary. Its columns are the word vectors. It has $D$ columns, where $D$ is the size of the dictionary. Let $d$ be the length of each word vector. We have $V \in \mathbb{R}^{d \times D}$. For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with $V$, we obtain $2v_{\text{the}} + v_{\text{cat}} + v_{\text{on}}$.

This is then multiplied with another matrix $V' \in \mathbb{R}^{D \times d}$. Each row of it is a word vector $v'$. This results in a vector of length $D$, one entry per dictionary entry. Applying the softmax to this vector then gives a probability distribution over the dictionary.

This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. In full formula, the cross-entropy loss is:
$$-\sum_{i} \ln \frac{e^{v_{w_{i}}' \cdot \left(\sum_{j \in N} v_{w_{j+i}}\right)}}{\sum_{w'} e^{v_{w'}' \cdot \left(\sum_{j \in N} v_{w_{j+i}}\right)}}$$
where the outer summation $\sum_{i}$ is over the words in a corpus, the quantity $\sum_{j \in N} v_{w_{j+i}}$ is the sum of the vectors of a word's neighbors, and the summation $\sum_{w'}$ in the denominator is over all words in the dictionary. Once such a system is trained, we have two trained matrices $V, V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of $V$ or the "sat"-th row of $V'$. It is also possible to simply define $V' = V^{\top}$, in which case there would no longer be a choice.
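As a rough illustration of the CBOW pipeline (histogram, multiplication by the two matrices, softmax, cross-entropy), the following pure-Python sketch computes the prediction and one loss term for the position of "sat". The toy vocabulary, the vector length `d = 3`, the random untrained weights, and all helper names are assumptions chosen for illustration; no training step is shown.

```python
import math
import random

vocab = ["the", "cat", "sat", "on", "mat"]   # toy dictionary (assumption)
index = {w: k for k, w in enumerate(vocab)}
D, d = len(vocab), 3                          # dictionary size and vector length (assumption)

rng = random.Random(0)                        # arbitrary seed; weights are untrained
V = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(d)]        # d x D, columns are word vectors
V_prime = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(D)]  # D x d, rows are word vectors

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def context_vector(histogram):
    """Histogram times V: the count-weighted sum of the neighbors' vectors."""
    x = [0.0] * D
    for word, count in histogram.items():
        x[index[word]] = float(count)
    return matvec(V, x)

def cbow_probs(histogram):
    """Linear (V), linear (V'), then softmax over the dictionary."""
    scores = matvec(V_prime, context_vector(histogram))
    top = max(scores)                         # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

hist = {"the": 2, "cat": 1, "on": 1}          # context of "sat" in "the cat sat on the mat"
probs = cbow_probs(hist)
loss = -math.log(probs[index["sat"]])         # one term of the cross-entropy sum
print(probs, loss)
```

Summing this loss term over every position in the corpus gives the full cross-entropy loss; gradient descent on `V` and `V_prime` would then reduce it.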
=== Skip-gram ===
The idea of skip-gram is to represent each word with a vector, such that it is possible to predict a word's neighbors using the vector of that word. The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word $w_i$ in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of $w_i$. The objective of training is to maximize
$$\sum_{i} \sum_{j \in N} \ln \Pr(w_{j+i} \mid w_{i})$$
In full formula, the loss function is
$$-\sum_{i} \sum_{j \in N} \ln \frac{e^{v_{w_{j+i}}' \cdot v_{w_{i}}}}{\sum_{w'} e^{v_{w'}' \cdot v_{w_{i}}}}$$
As with CBOW, once such a system is trained, we have two trained matrices $V, V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. It is also possible to simply define $V' = V^{\top}$, in which case there would no longer be a choice. Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training.
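To make the skip-gram loss concrete, here is a minimal pure-Python sketch that evaluates the inner sum over $j \in N$ for a single position $i$. The function name `skipgram_loss`, the all-zero placeholder weights, the vector length `d = 3`, and the decision to skip neighbors that fall outside the sentence are assumptions for illustration only.

```python
import math

def skipgram_loss(tokens, i, V, V_prime, index, offsets=(-2, -1, 1, 2)):
    """Sum over j in N of -ln Pr(w_{j+i} | w_i) for a single position i."""
    center = index[tokens[i]]
    v_center = [row[center] for row in V]                 # column of V for w_i
    scores = [sum(a * b for a, b in zip(row, v_center))   # v'_w . v_{w_i} for every word w
              for row in V_prime]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))  # ln of the softmax denominator
    loss = 0.0
    for j in offsets:
        k = i + j
        if 0 <= k < len(tokens):                          # assumption: skip out-of-range neighbors
            loss += log_z - scores[index[tokens[k]]]
    return loss

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
index = {w: k for k, w in enumerate(vocab)}
D, d = len(vocab), 3
V = [[0.0] * D for _ in range(d)]          # d x D, all zeros (untrained placeholder)
V_prime = [[0.0] * d for _ in range(D)]    # D x d, all zeros (untrained placeholder)
loss = skipgram_loss(tokens, tokens.index("sat"), V, V_prime, index)
print(loss)
```

With all-zero weights the softmax is uniform over the $D = 5$ dictionary entries, so each of the four neighbors of "sat" contributes $\ln 5$ and the printed value is $4 \ln 5$. The matrices have the same shapes as in CBOW; only the quantity being summed differs.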