---
title: "Word2vec"
chunk: 2/4
source: "https://en.wikipedia.org/wiki/Word2vec"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T10:14:12.458984+00:00"
instance: "kb-cron"
---

The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word $w_i$ in the corpus, the one-hot encodings of its neighboring words are used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of $w_i$ itself. The objective of training is to maximize $\sum_{i} \ln \Pr(w_i \mid w_{i+j} \colon j \in N)$, where $N$ is a set of (non-zero) indices representing the relative locations of nearby words considered to be in $w_i$'s neighborhood.
For example, if we want each word in the corpus to be predicted by the four words nearest to it (two on each side), the set of relative indices of neighbor words is $N = \{-2, -1, +1, +2\}$, and the objective is to maximize $\sum_{i} \ln \Pr(w_i \mid w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2})$.
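To make the neighborhood concrete, the following minimal Python sketch lists the (context, target) pairs that this objective sums over. The function name `cbow_pairs` is hypothetical, and skipping positions that fall outside the corpus is an assumption made for illustration; the text does not say how corpus boundaries are handled.

```python
def cbow_pairs(tokens, offsets=(-2, -1, 1, 2)):
    """Return a list of (context_words, target_word), one per position.

    `offsets` plays the role of the set N of relative positions.
    Neighbors that fall outside the corpus are skipped (an assumption).
    """
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[i + j] for j in offsets if 0 <= i + j < len(tokens)]
        pairs.append((context, target))
    return pairs


corpus = "the cat sat on the mat".split()
for context, target in cbow_pairs(corpus):
    print(context, "->", target)
```

Each printed line corresponds to one term of the sum: the words on the left are the conditioning neighbors, and the word on the right is the one being predicted.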
In standard bag-of-words, a word's context is represented by a word-count (aka a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood $N = \{-2, -1, +1, +2\}$.
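The histogram in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `context_histogram` is hypothetical, and out-of-range neighbors are simply ignored.

```python
from collections import Counter


def context_histogram(tokens, i, offsets=(-2, -1, 1, 2)):
    """Count the words found at the given relative offsets around position i."""
    return Counter(tokens[i + j] for j in offsets if 0 <= i + j < len(tokens))


tokens = "the cat sat on the mat".split()
hist = context_histogram(tokens, tokens.index("sat"))
print(dict(hist))  # {'the': 2, 'cat': 1, 'on': 1}
```

The word "mat" sits at offset +3 from "sat", so it never enters the count.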
In continuous bag-of-words, the histogram is multiplied by a matrix $V$ to obtain a continuous representation of the word's context. The matrix $V$ is also called a dictionary. Its columns are the word vectors. It has $D$ columns, where $D$ is the size of the dictionary. Let $d$ be the length of each word vector. We have $V \in \mathbb{R}^{d \times D}$.
For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with $V$, we obtain $2v_{\text{the}} + v_{\text{cat}} + v_{\text{on}}$.
This is then multiplied with another matrix $V' \in \mathbb{R}^{D \times d}$. Each row of it is a word vector $v'$. This results in a vector of length $D$, one entry per dictionary entry. Then, apply the softmax to obtain a probability distribution over the dictionary.
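The three steps just described (histogram times V, then times V', then softmax) can be sketched as a single forward pass. The code below is a minimal illustration, not a reference implementation: the five-word dictionary, the vector length d = 3, and the random initial values are all assumptions, and plain Python lists are used instead of an array library to keep the sketch dependency-free.

```python
import math
import random

vocab = ["the", "cat", "sat", "on", "mat"]   # dictionary of size D = 5
index = {w: k for k, w in enumerate(vocab)}
D, d = len(vocab), 3                          # d = 3 is an arbitrary choice

rng = random.Random(0)
# V has shape d x D (columns are word vectors); V_prime has shape D x d.
V = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(d)]
V_prime = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(D)]


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xk for m, xk in zip(row, x)) for row in M]


def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Histogram of the context of "sat": {"the": 2, "cat": 1, "on": 1}.
hist = [0.0] * D
for word, count in {"the": 2, "cat": 1, "on": 1}.items():
    hist[index[word]] = count

hidden = matvec(V, hist)          # equals 2*v_the + v_cat + v_on, length d
scores = matvec(V_prime, hidden)  # one score per dictionary entry, length D
probs = softmax(scores)           # probability distribution over the dictionary

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

With untrained random matrices the printed probabilities are meaningless; training adjusts V and V' so that the entry for the true center word ("sat" here) becomes large.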
This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. 
Written out in full, the cross-entropy loss is
$$-\sum_{i} \ln \frac{e^{v'_{w_i} \cdot \left(\sum_{j \in N} v_{w_{j+i}}\right)}}{\sum_{w'} e^{v'_{w'} \cdot \left(\sum_{j \in N} v_{w_{j+i}}\right)}}$$
where the outer summation $\sum_{i}$ is over the words in a corpus, the quantity $\sum_{j \in N} v_{w_{j+i}}$ is the sum of a word's neighbors' vectors, and the sum $\sum_{w'}$ in the denominator runs over all words in the dictionary.
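As a rough illustration of how this loss could be evaluated, the sketch below computes it for a toy corpus. The function name `cbow_loss`, the treatment of corpus edges (out-of-range neighbors are skipped), and the random initialization are assumptions for the example, not part of the method as described here.

```python
import math
import random


def cbow_loss(tokens, V, V_prime, index, offsets=(-2, -1, 1, 2)):
    """CBOW cross-entropy loss summed over all positions i in the corpus.

    V is d x D (columns are input vectors v); V_prime is D x d (rows are
    output vectors v'). Out-of-range neighbors are skipped (an assumption).
    """
    d = len(V)
    loss = 0.0
    for i, target in enumerate(tokens):
        # h = sum over j in N of v_{w_{j+i}}
        h = [0.0] * d
        for j in offsets:
            if 0 <= i + j < len(tokens):
                col = index[tokens[i + j]]
                for r in range(d):
                    h[r] += V[r][col]
        # scores[w'] = v'_{w'} . h for every word w' in the dictionary
        scores = [sum(a * b for a, b in zip(row, h)) for row in V_prime]
        # Log of the softmax denominator, computed stably.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss -= scores[index[target]] - log_z
    return loss


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
index = {w: k for k, w in enumerate(vocab)}
D, d = len(vocab), 3
rng = random.Random(0)
V = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(d)]
V_prime = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(D)]
print(cbow_loss(tokens, V, V_prime, index))
```

A useful sanity check: if both matrices are all zeros, every score is zero, the softmax is uniform, and each term of the sum contributes ln D.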
Once such a system is trained, we have two trained matrices $V, V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of $V$ or the "sat"-th row of $V'$. It is also possible to simply define $V' = V^{\top}$, in which case there would no longer be a choice.
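The two ways of reading off a word's vector can be shown with a small lookup sketch. The matrix values below are made up for illustration and do not come from a trained model; the helper names `input_vector`, `output_vector`, and `transpose` are hypothetical.

```python
vocab = ["the", "cat", "sat", "on", "mat"]
index = {w: k for k, w in enumerate(vocab)}

# Toy stand-ins for trained matrices (values are made up for illustration).
# V is d x D with d = 2; V_prime is D x d.
V = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.6, 0.7, 0.8, 0.9, 1.0],
]
V_prime = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.6, 0.4],
]


def input_vector(word):
    """The column of V belonging to `word`."""
    k = index[word]
    return [row[k] for row in V]


def output_vector(word):
    """The row of V_prime belonging to `word`."""
    return V_prime[index[word]]


def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


print(input_vector("sat"))   # the "sat"-th column of V
print(output_vector("sat"))  # the "sat"-th row of V'

# With tied weights (V' defined as the transpose of V), both lookups agree.
V_prime_tied = transpose(V)
print(V_prime_tied[index["sat"]] == input_vector("sat"))
```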

=== Skip-gram ===

The idea of skip-gram is to represent each word with a vector, such that it is possible to predict the word's neighbors using the vector of the word.
The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word $w_i$ in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of $w_i$. The objective of training is to maximize $\sum_{i} \sum_{j \in N} \ln \Pr(w_{j+i} \mid w_i)$.
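The double sum in this objective has one term per (center word, neighbor) pair. A minimal sketch that enumerates those pairs is given below; the function name `skipgram_pairs`, the neighborhood of two words on each side, and the skipping of out-of-range neighbors are assumptions made for the example.

```python
def skipgram_pairs(tokens, offsets=(-2, -1, 1, 2)):
    """Return (center_word, neighbor_word) pairs, one per (i, j) term.

    Neighbors that fall outside the corpus are skipped (an assumption).
    """
    return [
        (tokens[i], tokens[i + j])
        for i in range(len(tokens))
        for j in offsets
        if 0 <= i + j < len(tokens)
    ]


corpus = "the cat sat on the mat".split()
pairs = skipgram_pairs(corpus)
for center, neighbor in pairs:
    print(center, "->", neighbor)
```

Each center word is used to predict its neighbors one at a time, so a word with four in-range neighbors contributes four separate terms to the objective.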
In full formula, the loss function is
$$-\sum_{i} \sum_{j \in N} \ln \frac{e^{v'_{w_{j+i}} \cdot v_{w_i}}}{\sum_{w'} e^{v'_{w'} \cdot v_{w_i}}}$$
Same as CBOW, once such a system is trained, we have two trained matrices $V, V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. It is also possible to simply define $V' = V^{\top}$, in which case there would no longer be a choice.
Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training.
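To illustrate the point about the objective, the sketch below evaluates the skip-gram loss using the same two matrix shapes as CBOW (V of shape d x D, V' of shape D x d); only the terms being summed differ. The function name `skipgram_loss`, the boundary handling, and the random initialization are assumptions for the example.

```python
import math
import random


def skipgram_loss(tokens, V, V_prime, index, offsets=(-2, -1, 1, 2)):
    """Sum of -ln Pr(w_{j+i} | w_i) over all positions i and offsets j.

    V is d x D (columns are input vectors v); V_prime is D x d (rows are
    output vectors v'). Out-of-range neighbors are skipped (an assumption).
    """
    d = len(V)
    loss = 0.0
    for i, center in enumerate(tokens):
        c = index[center]
        v = [V[r][c] for r in range(d)]  # v_{w_i}, a column of V
        # scores[w'] = v'_{w'} . v_{w_i} for every word w' in the dictionary
        scores = [sum(a * b for a, b in zip(row, v)) for row in V_prime]
        # Log of the softmax denominator, computed stably.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        for j in offsets:
            if 0 <= i + j < len(tokens):
                loss -= scores[index[tokens[i + j]]] - log_z
    return loss


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
index = {w: k for k, w in enumerate(vocab)}
D, d = len(vocab), 3
rng = random.Random(0)
V = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(d)]
V_prime = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(D)]
print(skipgram_loss(tokens, V, V_prime, index))
```

Here the input is a single column of V rather than a sum of neighbor columns, and the loss accumulates one term per neighbor rather than one term per center word.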