

| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Word2vec | 2/4 | https://en.wikipedia.org/wiki/Word2vec | reference | science, encyclopedia | 2026-05-05T10:14:12.458984+00:00 | kb-cron |

The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word $w_i$ in the corpus, the one-hot encodings of its neighbors are summed and used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of $w_i$. The objective of training is to maximize

$$\sum_{i} \ln \Pr(w_i \mid w_{i+j} \colon j \in N)$$

where $N$ is a set of (non-zero) indices representing the relative locations of nearby words considered to be in $w_i$'s neighborhood. For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words, the set of relative indices of neighbor words is $N = \{-2, -1, +1, +2\}$, and the objective is to maximize

$$\sum_{i} \ln \Pr(w_i \mid w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2})$$

In standard bag-of-words, a word's context is represented by a word count (also called a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood $N = \{-2, -1, +1, +2\}$.

In continuous bag-of-words, the histogram is multiplied by a matrix $V$ to obtain a continuous representation of the word's context. The matrix $V$ is also called a dictionary. Its columns are the word vectors. It has $D$ columns, where $D$ is the size of the dictionary. Let $d$ be the length of each word vector. We have $V \in \mathbb{R}^{d \times D}$. For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with $V$, we obtain $2v_{\text{the}} + v_{\text{cat}} + v_{\text{on}}$.

This is then multiplied with another matrix $V' \in \mathbb{R}^{D \times d}$. Each row of it is a word vector $v'$. This results in a vector of length $D$, one entry per dictionary entry. The softmax is then applied to obtain a probability distribution over the dictionary.

This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. In full formula, the cross-entropy loss is:

$$-\sum_{i} \ln \frac{e^{v'_{w_i} \cdot \left(\sum_{j \in N} v_{w_{j+i}}\right)}}{\sum_{w'} e^{v'_{w'} \cdot \left(\sum_{j \in N} v_{w_{j+i}}\right)}}$$

where the outer summation $\sum_{i}$ is over the words in a corpus, the quantity $\sum_{j \in N} v_{w_{j+i}}$ is the sum of the vectors of the neighbors of $w_i$, and the summation $\sum_{w'}$ in the denominator is over all words in the dictionary. Once such a system is trained, we have two trained matrices $V, V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of $V$ or the "sat"-th row of $V'$. It is also possible to simply define $V' = V^{\top}$, in which case there would no longer be a choice.
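The CBOW pipeline described above (histogram, then $V$, then $V'$, then softmax, then cross-entropy) can be sketched for a single position as follows. This is a minimal illustration, not a reference implementation: the function names, the choice $d = 3$, the seeded random initialization, and the decision to skip neighbor positions that fall outside the corpus are all assumptions made for the example.

```python
import math
import random
from collections import Counter

N = (-2, -1, 1, 2)  # relative neighbor offsets, as in the text

def context_histogram(tokens, i, offsets=N):
    """Count the words at positions i+j (j in offsets).

    Assumption: positions outside the corpus are skipped.
    """
    return Counter(tokens[i + j] for j in offsets if 0 <= i + j < len(tokens))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probs(hist, vocab, V, V_out):
    """linear (V) -> linear (V') -> softmax; V is d x D, V_out is D x d."""
    d = len(V)
    # h = V @ histogram, i.e. the count-weighted sum of the context word vectors
    h = [sum(V[k][vocab[w]] * c for w, c in hist.items()) for k in range(d)]
    # scores[r] = (row r of V') . h, one entry per dictionary word
    scores = [sum(V_out[r][k] * h[k] for k in range(d)) for r in range(len(vocab))]
    return softmax(scores)

tokens = "the cat sat on the mat".split()
vocab = {w: idx for idx, w in enumerate(sorted(set(tokens)))}
D, d = len(vocab), 3  # d = 3 is an arbitrary illustrative choice
rng = random.Random(0)
V = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(d)]      # d x D
V_out = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(D)]  # D x d

i = tokens.index("sat")
hist = context_histogram(tokens, i)
probs = cbow_probs(hist, vocab, V, V_out)
loss_i = -math.log(probs[vocab["sat"]])  # this position's term of the cross-entropy loss
print(dict(hist), round(sum(probs), 6), loss_i)
```

Summing `loss_i` over every position `i` gives the full cross-entropy loss; training would then adjust `V` and `V_out` by gradient descent, which this sketch leaves out.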

=== Skip-gram ===

The idea of skip-gram is to represent each word with a vector, such that it is possible to predict a word's neighbors using the vector of that word. The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word $w_i$ in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of $w_i$. The objective of training is to maximize

$$\sum_{i} \sum_{j \in N} \ln \Pr(w_{j+i} \mid w_i)$$

In full formula, the loss function is

$$-\sum_{i} \sum_{j \in N} \ln \frac{e^{v'_{w_{j+i}} \cdot v_{w_i}}}{\sum_{w'} e^{v'_{w'} \cdot v_{w_i}}}$$

As with CBOW, once such a system is trained, we have two trained matrices $V, V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. It is also possible to simply define $V' = V^{\top}$, in which case there would no longer be a choice. Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training.
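To make the skip-gram loss concrete, the sketch below evaluates it with the same two matrices $V$ and $V'$, using a single word vector as input and scoring each neighbor separately. It is a hedged illustration only: `skipgram_loss_at`, the toy corpus, $d = 3$, the seeded random initialization, and the skipping of out-of-corpus neighbor positions are assumptions for the example, not part of the original method description.

```python
import math
import random

N = (-2, -1, 1, 2)  # relative neighbor offsets, as in the text

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_loss_at(tokens, i, vocab, V, V_out, offsets=N):
    """Return (loss term for position i, neighbor words that were scored)."""
    d = len(V)
    center = vocab[tokens[i]]
    v_center = [V[k][center] for k in range(d)]  # column of V for w_i
    # scores[r] = (row r of V') . v_{w_i}
    scores = [sum(V_out[r][k] * v_center[k] for k in range(d)) for r in range(len(vocab))]
    probs = softmax(scores)
    # Assumption: neighbor positions outside the corpus are skipped.
    neighbors = [tokens[i + j] for j in offsets if 0 <= i + j < len(tokens)]
    loss = sum(-math.log(probs[vocab[w]]) for w in neighbors)
    return loss, neighbors

tokens = "the cat sat on the mat".split()
vocab = {w: idx for idx, w in enumerate(sorted(set(tokens)))}
D, d = len(vocab), 3  # d = 3 is an arbitrary illustrative choice
rng = random.Random(0)
V = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(d)]      # d x D
V_out = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(D)]  # D x d

loss_sat, neighbors = skipgram_loss_at(tokens, tokens.index("sat"), vocab, V, V_out)
total_loss = sum(skipgram_loss_at(tokens, i, vocab, V, V_out)[0] for i in range(len(tokens)))
print(neighbors, loss_sat, total_loss)
```

The softmax here is computed once per center word and reused for every neighbor, which mirrors the formula: the denominator depends only on $v_{w_i}$, while the numerator changes with each $w_{j+i}$.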