kb/data/en.wikipedia.org/wiki/Cosine_similarity-0.md

14 KiB
Raw Blame History

title chunk source category tags date_saved instance
Cosine similarity 1/3 https://en.wikipedia.org/wiki/Cosine_similarity reference science, encyclopedia 2026-05-05T09:53:42.619881+00:00 kb-cron

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval

    [
    
    1
    ,
    +
    1
    ]
    .
  

{\displaystyle [-1,+1].}

For example, two proportional vectors have a cosine similarity of +1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of 1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in

    [
    0
    ,
    1
    ]
  

{\displaystyle [0,1]}

. For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents. The technique is also used to measure cohesion within clusters in the field of data mining. One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered. Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the OtsukaOchiai similarity (see below) is cosine similarity applied to binary data.

== Definition == The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

      A
    
    ⋅
    
      B
    
    =
    
      ‖
      
        A
      
      ‖
    
    
      ‖
      
        B
      
      ‖
    
    cos
    
    θ
  

{\displaystyle \mathbf {A} \cdot \mathbf {B} =\left\|\mathbf {A} \right\|\left\|\mathbf {B} \right\|\cos \theta }

Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

      cosine similarity
    
    =
    
      S
      
        C
      
    
    (
    A
    ,
    B
    )
    :=
    cos
    
    (
    θ
    )
    =
    
      
        
          
            A
          
          ⋅
          
            B
          
        
        
          ‖
          
            A
          
          ‖
          ‖
          
            B
          
          ‖
        
      
    
    =
    
      
        
          
            ∑
            
              i
              =
              1
            
            
              n
            
          
          
            
              A
              
                i
              
            
            
              B
              
                i
              
            
          
        
        
          
            
              
                ∑
                
                  i
                  =
                  1
                
                
                  n
                
              
              
                
                  A
                  
                    i
                  
                  
                    2
                  
                
              
            
          
          ⋅
          
            
              
                ∑
                
                  i
                  =
                  1
                
                
                  n
                
              
              
                
                  B
                  
                    i
                  
                  
                    2
                  
                
              
            
          
        
      
    
    ,
  

{\displaystyle {\text{cosine similarity}}=S_{C}(A,B):=\cos(\theta )={\mathbf {A} \cdot \mathbf {B}  \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits _{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}\cdot {\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}},}

where

      A
      
        i
      
    
  

{\displaystyle A_{i}}

and

      B
      
        i
      
    
  

{\displaystyle B_{i}}

are the

    i
  

{\displaystyle i}

th components of vectors

      A
    
  

{\displaystyle \mathbf {A} }

and

      B
    
  

{\displaystyle \mathbf {B} }

, respectively. The resulting similarity ranges from 1 meaning exactly opposite, to +1 meaning exactly the same, with 0 indicating orthogonality (no correlation), while in-between values indicate intermediate similarity or dissimilarity. For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from

    0
    →
    1
  

{\displaystyle 0\to 1}

, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°. If the attribute vectors are normalized by subtracting the vector means (e.g.,

    A
    
    
      
        
          A
          ¯
        
      
    
  

{\displaystyle A-{\bar {A}}}

), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering,

      if
    
    
    A
    =
    [
    
      A
      
        1
      
    
    ,
    
      A
      
        2
      
    
    
      ]
      
        T
      
    
    ,
    
       then 
    
    
      
        
          A
          ¯
        
      
    
    =
    
      
        [
        
          
            
              
                (
                
                  A
                  
                    1
                  
                
                +
                
                  A
                  
                    2
                  
                
                )
              
              2
            
          
          ,
          
            
              
                (
                
                  A
                  
                    1
                  
                
                +
                
                  A
                  
                    2
                  
                
                )
              
              2
            
          
        
        ]
      
      
        T
      
    
    ,
  

{\displaystyle {\text{if}}\,A=[A_{1},A_{2}]^{T},{\text{ then }}{\bar {A}}=\left[{\frac {(A_{1}+A_{2})}{2}},{\frac {(A_{1}+A_{2})}{2}}\right]^{T},}




  
    
       so 
    
    A
    
    
      
        
          A
          ¯
        
      
    
    =
    
      
        [
        
          
            
              
                (
                
                  A
                  
                    1
                  
                
                
                
                  A
                  
                    2
                  
                
                )
              
              2
            
          
          ,
          
            
              
                (
                
                
                  A
                  
                    1
                  
                
                +
                
                  A
                  
                    2
                  
                
                )
              
              2
            
          
        
        ]
      
      
        T
      
    
    .
  

{\displaystyle {\text{ so }}A-{\bar {A}}=\left[{\frac {(A_{1}-A_{2})}{2}},{\frac {(-A_{1}+A_{2})}{2}}\right]^{T}.}

=== Cosine distance === When the distance between two unit-length vectors is defined to be the length of their vector difference then

    dist
    
    (
    
      A
    
    ,
    
      B
    
    )
    =
    
      
        (
        
          A
        
        
        
          B
        
        )
        ⋅
        (
        
          A
        
        
        
          B
        
        )
      
    
    =
    
      
        
          A
        
        ⋅
        
          A
        
        
        2
        (
        
          A
        
        ⋅
        
          B
        
        )
        +
        
          B
        
        ⋅
        
          B
        
      
    
    =
    
      
        2
        (
        1
        
        
          S
          
            C
          
        
        (
        
          A
        
        ,
        
          B
        
        )
        )
      
    
    
    .
  

{\displaystyle \operatorname {dist} (\mathbf {A} ,\mathbf {B} )={\sqrt {(\mathbf {A} -\mathbf {B} )\cdot (\mathbf {A} -\mathbf {B} )}}={\sqrt {\mathbf {A} \cdot \mathbf {A} -2(\mathbf {A} \cdot \mathbf {B} )+\mathbf {B} \cdot \mathbf {B} }}={\sqrt {2(1-S_{C}(\mathbf {A} ,\mathbf {B} ))}}\,.}

Nonetheless the cosine distance is often defined without the square root or factor of 2:

      cosine distance
    
    =
    
      D
      
        C
      
    
    (
    A
    ,
    B
    )
    :=
    1
    
    
      S
      
        C
      
    
    (
    A
    ,
    B
    )
    
    .
  

{\displaystyle {\text{cosine distance}}=D_{C}(A,B):=1-S_{C}(A,B)\,.}

It is important to note that, by virtue of being proportional to squared Euclidean distance, the cosine distance is not a true distance metric; it does not exhibit the triangle inequality property — or, more formally, the Schwarz inequality — and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to Euclidean distance

        2
        (
        1
        
        
          S
          
            C
          
        
        (
        A
        ,
        B
        )
        )
      
    
  

{\textstyle {\sqrt {2(1-S_{C}(A,B))}}}

or angular distance θ = arccos(SC(A, B)). Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.

=== Angular distance and similarity === The normalized angle, referred to as angular distance, between any two vectors

    A
  

{\displaystyle A}

and

    B
  

{\displaystyle B}

is a formal distance metric and can be calculated from the cosine similarity. The complement of the angular distance metric can then be used to define angular similarity function bounded between 0 and 1, inclusive. When the vector elements may be positive or negative: