
title: Cosine similarity
chunk: 3/3
source: https://en.wikipedia.org/wiki/Cosine_similarity
category: reference
tags: science, encyclopedia
date_saved: 2026-05-05T09:53:42.619881+00:00
instance: kb-cron
{\displaystyle \cos(\angle {AC}-\angle {CB})\geq \cos(\angle {AB})\geq \cos(\angle {AC}+\angle {CB}).}

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,

{\displaystyle \cos(A,C)\cdot \cos(C,B)+{\sqrt {\left(1-\cos(A,C)^{2}\right)\cdot \left(1-\cos(C,B)^{2}\right)}}\geq \cos(A,B),}




  
{\displaystyle \cos(A,B)\geq \cos(A,C)\cdot \cos(C,B)-{\sqrt {\left(1-\cos(A,C)^{2}\right)\cdot \left(1-\cos(C,B)^{2}\right)}}.}

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if their similarities to a reference object C are already known. This is used, for example, in metric data indexing, and it has also been used to accelerate spherical k-means clustering in the same way that the Euclidean triangle inequality has been used to accelerate regular k-means.
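As a rough illustration of how the two inequalities bound cos(A, B), the sketch below computes the lower and upper bound from the two known similarities. The function name `cosine_bounds` and the sample values are hypothetical and chosen only for this example. The square-root term equals sin(∠AC)·sin(∠CB), which is non-negative because both angles lie in [0, π].

```python
import math


def cosine_bounds(cos_ac, cos_cb):
    """Return (lower, upper) bounds on cos(A, B) given cos(A, C) and cos(C, B)."""
    # Clamp inputs to [-1, 1] to guard against small floating-point overshoot.
    cos_ac = max(-1.0, min(1.0, cos_ac))
    cos_cb = max(-1.0, min(1.0, cos_cb))
    product = cos_ac * cos_cb
    # sqrt((1 - cos^2)(1 - cos^2)) is the product of the two sines.
    slack = math.sqrt((1.0 - cos_ac ** 2) * (1.0 - cos_cb ** 2))
    return product - slack, product + slack


# Hypothetical similarities of A and B to a shared reference object C.
lower, upper = cosine_bounds(0.9, 0.8)
print(f"{lower:.3f} <= cos(A, B) <= {upper:.3f}")
```

In an index or clustering loop, a candidate could be skipped without computing its exact similarity whenever `upper` is already below the best similarity found so far.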

== Soft cosine measure ==

A soft cosine (or "soft" similarity) between two vectors takes into account similarities between pairs of features. The traditional cosine similarity treats the features of the vector space model (VSM) as independent or completely different, whereas the soft cosine measure considers the similarity of features in the VSM, which helps generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams can be quite similar, though formally they are treated as different features in the VSM. For instance, the words "play" and "game" are different and are therefore mapped to different points in the VSM, yet they are semantically related. For n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, it can be applied to words as well).

To calculate the soft cosine, a matrix s is used to indicate the similarity between features. It can be computed from Levenshtein distance, WordNet similarity, or other similarity measures, and the vectors are then multiplied by this matrix. Given two N-dimensional vectors a and b, the soft cosine similarity is calculated as follows:

{\displaystyle {\begin{aligned}\operatorname {soft\_cosine} _{1}(a,b)={\frac {\sum \nolimits _{i,j}^{N}s_{ij}a_{i}b_{j}}{{\sqrt {\sum \nolimits _{i,j}^{N}s_{ij}a_{i}a_{j}}}{\sqrt {\sum \nolimits _{i,j}^{N}s_{ij}b_{i}b_{j}}}}},\end{aligned}}}

where s_ij = similarity(feature_i, feature_j). If there is no similarity between distinct features (s_ii = 1, s_ij = 0 for i ≠ j), the equation reduces to the conventional cosine similarity formula. The time complexity of this measure is quadratic in the number of features N, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic. An efficient implementation of the soft cosine similarity is included in the Gensim open-source library.
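The formula can be sketched directly in plain Python. The helper names `bilinear` and `soft_cosine` and the example similarity matrices are assumptions made for illustration, and the sketch is not how Gensim implements the measure. The double loop over i and j is the source of the quadratic cost. The vectors are assumed to be non-zero so that the denominator does not vanish.

```python
import math


def bilinear(s, x, y):
    """Sum of s[i][j] * x[i] * y[j] over all index pairs (i, j)."""
    return sum(
        s[i][j] * x[i] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


def soft_cosine(a, b, s):
    """Soft cosine of vectors a and b under feature-similarity matrix s."""
    numerator = bilinear(s, a, b)
    denominator = math.sqrt(bilinear(s, a, a)) * math.sqrt(bilinear(s, b, b))
    return numerator / denominator


# Two documents that each use one distinct feature, e.g. "play" and "game".
a = [1.0, 0.0]
b = [0.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]  # no similarity between distinct features
related = [[1.0, 0.5], [0.5, 1.0]]   # hypothetical similarity of 0.5

print(soft_cosine(a, b, identity))  # 0.0, same as the ordinary cosine
print(soft_cosine(a, b, related))   # 0.5
```

With the identity matrix the two vectors are orthogonal, while the off-diagonal similarity gives them a non-zero soft cosine even though they share no feature.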

== See also ==

* Sørensen–Dice coefficient
* Hamming distance
* Correlation
* Jaccard index
* SimRank
* Information retrieval


== External links ==

* Weighted cosine measure
* A tutorial on cosine similarity using Python