kb/BLOSUM-1.md at deea1695831fb31a41f25febc8dbcdfb185cd42d

turtle89431 2e50ba1868 Scrape wikipedia-science: 15617 new, 4054 updated, 20200 total (kb-cron)

2026-05-05 07:02:36 -07:00


title	chunk	source	category	tags	date_saved	instance
BLOSUM	2/3	https://en.wikipedia.org/wiki/BLOSUM	reference	science, encyclopedia	2026-05-05T14:01:55.384613+00:00	kb-cron

{\displaystyle LogOddRatio=2\log _{2}{\left({\frac {P\left(O\right)}{P\left(E\right)}}\right)}}

where {\displaystyle P\left(O\right)} is the probability of observing the pair and {\displaystyle P\left(E\right)} is the expected probability of such a pair occurring, given the background probabilities of each amino acid.
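As a minimal sketch of the formula above, the following Python function evaluates the log-odds ratio for a single pair. The function name `log_odd_ratio` and the example probabilities are illustrative assumptions, not values taken from the BLOCKS database.

```python
import math


def log_odd_ratio(p_observed: float, p_expected: float) -> float:
    """Return 2 * log2(P(O) / P(E)) for one amino-acid pair."""
    if p_observed <= 0 or p_expected <= 0:
        raise ValueError("probabilities must be positive")
    return 2 * math.log2(p_observed / p_expected)


# Hypothetical probabilities, chosen only to show the sign convention.
more_than_chance = log_odd_ratio(0.02, 0.005)   # observed 4x more often than expected
as_chance = log_odd_ratio(0.01, 0.01)           # observed exactly as often as expected
less_than_chance = log_odd_ratio(0.0025, 0.01)  # observed 4x less often than expected
```

A pair observed more often than expected gets a positive value, a pair observed exactly as often as expected gets zero, and a pair observed less often than expected gets a negative value.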

=== BLOSUM Matrices ===

The odds for relatedness are calculated from the log-odds ratios, which are then rounded to give the BLOSUM substitution matrices.

=== Score of the BLOSUM matrices ===

A scoring matrix, or table of values, is required for evaluating the significance of a sequence alignment, for example by describing the probability of a biologically meaningful amino-acid or nucleotide residue pair occurring in an alignment. Typically, when two nucleotide sequences are compared, all that is scored is whether or not the two bases at a given position are the same. All matches receive the same score and all mismatches receive the same score (typically +1 or +5 for matches, and -1 or -4 for mismatches).

Proteins are different. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another. The objective is to provide a relatively heavy penalty for aligning two residues if they have a low probability of being homologous (correctly aligned by evolutionary descent). Two major forces drive amino-acid substitution rates away from uniformity: substitutions occur with different frequencies, and some substitutions are less functionally tolerated than others. Thus, substitutions are selected against.

Commonly used substitution matrices include the blocks substitution (BLOSUM) and point accepted mutation (PAM) matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods.

Scores within a BLOSUM matrix are log-odds scores: for a pair of aligned amino acids, each score is the logarithm of the ratio of the likelihood that the two amino acids appear together for a biological reason to the likelihood that they appear together by chance. The matrices are based on the minimum percentage identity of the aligned protein sequences used in calculating them. Every possible identity or substitution is assigned a score based on its observed frequency in alignments of related proteins. A positive score is given to the more likely substitutions, while a negative score is given to the less likely substitutions. To calculate a BLOSUM matrix, the following equation is used:

{\displaystyle S_{ij}={\frac {1}{\lambda }}\log {\frac {p_{ij}}{q_{i}q_{j}}}}

Here, {\displaystyle p_{ij}} is the probability of two amino acids {\displaystyle i} and {\displaystyle j} replacing each other in a homologous sequence, and {\displaystyle q_{i}} and {\displaystyle q_{j}} are the background probabilities of finding the amino acids {\displaystyle i} and {\displaystyle j} in any protein sequence. The factor {\displaystyle \lambda } is a scaling factor, set such that the matrix contains easily computable integer values.
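The scoring equation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original BLOSUM program: the function name, the toy two-letter alphabet, and its probabilities are made up, and the scaling factor is assumed to be ln(2)/2 so that scores come out in half-bit units.

```python
import math

# Assumed scaling factor: lambda = ln(2) / 2 yields scores in half-bit units.
HALF_BIT_LAMBDA = math.log(2) / 2


def blosum_style_scores(pair_probs, background, lam=HALF_BIT_LAMBDA):
    """Compute rounded S_ij = (1 / lam) * ln(p_ij / (q_i * q_j)) for each pair."""
    scores = {}
    for (i, j), p_ij in pair_probs.items():
        expected = background[i] * background[j]
        scores[(i, j)] = round(math.log(p_ij / expected) / lam)
    return scores


# Toy two-letter alphabet with made-up probabilities (not real amino-acid data).
pair_probs = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
background = {"A": 0.5, "B": 0.5}

scores = blosum_style_scores(pair_probs, background)
```

With this choice of scaling factor the expression reduces to 2 log2(p_ij / (q_i q_j)), which has the same form as the log-odds ratio given earlier. In the toy data, identical pairs score 1 and mismatched pairs score -3 after rounding.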

== Variants ==

=== BLOSUM ===

BLOSUM80: more related proteins
BLOSUM62: midrange
BLOSUM45: distantly related proteins

Figure: the BLOSUM62 matrix, with the amino acids in the table grouped according to the chemistry of the side chain, as in (a).

Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, by the probability that the same two amino acids might align by chance. The ratio is then converted to a logarithm and expressed as a log-odds score, as for PAM. BLOSUM matrices are usually scaled in half-bit units. A score of zero indicates that the two amino acids were found aligned in the database as often as expected by chance, a positive score indicates that they were found aligned more often than by chance, and a negative score indicates that they were found aligned less often than by chance.
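To make the half-bit scaling concrete, the sketch below inverts the log-odds relationship to recover the approximate odds ratio behind a score. The function names are hypothetical, and because published matrix entries are rounded to integers, the recovered ratio is only approximate.

```python
def odds_ratio_from_half_bit_score(score: float) -> float:
    """Invert score = 2 * log2(ratio), giving ratio = 2 ** (score / 2)."""
    return 2 ** (score / 2)


def interpret(score: int) -> str:
    """Describe what the sign of a log-odds score means."""
    if score > 0:
        return "aligned more often than by chance"
    if score < 0:
        return "aligned less often than by chance"
    return "aligned as often as expected by chance"


ratio_for_plus_four = odds_ratio_from_half_bit_score(4)   # 4.0
ratio_for_zero = odds_ratio_from_half_bit_score(0)        # 1.0
ratio_for_minus_two = odds_ratio_from_half_bit_score(-2)  # 0.5
```

Under this scaling, each increase of two score units corresponds to a doubling of the observed-to-expected ratio.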

=== PMB ===

PMB (Probability Matrix from Blocks) of 2004 uses the additivity of evolutionary distances to improve on BLOSUM's analysis of the BLOCKS database. The then up-to-date 2001 version of BLOCKS was used to generate a new set of BLOSUM matrices. The "observed substitution frequencies" found in these BLOSUM matrices are used to estimate actual substitution frequencies (at higher evolutionary distance, i.e. lower r, some later replacements can mask earlier ones). PMB thus defines a true evolutionary model, as PAM and JTT do. It is not a symmetric matrix.

=== RBLOSUM ===

The original code written by Henikoff and Henikoff does not exactly follow their paper's description of the algorithm. The BLOSUM62 matrix produced by that program has been used as a standard for many years. Surprisingly, the miscalculated BLOSUM62 improves search performance compared to the 2008 corrected version of the same relative entropy (RBLOSUM64). A 2018 article claims that RBLOSUM is better than BLOSUM and CorBLOSUM.

=== CorBLOSUM ===

A 2016 paper finds further errors in the original code not addressed by the 2008 RBLOSUM correction. The corrected version from this paper, CorBLOSUM, manages to be more effective than BLOSUM at similarity search in about 75% of cases.

== Some uses in bioinformatics ==

=== Research applications ===

BLOSUM scores have been used to predict and understand surface gene variants among hepatitis B virus carriers, and to study T-cell epitopes.

==== Surface gene variants among hepatitis B virus carriers ====

DNA sequences of HBsAg were obtained from 180 patients, of whom 51 were chronic HBV carriers and 129 were newly diagnosed, and compared with consensus sequences built from 168 HBV sequences imported from GenBank. A literature review and BLOSUM scores were used to define potentially altered antigenicity.
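One way such a screen might look is sketched below. The source does not state which matrix or cutoff the study used, so the negative-score threshold, the function name, and the example substitutions are all assumptions for illustration. The dictionary holds only a small excerpt of BLOSUM62 off-diagonal scores; a real analysis would load the full matrix.

```python
# Small excerpt of BLOSUM62 off-diagonal scores, keyed by unordered residue pair.
BLOSUM62_EXCERPT = {
    frozenset("GR"): -2,
    frozenset("IV"): 3,
    frozenset("KR"): 2,
    frozenset("DE"): 2,
    frozenset("LP"): -3,
}


def flag_substitutions(substitutions, threshold=0):
    """Return substitutions scoring below `threshold` (an assumed cutoff)."""
    flagged = []
    for reference, variant in substitutions:
        score = BLOSUM62_EXCERPT[frozenset((reference, variant))]
        if score < threshold:
            flagged.append((reference, variant, score))
    return flagged


# Hypothetical reference -> variant residue changes, not taken from the study.
observed = [("G", "R"), ("I", "V"), ("L", "P"), ("K", "R")]
flagged = flag_substitutions(observed)
```

Here the glycine-to-arginine and leucine-to-proline changes are flagged because their scores are negative, while the conservative isoleucine-to-valine and lysine-to-arginine changes are not.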
