| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Alignment-free sequence analysis | 3/4 | https://en.wikipedia.org/wiki/Alignment-free_sequence_analysis | reference | science, encyclopedia | 2026-05-05T14:00:53.083806+00:00 | kb-cron |
==== Filtered Spaced-Word Matches (FSWM) ==== FSWM uses a pre-defined binary pattern P representing so-called match positions and don't-care positions. For a pair of input DNA sequences, it searches for spaced-word matches with respect to P, i.e. local gap-free alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Spurious low-scoring spaced-word matches are discarded, and evolutionary distances between the input sequences are estimated from the nucleotides aligned to each other at the don't-care positions of the remaining, homologous spaced-word matches. FSWM has also been adapted to estimate distances from unassembled NGS reads; this version of the program is called Read-SpaM.
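The spaced-word matching idea can be illustrated with a minimal Python sketch. It is not the FSWM program itself: the function names, the +1/-1 scoring of don't-care positions and the `min_score` threshold are assumptions made for illustration, and the result is a raw mismatch fraction rather than a corrected evolutionary distance.

```python
from collections import defaultdict

def spaced_word(seq, start, pattern):
    """Return the characters of seq at the match positions ('1') of pattern."""
    return "".join(seq[start + k] for k, c in enumerate(pattern) if c == "1")

def spaced_word_matches(a, b, pattern):
    """Yield (i, j) pairs where a and b agree at all match positions."""
    n = len(pattern)
    index = defaultdict(list)
    for i in range(len(a) - n + 1):
        index[spaced_word(a, i, pattern)].append(i)
    for j in range(len(b) - n + 1):
        for i in index.get(spaced_word(b, j, pattern), []):
            yield i, j

def mismatch_rate(a, b, pattern, min_score=0):
    """Fraction of mismatches at don't-care positions of retained matches.

    Score (an assumption of this sketch): +1 per identical don't-care
    position, -1 per mismatch. Matches scoring below min_score are discarded.
    Returns None when no match is retained.
    """
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    mismatches = total = 0
    for i, j in spaced_word_matches(a, b, pattern):
        diffs = sum(a[i + k] != b[j + k] for k in dont_care)
        score = (len(dont_care) - diffs) - diffs
        if score >= min_score:
            mismatches += diffs
            total += len(dont_care)
    return mismatches / total if total else None
```

For example, `mismatch_rate(seq_a, seq_b, "1100101")` would use a pattern with four match positions and three don't-care positions; the pattern string here is only a placeholder.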
==== Prot-SpaM ==== Prot-SpaM (Proteome-based Spaced-word Matches) is an implementation of the FSWM algorithm for partial or whole proteome sequences.
==== Multi-SpaM ==== Multi-SpaM (Multiple Spaced-word Matches) is an approach to genome-based phylogeny reconstruction that extends the FSWM idea to multiple sequence comparison. Given a binary pattern P of match positions and don't-care positions, the program searches for P-blocks, i.e. local gap-free four-way alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Such four-way alignments are randomly sampled from a set of input genome sequences. For each P-block, an unrooted tree topology is calculated using RAxML. The program Quartet MaxCut is then used to calculate a supertree from these trees.
=== Methods based on information theory === Information theory has provided successful methods for alignment-free sequence analysis and comparison. Existing applications include global and local characterization of DNA, RNA and proteins, ranging from estimating genome entropy to motif and region classification. It also holds promise in gene mapping, next-generation sequencing analysis and metagenomics.
==== Base–base correlation (BBC) ==== Base–base correlation (BBC) converts the genome sequence into a unique 16-dimensional numeric vector using the following equation,
$$T_{ij}(K)=\sum_{\ell=1}^{K}P_{ij}(\ell)\cdot\log_{2}\left(\frac{P_{ij}(\ell)}{P_{i}P_{j}}\right)$$
Here $P_{i}$ and $P_{j}$ denote the probabilities of bases i and j in the genome, and $P_{ij}(\ell)$ is the probability of bases i and j at distance ℓ in the genome. The parameter K is the maximum distance between the bases i and j. The variation in the values of the 16 parameters reflects variation in genome content and length.
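A minimal Python sketch of the BBC vector follows. The function name `bbc_vector` and the estimation choices are assumptions of this sketch: base probabilities are taken as simple frequencies, $P_{ij}(\ell)$ is estimated from all position pairs that are ℓ apart, and terms with $P_{ij}(\ell)=0$ are treated as contributing zero.

```python
import math
from itertools import product

BASES = "ACGT"

def bbc_vector(seq, K):
    """Return the 16 values T_ij(K) in the order AA, AC, ..., TT."""
    n = len(seq)
    p = {b: seq.count(b) / n for b in BASES}
    t = {pair: 0.0 for pair in product(BASES, repeat=2)}
    for l in range(1, K + 1):
        pairs = n - l  # number of position pairs at distance l
        if pairs <= 0:
            break
        counts = {pair: 0 for pair in t}
        for x, y in zip(seq, seq[l:]):
            if (x, y) in counts:  # skip symbols outside ACGT
                counts[(x, y)] += 1
        for (i, j), c in counts.items():
            if c:  # 0 * log(0) is taken as 0
                p_ij = c / pairs
                t[(i, j)] += p_ij * math.log2(p_ij / (p[i] * p[j]))
    return [t[pair] for pair in product(BASES, repeat=2)]
```

Two genomes can then be compared through their 16-dimensional vectors, for instance with any standard vector distance.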
==== Information correlation and partial information correlation (IC-PIC) ==== The IC-PIC (information correlation and partial information correlation) based method exploits the base correlation property of DNA sequences. IC and PIC are calculated using the following formulas,
$$IC_{\ell}=-2\sum_{i}P_{i}\log_{2}P_{i}+\sum_{ij}P_{ij}(\ell)\log_{2}P_{ij}(\ell)$$
$$PIC_{ij}(\ell)=(P_{ij}(\ell)-P_{i}P_{j}(\ell))^{2}$$
The final vector is obtained as follows:
$$V={IC_{\ell} \over PIC_{ij}(\ell)}{\text{ where }}\ell\in\left\{\ell_{0},\ell_{0}+1,\ldots,\ell_{0}+n\right\},$$
which defines the range of distances between bases. The pairwise distance between sequences is calculated using the Euclidean distance measure. The resulting distance matrix can be used to construct a phylogenetic tree using clustering algorithms such as neighbor-joining or UPGMA.
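The following Python sketch shows one way the IC and PIC values and the Euclidean distance could be computed. Several points are assumptions of this sketch rather than details taken from the method: the product term in PIC is read as the product of the marginal base frequencies, the feature vector is built by concatenating $IC_{\ell}$ and the 16 $PIC_{ij}(\ell)$ values for each ℓ, and the function names are hypothetical. The sequence is assumed to be longer than $\ell_{0}+n$.

```python
import math
from itertools import product

BASES = "ACGT"
PAIRS = list(product(BASES, repeat=2))

def pair_probs(seq, l):
    """Estimate P_ij(l) from all position pairs that are l apart."""
    total = len(seq) - l
    counts = dict.fromkeys(PAIRS, 0)
    for x, y in zip(seq, seq[l:]):
        if (x, y) in counts:
            counts[(x, y)] += 1
    return {pair: c / total for pair, c in counts.items()}

def ic_pic_vector(seq, l0, n):
    """Concatenate IC_l and the 16 PIC_ij(l) values for l = l0 .. l0 + n."""
    p = {b: seq.count(b) / len(seq) for b in BASES}
    h = -sum(q * math.log2(q) for q in p.values() if q > 0)  # base entropy
    vec = []
    for l in range(l0, l0 + n + 1):
        p_l = pair_probs(seq, l)
        # IC_l = -2 * sum_i P_i log2 P_i + sum_ij P_ij(l) log2 P_ij(l)
        ic = 2 * h + sum(q * math.log2(q) for q in p_l.values() if q > 0)
        vec.append(ic)
        # Assumption: the product term uses the marginal frequencies P_i, P_j.
        vec.extend((p_l[(i, j)] - p[i] * p[j]) ** 2 for i, j in PAIRS)
    return vec

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Computing `euclidean(ic_pic_vector(a, l0, n), ic_pic_vector(b, l0, n))` for every pair of sequences gives a distance matrix that a clustering method can take as input.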
==== Compression ==== Compression-based methods rely on effective approximations to Kolmogorov complexity, for example Lempel-Ziv complexity. In general, they use the mutual information between the sequences. This is expressed in terms of conditional Kolmogorov complexity, that is, the length of the shortest self-delimiting program required to generate a string given prior knowledge of the other string. This measure is related to measuring k-words in a sequence, as they can easily be used to generate the sequence. It is sometimes a computationally intensive method. The theoretical basis for the Kolmogorov complexity approach was laid by Bennett, Gacs, Li, Vitanyi, and Zurek (1998), who proposed the information distance. Because Kolmogorov complexity is incomputable, it is approximated by compression algorithms; the better the compressor, the better the approximation. Li, Badger, Chen, Kwong, Kearney, and Zhang (2001) used a non-optimal but normalized form of this approach. The optimal normalized form was given by Li, Chen, Li, Ma, and Vitanyi (2003) and was treated more extensively, with proofs, by Cilibrasi and Vitanyi (2005). Otu and Sayood (2003) used the Lempel-Ziv complexity method to construct five different distance measures for phylogenetic tree construction.
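As an illustration of approximating Kolmogorov complexity with a real compressor, the sketch below computes the normalized compression distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. Using Python's `zlib` as the compressor is an assumption made for convenience; it is a general-purpose compressor, not one tuned for DNA, and the helper names are hypothetical.

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed length in bytes, used as a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that one sequence adds little information given the other, while values near 1 suggest little shared structure. Because real compressors are imperfect, `ncd(x, x)` is usually small but not exactly 0.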
==== Context modeling compression ==== In context modeling compression, the next-symbol predictions of one or more statistical models are combined, or compete, to yield a prediction based on events recorded in the past. The algorithmic information content derived from each symbol prediction can be used to compute algorithmic information profiles in time proportional to the length of the sequence. The process has been applied to DNA sequence analysis.
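A toy version of such an information profile can be sketched in Python. The details are assumptions of this sketch, not a description of any published tool: the models are adaptive order-k context models with add-one initial counts, their predictions are combined by plain averaging, the default orders (1, 3) are arbitrary, and the input is assumed to contain only the symbols A, C, G and T.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def information_profile(seq, orders=(1, 3)):
    """Per-symbol information content (bits) under averaged adaptive models."""
    # One count table per model; every context starts with a count of 1 per symbol.
    counts = [defaultdict(lambda: dict.fromkeys(ALPHABET, 1)) for _ in orders]
    profile = []
    for pos, sym in enumerate(seq):
        contexts = [seq[max(0, pos - k):pos] for k in orders]
        probs = []
        for table, ctx in zip(counts, contexts):
            c = table[ctx]
            probs.append(c[sym] / sum(c.values()))
        # Combine the models by averaging their predicted probabilities.
        profile.append(-math.log2(sum(probs) / len(probs)))
        # Update each model only after the symbol has been predicted.
        for table, ctx in zip(counts, contexts):
            table[ctx][sym] += 1
    return profile
```

Each position is handled with a constant amount of work for fixed model orders, so the profile is computed in time proportional to the sequence length. Low values mark regions the models predict well, such as repeats.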
=== Methods based on graphical representation ===