---
title: "Alignment-free sequence analysis"
chunk: 2/4
source: "https://en.wikipedia.org/wiki/Alignment-free_sequence_analysis"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:00:53.083806+00:00"
instance: "kb-cron"
---
==== Spaced-word frequencies ====
While most alignment-free algorithms compare the word composition of sequences, Spaced Words uses a pattern of match ("care") and don't-care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't-care positions are ignored. Instead of comparing the frequencies of contiguous words in the input sequences, this approach compares the frequencies of the spaced words according to the pre-defined pattern. The pattern can be selected by analysing the variance of the number of matches, the probability of the first occurrence under several models, or the Pearson correlation coefficient between the expected word frequency and the true alignment distance.
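As a minimal sketch of the idea above, the following Python counts spaced words for a binary pattern. The encoding of the pattern as a string of `1` (match position) and `0` (don't-care position) and the function name `spaced_word_counts` are assumptions made for this illustration, not part of any particular tool.

```python
from collections import Counter


def spaced_word_counts(seq: str, pattern: str) -> Counter:
    """Count spaced words in seq.

    pattern is a string of '1' (match position) and '0' (don't-care
    position); this encoding is an assumption of the sketch.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    counts = Counter()
    # Slide the pattern over the sequence, one position at a time.
    for start in range(len(seq) - len(pattern) + 1):
        # Keep only the characters at the match positions.
        word = "".join(seq[start + i] for i in match_positions)
        counts[word] += 1
    return counts


counts = spaced_word_counts("ACGTACGA", "101")
# The windows ACG (twice), CGT, GTA, TAC and CGA yield the spaced words
# AG (twice), CT, GA, TC and CA.
```

Two sequences can then be compared through these count vectors in the same way as contiguous word counts would be.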
=== Methods based on length of common substrings ===
The methods in this category exploit the similarities and differences of substrings in a pair of sequences. These algorithms were mostly used for string processing in computer science.
==== Average common substring (ACS) ====
In this approach, for a chosen pair of sequences (A and B, of lengths n and m respectively), the longest substring starting at a given position in one sequence (A) that exactly matches a substring at any position in the other sequence (B) is identified. In this way, the lengths of the longest substrings starting at different positions in sequence A and having exact matches at some positions in sequence B are calculated. All these lengths are averaged to derive a measure $L(A,B)$. Intuitively, the larger $L(A,B)$ is, the more similar the two sequences are. To account for differences in the lengths of the sequences, $L(A,B)$ is normalized [i.e. $L(A,B)/\log(m)$]. This gives the similarity measure between the sequences.
In order to derive a distance measure, the inverse of the similarity measure is taken and a correction term is subtracted from it to ensure that $d(A,A)$ will be zero. Thus

$$d(A,B)=\left[\frac{\log m}{L(A,B)}\right]-\left[\frac{\log n}{L(A,A)}\right].$$
This measure $d(A,B)$ is not symmetric, so one has to compute $d_{s}(A,B)=d_{s}(B,A)=(d(A,B)+d(B,A))/2$, which gives the final ACS measure between the two strings (A and B). The substring search can be performed efficiently using suffix trees.
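The ACS computation can be sketched as follows. This is a naive illustration that uses Python's substring test in place of suffix trees, so it is only suitable for short sequences. The function names are hypothetical, and the sketch assumes the two sequences share at least one character, since otherwise $L(A,B)$ is zero and the distance is undefined.

```python
import math


def longest_match_lengths(a: str, b: str) -> list[int]:
    """For each start position i in a, the length of the longest
    substring a[i:i+l] that occurs somewhere in b (naive search)."""
    lengths = []
    for i in range(len(a)):
        length = 0
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        lengths.append(length)
    return lengths


def acs_L(a: str, b: str) -> float:
    """Average common substring length L(A,B)."""
    lengths = longest_match_lengths(a, b)
    return sum(lengths) / len(lengths)


def acs_d(a: str, b: str) -> float:
    """Asymmetric d(A,B) = log(m)/L(A,B) - log(n)/L(A,A),
    with n = len(a) and m = len(b).
    Assumes L(A,B) > 0 (the sequences share at least one character)."""
    return math.log(len(b)) / acs_L(a, b) - math.log(len(a)) / acs_L(a, a)


def acs_distance(a: str, b: str) -> float:
    """Symmetrised ACS distance d_s(A,B) = (d(A,B) + d(B,A)) / 2."""
    return (acs_d(a, b) + acs_d(b, a)) / 2
```

For example, `acs_distance("ACGTAC", "ACGGAC")` is positive, while `acs_d(x, x)` is zero for any non-empty `x` because the two terms cancel.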
==== Mutation distances (Kr) ====
This approach, which is closely related to ACS, calculates the number of substitutions per site between two DNA sequences using the shortest absent substring (termed a shustring).
=== Methods based on the number of (spaced) word matches ===
==== $D_{2}^{S}$ and $D_{2}^{*}$ ====
These approaches are variants of the $D_{2}$ statistic, which counts the number of $k$-mer matches between two sequences. They improve on the simple $D_{2}$ statistic by taking the background distribution of the compared sequences into account.
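The plain $D_{2}$ statistic can be sketched in a few lines of Python: every pair of positions (one in each sequence) at which the same $k$-mer starts is one match, so the total equals the sum over all words of the product of their counts in the two sequences. The background-corrected variants are not reproduced here, and the function names are assumptions of this sketch.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a: str, b: str, k: int) -> int:
    """Number of k-mer matches between a and b:
    the sum over words w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # A Counter returns 0 for words that do not occur.
    return sum(n * counts_b[word] for word, n in counts_a.items())


print(d2("ACGACG", "ACG", 3))  # ACG occurs twice in the first, once in the second
```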
==== MASH ====
This is an extremely fast method that uses the MinHash bottom sketch strategy for estimating the Jaccard index of the multi-sets of $k$-mers of two input sequences. That is, it estimates the ratio of $k$-mer matches to the total number of $k$-mers of the sequences. This can be used, in turn, to estimate the evolutionary distances between the compared sequences, measured as the number of substitutions per sequence position since the sequences evolved from their last common ancestor.
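The bottom-sketch estimate can be illustrated with a short Python sketch. It is not the MASH implementation: SHA-1 from the standard library stands in for the hash function, reverse complements are ignored, and the sketch size `s` and all function names are assumptions chosen for the example. The final conversion uses the Mash distance formula $D = -\frac{1}{k}\ln\frac{2j}{1+j}$, where $j$ is the estimated Jaccard index.

```python
import hashlib
import math


def hash_kmer(kmer: str) -> int:
    """64-bit hash of a k-mer (SHA-1 is a stand-in chosen for this sketch)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_sketch(seq: str, k: int, s: int) -> list[int]:
    """The s smallest hash values over the distinct k-mers of seq."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sorted(hash_kmer(w) for w in kmers)[:s]


def jaccard_estimate(sketch_a: list[int], sketch_b: list[int], s: int) -> float:
    """Estimate the Jaccard index from two bottom sketches.

    The s smallest hashes of the merged sketches approximate a random
    sample of the union; the fraction present in both sketches estimates
    the Jaccard index. Assumes both sketches are non-empty.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2j / (1 + j))."""
    if j == 0:
        return 1.0  # no shared k-mers; capping at 1.0 is a choice of this sketch
    return -math.log(2 * j / (1 + j)) / k


k, s = 3, 100
j = jaccard_estimate(bottom_sketch("ACGTAC", k, s), bottom_sketch("ACGTTT", k, s), s)
print(j, mash_distance(j, k))
```

When `s` exceeds the number of distinct $k$-mers, as in this toy call, the sketches contain every hash and the estimate equals the exact Jaccard index; on real genomes `s` is much smaller than the number of $k$-mers, which is where the speed-up comes from.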
==== Slope-Tree ====
This approach calculates a distance value between two protein sequences based on the decay of the number of $k$-mer matches as $k$ increases.
==== Slope-SpaM ====
This method calculates the number $N_{k}$ of $k$-mer or spaced-word matches (SpaM) for different values of the word length or number of match positions $k$ in the underlying pattern, respectively. The slope of an affine-linear function $F$ that depends on $N_{k}$ is calculated to estimate the Jukes-Cantor distance between the input sequences.
==== Skmer ====
Skmer calculates distances between species from unassembled sequencing reads. Similar to MASH, it uses the Jaccard index on the sets of $k$-mers from the input sequences. In contrast to MASH, the program is still accurate for low sequencing coverage, so it can be used for genome skimming.
=== Methods based on micro-alignments ===
Strictly speaking, these methods are not alignment-free. They use simple gap-free micro-alignments in which sequences are required to match at certain pre-defined positions. The characters aligned at the remaining positions of the micro-alignments, where mismatches are allowed, are then used for phylogeny inference.
==== Co-phylog ====
This method searches for so-called structures, which are defined as pairs of k-mer matches between two DNA sequences that are one position apart in both sequences. The two k-mer matches are called the context, and the position between them is called the object. Co-phylog then defines the distance between two sequences as the fraction of such structures for which the two nucleotides in the object are different. The approach can be applied to unassembled sequencing reads.
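The context/object idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the Co-phylog program: both flanking words have the same length `k`, contexts that occur with more than one object within a sequence are discarded, and the function names are hypothetical.

```python
def context_objects(seq: str, k: int) -> dict[tuple[str, str], str]:
    """Map each context (left k-mer, right k-mer) to the object, i.e. the
    single nucleotide between the two k-mers.

    Contexts seen with different objects in the same sequence are dropped;
    this filtering rule is an assumption of the sketch.
    """
    objects: dict[tuple[str, str], str] = {}
    ambiguous = set()
    # A structure spans 2k + 1 positions: k-mer, object, k-mer.
    for i in range(len(seq) - 2 * k):
        context = (seq[i:i + k], seq[i + k + 1:i + 2 * k + 1])
        obj = seq[i + k]
        if context in objects and objects[context] != obj:
            ambiguous.add(context)
        objects[context] = obj
    for context in ambiguous:
        del objects[context]
    return objects


def cophylog_distance(a: str, b: str, k: int) -> float:
    """Fraction of shared contexts whose objects differ between a and b."""
    objects_a = context_objects(a, k)
    objects_b = context_objects(b, k)
    shared = objects_a.keys() & objects_b.keys()
    if not shared:
        raise ValueError("no shared contexts; distance is undefined")
    differing = sum(1 for c in shared if objects_a[c] != objects_b[c])
    return differing / len(shared)


# Two sequences that differ by one substitution (C -> T at index 6).
print(cophylog_distance("ACGTTGCAAGGCT", "ACGTTGTAAGGCT", 2))
```

In this toy example five contexts are shared and one of them has a differing object. Note that a substitution also destroys the contexts that overlap it, which is why only a subset of positions contributes to the estimate.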
==== andi ====
andi estimates phylogenetic distances between genomic sequences based on ungapped local alignments that are flanked by maximal exact word matches. Such word matches can be found efficiently using suffix arrays. The gap-free alignments between the exact word matches are then used to estimate phylogenetic distances between genome sequences. The resulting distance estimates are accurate for up to around 0.6 substitutions per position.