| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Biclustering | 1/3 | https://en.wikipedia.org/wiki/Biclustering | reference | science, encyclopedia | 2026-05-05T14:01:29.881151+00:00 | kb-cron |
Biclustering, block clustering, co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan. Given a set of $m$ samples, each represented by an $n$-dimensional feature vector, the entire dataset can be represented as $m$ rows in $n$ columns (i.e., an $m \times n$ matrix). A biclustering algorithm generates biclusters. A bicluster is a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.
== Development ==
Biclustering was originally introduced by John A. Hartigan in 1972. The term "biclustering" was later used and refined by Boris G. Mirkin. The approach was not generalized until 2000, when Y. Cheng and George M. Church proposed a biclustering algorithm based on the mean squared residue score (MSR) and applied it to biological gene expression data.
In 2001 and 2003, I. S. Dhillon published two algorithms applying biclustering to files and words. One version was based on bipartite spectral graph partitioning. The other was based on information theory. Dhillon assumed that the loss of mutual information during biclustering was equal to the Kullback–Leibler distance (KL-distance) between P and Q, where P represents the distribution of files and feature words before biclustering and Q is the distribution after biclustering. The KL-distance measures the difference between two probability distributions: KL = 0 when the two distributions are the same, and KL increases as the difference increases. Thus, the aim of the algorithm was to find the minimum KL-distance between P and Q.
In 2004, Arindam Banerjee used a weighted Bregman distance instead of the KL-distance to design a biclustering algorithm that was suitable for any kind of matrix, unlike the KL-distance algorithm. In 2005, to cluster more than two types of objects, Bekkerman expanded the mutual information in Dhillon's theorem from a single pair into multiple pairs.
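The behavior of the KL-distance described above (zero for identical distributions, larger for more different ones) can be illustrated with a minimal sketch. The function name `kl_divergence` and the three example distributions are hypothetical and chosen only for illustration. They are flat lists, not the joint file/word distributions used in Dhillon's algorithm.

```python
import math


def kl_divergence(p, q):
    """Return KL(P || Q) for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; if q_i == 0 while p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Made-up distributions, for illustration only.
p = [0.5, 0.25, 0.25]
q_same = [0.5, 0.25, 0.25]   # identical to p
q_near = [0.4, 0.3, 0.3]     # slightly different from p
q_far = [0.1, 0.1, 0.8]      # very different from p

print(kl_divergence(p, q_same))
print(kl_divergence(p, q_near))
print(kl_divergence(p, q_far))
```

The first value is 0, and the value for `q_far` is larger than the value for `q_near`. Note that the KL-distance is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.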
== Complexity ==
The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However, the most interesting variants of this problem are NP-complete. Two cases can be distinguished. In the simple case, where every element a(i,j) of the binary matrix A is either 0 or 1, a bicluster corresponds to a biclique in the corresponding bipartite graph, and the maximum-size bicluster is equivalent to the maximum edge biclique in that graph. In the complex case, the elements of matrix A are used to compute the quality of a given bicluster and to solve a more restricted version of the problem. Solving it requires either large computational effort or the use of lossy heuristics to short-circuit the calculation.
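The binary case can be sketched as follows. Rows and columns are treated as the two vertex sets of a bipartite graph, with an edge wherever the matrix holds a 1. The function names and the example matrix are hypothetical. The exhaustive search enumerates every subset of rows, so it is only meant to show why the problem becomes expensive, not to serve as a practical algorithm.

```python
from itertools import combinations


def is_biclique(matrix, rows, cols):
    """Return True if every entry at the chosen rows and columns equals 1."""
    return all(matrix[i][j] == 1 for i in rows for j in cols)


def max_edge_biclique(matrix):
    """Exhaustively search for a biclique with the most edges.

    For each non-empty subset of rows, the largest compatible column set
    is the set of columns that hold a 1 in all chosen rows. The number of
    row subsets grows exponentially with the number of rows.
    Returns (edge_count, rows, cols).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    best = (0, (), ())
    for k in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), k):
            cols = tuple(
                j for j in range(n_cols)
                if all(matrix[i][j] == 1 for i in rows)
            )
            edges = len(rows) * len(cols)
            if edges > best[0]:
                best = (edges, rows, cols)
    return best


# Made-up binary matrix, for illustration only.
A = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]

edges, rows, cols = max_edge_biclique(A)
print(edges, rows, cols)
```

For this small matrix the search is instant, but doubling the number of rows squares the number of row subsets to inspect, which is why heuristics are used in practice.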
== Types of Biclusters ==
=== Bicluster with constant values (a) ===
When a biclustering algorithm tries to find a constant-value bicluster, it reorders the rows and columns of the matrix to group together similar rows and columns, eventually grouping biclusters with similar values. This method is sufficient when the data is normalized. A perfect constant bicluster is a matrix (I, J) in which all values a(i,j) are equal to a given constant μ. In real data, these entries a(i,j) may be represented in the form n(i,j) + μ, where n(i,j) denotes the noise. Hartigan's algorithm splits the original data matrix into a set of biclusters and uses variance to evaluate how constant each one is. Hence, a perfect bicluster may be equivalently defined as a matrix with a variance of zero. To prevent the data matrix from being partitioned into biclusters with only one row and one column, Hartigan assumes that there are, for example, K biclusters within the data matrix. When the data matrix has been partitioned into K biclusters, the algorithm ends.
=== Bicluster with constant values on rows (b) or columns (c) ===
Unlike constant-value biclusters, these types of biclusters cannot be evaluated solely based on the variance of their values. To complete the identification, the columns and the rows should be normalized first. There are, however, other algorithms that skip the normalization step and find biclusters with constant rows or columns using different approaches.
=== Bicluster with coherent values (d, e) ===
For biclusters with coherent values on rows and columns, an overall improvement over the algorithms for biclusters with constant values on rows or on columns should be considered. Such an algorithm may include an analysis of variance between groups, using the covariance between both rows and columns. In Cheng and Church's theorem, a bicluster is defined as a subset of rows and columns with almost the same score. The similarity score is used to measure the coherence of rows and columns.
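The scores mentioned for these bicluster types can be sketched in plain Python. The variance score is zero for a perfect constant bicluster. Centering each row first makes it zero for a bicluster with constant rows as well. The mean squared residue is written here in its usual form, the average of (a(i,j) − row mean − column mean + overall mean)² over the block, and is zero for additively coherent blocks. All function names and example blocks are hypothetical and serve only as illustration.

```python
def variance(block):
    """Population variance of all values in a block (list of rows)."""
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def center_rows(block):
    """Subtract each row's mean from its entries (a simple row normalization)."""
    centered = []
    for row in block:
        row_mean = sum(row) / len(row)
        centered.append([v - row_mean for v in row])
    return centered


def mean_squared_residue(block):
    """Mean squared residue of a block, using row, column and overall means."""
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [
        sum(block[i][j] for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = block[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Made-up example blocks, for illustration only.
constant = [[2, 2, 2], [2, 2, 2]]                  # all values equal
constant_rows = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]  # each row constant
additive = [[1, 2, 5], [2, 3, 6], [4, 5, 8]]       # rows differ by an offset
noisy = [[1, 2, 5], [2, 9, 6], [4, 5, 8]]          # one perturbed entry

print(variance(constant))                    # zero: perfect constant bicluster
print(variance(constant_rows))               # nonzero before normalization
print(variance(center_rows(constant_rows)))  # zero after centering the rows
print(mean_squared_residue(additive))        # (near) zero: coherent values
print(mean_squared_residue(noisy))           # clearly above zero
```

In practice a threshold on the score, rather than exact equality with zero, would decide whether a block counts as a bicluster, since real data contains noise.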
The relationship between these cluster models and other types of clustering such as correlation clustering is discussed in.