| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Computational biology | 3/4 | https://en.wikipedia.org/wiki/Computational_biology | reference | science, encyclopedia | 2026-05-05T14:02:16.270762+00:00 | kb-cron |
=== Drug discovery === A growing application of computational biology is drug discovery. For example, simulations of intracellular and intercellular signaling events, using data from proteomic or metabolomic experiments, may reduce dependence on experimentation in elucidating pharmacokinetics and pharmacodynamics of drug candidates in living organisms. Increasingly, artificial intelligence plays a central role in the drug discovery process. Using chemical structures of known pharmaceutical agents as inputs, AI models can suggest structures of lead compounds or predict novel modes of drug-protein binding. AI is also used for virtual screening of candidate molecules, avoiding the need to synthesize large numbers of molecules for screening.
== Techniques == Computational biologists use a wide range of software and algorithms to carry out their research.
=== Unsupervised learning === Unsupervised learning is a class of algorithms that find patterns in unlabeled data. One example is k-means clustering, which aims to partition n data points into k clusters such that each data point belongs to the cluster with the nearest mean. A related method is the k-medoids algorithm, which always chooses an actual data point from the set (the medoid) as each cluster center, rather than the average of the cluster's points.
The k-medoids algorithm follows these steps:
1. Randomly select k distinct data points. These are the initial cluster centers (medoids).
2. Measure the distance between each point and each of the k medoids.
3. Assign each point to the cluster of its nearest medoid.
4. Find the new center (medoid) of each cluster.
5. Repeat steps 2 to 4 until the clusters no longer change.
6. Assess the quality of the clustering by adding up the variation within each cluster.
7. Repeat the process with different values of k.
8. Pick the best value of k by finding the "elbow" in the plot of within-cluster variation against k, that is, the point beyond which increasing k yields only small reductions in variation.

One example of this in biology is the 3D mapping of a genome. Data on the HIST1 region of mouse chromosome 13 are gathered from the Gene Expression Omnibus. The data record which nuclear profiles are detected in which genomic regions. From this information, the Jaccard distance can be used to compute a normalized distance between all pairs of loci.
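The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular study. It assumes each locus is represented as the set of nuclear profiles in which it was detected, and it uses the Jaccard distance as the distance measure. The `loci` toy data, the `np…` labels, and the function names are hypothetical.

```python
import random

def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:  # two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)

def assign(points, medoids, distance):
    """Assign every point index to the cluster of its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: distance(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def k_medoids(points, k, distance, max_iter=100, seed=0):
    """Simple k-medoids sketch; assumes the points are distinct."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)       # step 1
    for _ in range(max_iter):
        clusters = assign(points, medoids, distance)  # steps 2-3
        new_medoids = []
        for m, members in clusters.items():           # step 4
            if not members:  # keep the old medoid of an empty cluster
                new_medoids.append(m)
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda c: sum(distance(points[c], points[j]) for j in members),
            ))
        if set(new_medoids) == set(medoids):          # step 5: no change
            break
        medoids = new_medoids
    clusters = assign(points, medoids, distance)
    # Step 6: total within-cluster variation (sum of distances to medoids).
    cost = sum(distance(points[i], points[m])
               for m, members in clusters.items() for i in members)
    return medoids, clusters, cost

# Hypothetical toy data: each locus is the set of nuclear profiles containing it.
loci = [
    {"np1", "np2", "np3"}, {"np1", "np2"}, {"np1", "np2", "np3", "np4"},
    {"np7", "np8", "np9"}, {"np7", "np8"}, {"np8", "np9", "np10"},
]

# Steps 7-8: run for several values of k and inspect the cost for an "elbow".
for k in range(1, 4):
    _, _, cost = k_medoids(loci, k, jaccard_distance)
    print(k, round(cost, 3))
```

Because the initial medoids are chosen at random, different seeds can converge to different local optima. In practice the procedure is often restarted several times for each k and the lowest-cost result is kept.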
=== Graph analytics === Graph analytics, or network analysis, is the study of graphs that represent connections between different objects. Graphs can represent many kinds of biological networks, such as protein-protein interaction networks, regulatory networks, and metabolic and biochemical networks. There are many ways to analyze these networks, one of which is to examine centrality. Centrality measures rank the nodes of a graph by how central or well connected they are, which helps identify the most important nodes. For example, given data on the activity of genes over a time period, degree centrality can be used to see which genes are most active throughout the network, or which genes interact with the most other genes. This contributes to the understanding of the roles certain genes play in the network. There are many ways to calculate centrality, each of which gives different information about a node's position in the graph. Centrality analysis has been applied to gene regulatory, protein interaction, and metabolic networks.
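As a small illustration of degree centrality, the sketch below counts each node's neighbors in an undirected interaction graph and normalizes by the maximum possible number of neighbors (n - 1). The gene names and edges are hypothetical placeholders, not real interaction data.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:  # self-loops are ignored in this sketch
            continue
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # A node connected to every other node gets centrality 1.0.
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Hypothetical gene-gene interactions.
interactions = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]

centrality = degree_centrality(interactions)
for gene, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(gene, round(score, 3))
```

Here `geneA` interacts with every other gene and so ranks highest. Other centrality measures, such as betweenness or closeness, would rank nodes by different criteria.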
=== Supervised learning === Supervised learning is a class of algorithms that learn from labeled data how to assign labels to new, unlabeled data. In biology, supervised learning is helpful when some data have already been categorized and further data need to be sorted into the same categories.
A common supervised learning algorithm is the random forest, which trains many decision trees and combines their outputs to classify data. The decision tree, which forms the basis of the random forest, is a structure that aims to classify, or label, a sample using certain known features of that sample. A practical biological example would be taking an individual's genetic data and predicting whether or not that individual is predisposed to develop a certain disease, such as a cancer. At each internal node the algorithm checks exactly one feature of the sample, a specific gene in the previous example, and then branches left or right based on the result. At each leaf node, the decision tree assigns a class label to the sample. In practice, the algorithm walks a specific root-to-leaf path through the decision tree based on the input sample, which results in the classification of that sample. Commonly, decision trees have target variables that take on discrete values, like yes/no, in which case the tree is referred to as a classification tree; if the target variable is continuous, it is called a regression tree. A decision tree must first be trained on a training set to identify which features are the best predictors of the target variable.
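The following sketch shows one way these ideas fit together. It is a simplified illustration, not a production implementation. It assumes binary features (for example, presence or absence of a gene variant), uses Gini impurity to choose the best feature at each node, and builds a small forest from bootstrap samples with majority voting. The gene names, labels, and function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_feature(rows, labels, candidates):
    """Pick the binary feature whose split has the lowest weighted Gini impurity."""
    def split_impurity(f):
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return min(candidates, key=split_impurity)

def build_tree(rows, labels, features, max_depth=3, rng=None):
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority
    candidates = features
    if rng is not None:  # random forest: consider a random feature subset per split
        candidates = rng.sample(features, max(1, int(len(features) ** 0.5)))
    f = best_feature(rows, labels, candidates)
    left = [(x, y) for x, y in zip(rows, labels) if x[f] == 0]
    right = [(x, y) for x, y in zip(rows, labels) if x[f] == 1]
    if not left or not right:  # the split does not separate anything
        return majority
    rest = [g for g in features if g != f]
    return {
        "feature": f,  # each internal node tests exactly one feature
        0: build_tree([x for x, _ in left], [y for _, y in left], rest, max_depth - 1, rng),
        1: build_tree([x for x, _ in right], [y for _, y in right], rest, max_depth - 1, rng),
    }

def predict_tree(tree, row):
    """Walk one root-to-leaf path; the leaf holds the class label."""
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree

def build_forest(rows, labels, features, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], features, rng=rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical toy data: the label depends only on gene_a.
features = ["gene_a", "gene_b", "gene_c"]
rows = [dict(zip(features, bits)) for bits in product([0, 1], repeat=3)]
labels = ["predisposed" if r["gene_a"] == 1 else "not predisposed" for r in rows]

tree = build_tree(rows, labels, features)
forest = build_forest(rows, labels, features)
print(tree)
print(predict_forest(forest, {"gene_a": 1, "gene_b": 0, "gene_c": 1}))
```

A single tree trained on the full toy data splits on `gene_a` alone, since that feature separates the two classes perfectly. The forest adds randomness through bootstrap sampling and per-split feature subsets, so individual trees may differ. A regression tree would replace the majority label at each leaf with an average of the target values.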
=== Using fast matrix multiplication === Some work attempts to apply fast matrix multiplication algorithms to problems in computational biology.
=== Open source software === Open source software provides a platform for computational biology where everyone can access and benefit from software developed in research. PLOS cites four main reasons for the use of open source software:
- Reproducibility: Researchers can use the exact methods that were used to calculate the relations between biological data.
- Faster development: Developers and researchers do not have to reinvent existing code for minor tasks. Instead, they can use pre-existing programs to save time on the development and implementation of larger projects.
- Increased quality: Input from multiple researchers studying the same topic makes it less likely that errors remain in the code.
- Long-term availability: Open source programs are not tied to any businesses or patents. This allows them to be posted on multiple web pages, which helps ensure that they remain available in the future.