Scrape wikipedia-science: 15617 new, 4054 updated, 20200 total (kb-cron)

2026-05-05 07:02:36 -07:00 · 2026-05-05 07:02:36 -07:00 · 2e50ba1868
commit 2e50ba1868
parent cb51731750
101 changed files with 5117 additions and 2 deletions
--- a/_index.db
+++ b/_index.db
--- a/data/en.wikipedia.org/wiki/BGZF-0.md
+++ b/data/en.wikipedia.org/wiki/BGZF-0.md
@ -0,0 +1,50 @@
+---
+title: "BGZF"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/BGZF"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:28.637413+00:00"
+instance: "kb-cron"
+---
+
+Blocked GNU Zip Format (BGZF) is a variant of the gzip file format that uses block compression, a method that compresses data in independent blocks, each of which is a valid gzip file. The design is widely used in bioinformatics for genomic data compression. Its block-based structure provides efficient storage, random access through indexed queries, and parallel processing, which makes large-scale data processing practical.
+The format was developed as part of the SAM/BAM specification and SAMtools. It is a core component of the common BAM format (the binary version of the Sequence Alignment Map format) and is also used to compress and index Variant Call Format (VCF), FASTA, and BED files. Because each block is a standard gzip block, a BGZF file can be decompressed by any standard gzip-compatible tool, ensuring backward compatibility. bgzip, a general-purpose compression utility for producing BGZF files, is distributed with the HTSlib software library.
+
+
+== Uses ==
+BGZF is widely used in bioinformatics to compress large datasets for which efficient random access is crucial. Because next-generation sequencing files such as SAM files are large, they are compressed into the binary BAM format, which uses BGZF compression.
+For random access, an index file is created for a BGZF-compressed file, typically using Tabix. The index stores the file offsets of the compressed blocks alongside the corresponding genomic coordinates, so a program can seek directly to the blocks containing the queried data, decompress only those blocks, and retrieve the requested information without processing the entire file.
+The format is also used extensively for compressing variant call files (VCF) along with their associated Tabix indexes, and similarly for other large genomic data files such as BED, GFF/GTF, and occasionally FASTQ when indexed access is necessary. A broad range of bioinformatics software can read and write BGZF-compressed files, including SAMtools, HTSlib, BCF/VCFtools, Picard tools, the GATK, and libraries such as Biopython. The standard command-line utility for creating BGZF-compressed files and their corresponding .gzi indexes is bgzip, which is distributed as part of HTSlib.
+BGZF's block-based design has also been adapted to develop more efficient data-specific compression methods and algorithms.
+
+
+== Design schema ==
+A BGZF file consists of a series of concatenated BGZF blocks. Each block, whether in its compressed or uncompressed state, is limited to a maximum size of 64 kilobytes. Each BGZF block is itself a fully compliant gzip archive, adhering to the specifications outlined in RFC 1952.
+
+Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
+
+The F.EXTRA bit in the header is set to indicate that extra fields are present.
+The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII 'BC').
+The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
+The payload of the BGZF extra field is a 16-bit unsigned integer in little-endian format. This integer gives the size of the containing BGZF block minus one.
+This block design allows use of an associated index file (storing offsets of each BGZF block) to fetch and decompress only the block of data that pertains to the query, thus avoiding the computational overhead of reading and decompressing all BGZF blocks.
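As a rough illustration of the header layout listed above, the following Python sketch reads the gzip header of one block, walks its extra subfields, and returns the total block size from the 'BC' payload. The function name `bgzf_block_size` is hypothetical. The fixed 12-byte gzip header layout is taken from RFC 1952, and the sketch assumes the whole header is already in memory.

```python
import struct


def bgzf_block_size(header: bytes) -> int:
    """Return the total size in bytes of the BGZF block that starts at header[0].

    Hypothetical helper for illustration; not part of any BGZF library.
    """
    # Fixed gzip header (RFC 1952): ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN.
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from("<BBBBIBBH", header, 0)
    if (id1, id2, cm) != (0x1F, 0x8B, 8):
        raise ValueError("not a gzip header")
    if not flg & 0x04:  # F.EXTRA bit must be set for BGZF
        raise ValueError("F.EXTRA is not set")

    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        # Each extra subfield: two ID bytes, then a 16-bit little-endian length.
        si1, si2, slen = struct.unpack_from("<BBH", header, pos)
        if (si1, si2) == (66, 67) and slen == 2:  # ASCII 'BC', two-byte payload
            (bsize,) = struct.unpack_from("<H", header, pos + 4)
            return bsize + 1  # payload stores the block size minus one
        pos += 4 + slen
    raise ValueError("no BGZF 'BC' extra subfield found")


# The 28-byte empty block that BGZF uses as its EOF marker.
sample = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)
print(bgzf_block_size(sample))  # prints 28
```

A reader that knows each block's size this way can hop from block to block without decompressing anything, which is what makes building an offset index cheap.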
+
+
+=== Random access ===
+
+
+=== EOF marker ===
+The end-of-file (EOF) marker for BGZF enables programs to detect erroneously truncated files and to generate warnings or errors for the user. The EOF marker block is an empty BGZF block (a data block of length zero) encoded with the default zlib compression level settings, and consists of the following 28 bytes, shown in hexadecimal:
+1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00
+The presence of an EOF marker block does not by itself signal the end of the file. However, an EOF marker at the end of a BGZF file indicates that the immediately following physical EOF is the end of the file as intended by the program that wrote it.
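A truncation check along these lines could be sketched in Python as follows. The helper name `has_bgzf_eof` is hypothetical, and the sketch assumes a seekable binary stream; it only compares the final 28 bytes with the marker and does not validate the rest of the file.

```python
import io
import os

# The 28-byte EOF marker block given in the text.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(stream) -> bool:
    """Return True if a seekable binary stream ends with the BGZF EOF marker."""
    stream.seek(0, os.SEEK_END)
    size = stream.tell()
    if size < len(BGZF_EOF):
        return False  # too short to contain the marker at all
    stream.seek(size - len(BGZF_EOF))
    return stream.read(len(BGZF_EOF)) == BGZF_EOF


# Usage with in-memory data; a real caller would pass open(path, "rb").
complete = io.BytesIO(b"earlier blocks" + BGZF_EOF)
truncated = io.BytesIO(b"earlier blocks" + BGZF_EOF[:-5])
print(has_bgzf_eof(complete), has_bgzf_eof(truncated))  # prints True False
```

Only the tail of the file is compared because, as noted above, a marker block appearing earlier in the file says nothing about where the file ends.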
+
+
+== See also ==
+Data compression
+SAM (file format)
+BAM (file format)
+List of file formats in biology
--- a/data/en.wikipedia.org/wiki/BIOSCI-0.md
+++ b/data/en.wikipedia.org/wiki/BIOSCI-0.md
@ -0,0 +1,27 @@
+---
+title: "BIOSCI"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/BIOSCI"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:49.406680+00:00"
+instance: "kb-cron"
+---
+
+BIOSCI, also known as Bionet, is a set of electronic communication forums used by life scientists around the world. It includes the Bionet Usenet newsgroups and parallel e-mail lists, with public archives since 1992 at www.bio.net. BIOSCI/Bionet provides public, open access biology news and discussion for areas such as molecular biology methods and reagents, bioinformatics software and computational biology, toxicology, and several organism communities including yeast, C. elegans and annelida (worms), the plant Arabidopsis, fruit fly, maize (corn), and others.
+BIOSCI/Bionet was started as part of the GenBank public biosequence database project by Intelligenetics at Stanford University in the mid-1980s, in collaboration with Martin Bishop and Michael Ashburner at the University of Cambridge. It later moved to the United Kingdom's MRC Rosalind Franklin Centre for Genomics Research (RFCGR). In 2005, with the closing of RFCGR, BIOSCI/Bionet moved to Indiana University Biology Department's IUBio Archive.
+As one of the earliest bioinformatics community projects on the Internet, GenBank acquired the bio.net domain and the Usenet hierarchy of Bionet for promoting open access communications among bioscientists, in conjunction with public biology data distribution.
+
+Michael Ashburner, co-founder of BIOSCI with Dave Kristofferson of GenBank (Intelligenetics), writes of its origins: "... in the early 1980s, Martin Bishop and I ran an email news service for a sequence analysis service that we ran on the Cambridge IBM3084Q mainframe. I was also a user of MOLGEN at Stanford, and there Dave Kristofferson ran an internal bulletin board using ANU News. We combined forces to start the Bulletin boards."
+Bionet has provided open access Internet news groups and discussion for many thousands of life scientists for 30 years. As of April 2019, discussion lists are suspended, but are archived for reading. A new supporting organization is sought to continue Bionet into its fourth decade, as Indiana University will no longer support this public service to biologists.
+The Usenet hierarchy of Bionet.* includes bionet.announce (general biology announcements), and research communities of bionet.microbiology, bionet.molbio.methds-reagnts, bionet.neuroscience, bionet.genome.arabidopsis, bionet.plants.education, bionet.drosophila, bionet.biology.computational plus 50 other active areas of discussion.
+
+
+== External links ==
+BIOSCI/Bionet
+IUBio Archive
+Dave Kristofferson on BIOSCI
+NIH support of GenBank/BIOSCI
--- a/data/en.wikipedia.org/wiki/BLOSUM-0.md
+++ b/data/en.wikipedia.org/wiki/BLOSUM-0.md
@ -0,0 +1,44 @@
+---
+title: "BLOSUM"
+chunk: 1/3
+source: "https://en.wikipedia.org/wiki/BLOSUM"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:55.384613+00:00"
+instance: "kb-cron"
+---
+
+In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families (that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.
+
+== Biological background ==
+The genetic instructions of every replicating cell in a living organism are contained within its DNA. Throughout the cell's lifetime, this information is transcribed and replicated by cellular mechanisms to produce proteins or to provide instructions for daughter cells during cell division, and the possibility exists that the DNA may be altered during these processes. This is known as a mutation. At the molecular level, there are regulatory systems that correct most — but not all — of these changes to the DNA before it is replicated.
+The functionality of a protein is highly dependent on its structure. Changing a single amino acid in a protein may reduce its ability to carry out this function, or the mutation may even change the function that the protein carries out. Changes like these may severely impact a crucial function in a cell, potentially causing the cell — and in extreme cases, the organism — to die. Conversely, the change may allow the cell to continue functioning albeit differently, and the mutation can be passed on to the organism's offspring. If this change does not result in any significant physical disadvantage to the offspring, the possibility exists that this mutation will persist within the population. The possibility also exists that the change in function becomes advantageous.
+The 20 amino acids translated by the genetic code vary greatly by the physical and chemical properties of their side chains. However, these amino acids can be categorised into groups with similar physicochemical properties. Substituting an amino acid with another from the same category is more likely to have a smaller impact on the structure and function of a protein than replacement with an amino acid from a different category.
+Sequence alignment is a fundamental research method in modern biology. The most common use of protein sequence alignment is to look for similarity between different sequences in order to infer function or establish evolutionary relationships. This helps researchers better understand the origin and function of genes through the nature of homology and conservation. Substitution matrices are used in algorithms to calculate the similarity of different protein sequences; however, the utility of the Dayhoff PAM matrix has decreased over time because it requires sequences with a similarity of more than 85%. To fill this gap, Henikoff and Henikoff introduced the BLOSUM (BLOcks SUbstitution Matrix) matrix, which led to marked improvements in alignments and in searches using queries from each of the groups of related proteins.
+
+== Terminology ==
+BLOSUM
+Blocks Substitution Matrix, a substitution matrix used for sequence alignment of proteins.
+Scoring metrics (statistical versus biological)
+When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix, or a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Scores for each position are obtained from the frequencies of substitutions in blocks of local alignments of protein sequences.
+BLOSUM r
+The matrix built from blocks with less than r% of similarity.
+E.g., BLOSUM62 is the matrix built using sequences with less than 62% similarity (sequences with ≥ 62% identity were clustered together).
+Note: BLOSUM62 is the default matrix for protein BLAST. Experimentation has shown that the BLOSUM62 matrix is among the best for detecting most weak protein similarities.
+Several sets of BLOSUM matrices exist using different alignment databases, named with numbers. BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distantly related sequences. For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for more distantly related alignments. The matrices were created by merging (clustering) all sequences that were more similar than a given percentage into one single sequence and then comparing only those sequences (which were all more divergent than the given percentage value), thus reducing the contribution of closely related sequences. The percentage used was appended to the name, giving, for example, BLOSUM80 where sequences that were more than 80% identical were clustered.
+
+== Construction of BLOSUM matrices ==
+BLOSUM matrices are obtained by using blocks of similar amino acid sequences as data, then applying statistical methods to the data to obtain the similarity scores.
+Statistical Methods Steps:
+
+=== Eliminating Sequences ===
+Eliminate the sequences that are more than r% identical. There are two ways to do this: either remove sequences from the block, or find similar sequences and replace them with new sequences that represent the cluster. Elimination removes protein sequences that are more similar than the specified threshold.
+
+=== Calculating Frequency & Probability ===
+The BLOCKS database stores the sequence alignments of the most conserved regions of protein families. These alignments are used to derive the BLOSUM matrices. Only the sequences with a percentage of identity lower than the threshold are used.
+Using each block, the pairs of amino acids in each column of the multiple alignment are counted.
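The column-wise pair counting can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function name `count_pairs` and the toy three-sequence block are hypothetical, the block is assumed to be gap-free with equal-length sequences, and the clustering and sequence weighting of the elimination step are left out.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered amino-acid pairs in every column of an ungapped block.

    `block` is a list of equal-length sequences (hypothetical toy input).
    """
    counts = Counter()
    for column in zip(*block):  # iterate over alignment columns
        for a, b in combinations(column, 2):  # every pair of sequences
            counts[tuple(sorted((a, b)))] += 1  # treat (A, S) and (S, A) alike
    return counts


toy_block = ["AC", "AC", "SC"]
pairs = count_pairs(toy_block)
print(pairs[("A", "A")], pairs[("A", "S")], pairs[("C", "C")])  # prints 1 2 3
```

Dividing each count by the total number of pairs gives an observed pair frequency, which is the quantity that enters the log odds ratio.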
+
+=== Log odds ratio ===
+The log odds ratio compares how often each amino acid combination occurs in the observed data with the expected occurrence of the pair.
+It is rounded off and used in the substitution matrix.
--- a/data/en.wikipedia.org/wiki/BLOSUM-1.md
+++ b/data/en.wikipedia.org/wiki/BLOSUM-1.md
@ -0,0 +1,267 @@
+---
+title: "BLOSUM"
+chunk: 2/3
+source: "https://en.wikipedia.org/wiki/BLOSUM"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:55.384613+00:00"
+instance: "kb-cron"
+---
+
+$$\mathrm{LogOddRatio} = 2\log_{2}\left(\frac{P(O)}{P(E)}\right)$$
+
+where $P(O)$ is the probability of observing the pair and $P(E)$ is the expected probability of such a pair occurring, given the background probabilities of each amino acid.
+
+=== BLOSUM Matrices ===
+The odds for relatedness are calculated from the log odds ratios, which are then rounded off to give the BLOSUM substitution matrices.
+
+=== Score of the BLOSUM matrices ===
+A scoring matrix, or table of values, is required to evaluate the significance of a sequence alignment; it describes the probability of a biologically meaningful amino-acid or nucleotide residue pair occurring in an alignment. Typically, when two nucleotide sequences are compared, all that is scored is whether or not the two bases at a given position are the same. All matches receive the same score and all mismatches receive the same score (typically +1 or +5 for matches, and -1 or -4 for mismatches). Proteins are different. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another. The objective is to provide a relatively heavy penalty for aligning two residues together if they have a low probability of being homologous (correctly aligned by evolutionary descent). Two major forces drive the amino-acid substitution rates away from uniformity: substitutions occur with different frequencies, and some substitutions are less functionally tolerated than others. Thus, substitutions are selected against.
+Commonly used substitution matrices include the blocks substitution (BLOSUM) and point accepted mutation (PAM) matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods.
+Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm of the ratio between the likelihood of two amino acids appearing together for a biological reason and the likelihood of the same amino acids appearing together by chance. The matrices are based on the minimum percentage identity of the aligned protein sequences used in calculating them. Every possible identity or substitution is assigned a score based on its observed frequencies in the alignment of related proteins. A positive score is given to the more likely substitutions while a negative score is given to the less likely substitutions.
+To calculate a BLOSUM matrix, the following equation is used: 
+
+$$S_{ij} = \frac{1}{\lambda}\log\frac{p_{ij}}{q_{i}q_{j}}$$
+
+Here, $p_{ij}$ is the probability of two amino acids $i$ and $j$ replacing each other in a homologous sequence, and $q_{i}$ and $q_{j}$ are the background probabilities of finding the amino acids $i$ and $j$ in any protein sequence. The factor $\lambda$ is a scaling factor, set such that the matrix contains easily computable integer values.
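To make the scoring formula concrete, here is a small Python sketch. The function name `log_odds_score` and all probabilities are made up for illustration and are not taken from the BLOCKS database. It assumes a natural logarithm and λ = ln(2)/2, which makes the result equal to 2·log2 of the ratio, i.e. a score in half-bit units.

```python
import math

# Assumption: lambda = ln(2)/2, so (1/lambda) * ln(x) equals 2 * log2(x).
HALF_BIT_LAMBDA = math.log(2) / 2


def log_odds_score(p_ij: float, q_i: float, q_j: float,
                   lam: float = HALF_BIT_LAMBDA) -> int:
    """Rounded S_ij = (1/lambda) * log(p_ij / (q_i * q_j)).

    Hypothetical helper; the inputs below are invented example values.
    """
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Pair seen twice as often as chance predicts -> positive score.
print(log_odds_score(0.02, 0.1, 0.1))   # prints 2
# Pair seen exactly as often as chance predicts -> zero.
print(log_odds_score(0.01, 0.1, 0.1))   # prints 0
# Pair seen half as often as chance predicts -> negative score.
print(log_odds_score(0.005, 0.1, 0.1))  # prints -2
```

The three cases show the sign convention: a positive score marks a pair aligned more often than chance predicts, zero marks a chance-level pair, and a negative score marks a disfavoured substitution.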
+
+== Variants ==
+
+=== BLOSUM ===
+BLOSUM80: more related proteins
+BLOSUM62: midrange
+BLOSUM45: distantly related proteins
+The BLOSUM62 matrix with the amino acids in the table grouped according to the chemistry of the side chain, as in (a). Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, by the probability that the same two amino acids might align by chance. The ratio is then converted to a logarithm and expressed as a log odds score, as for PAM. BLOSUM matrices are usually scaled in half-bit units. A score of zero indicates that the frequency with which a given two amino acids were found aligned in the database was as expected by chance, while a positive score indicates that the alignment was found more often than by chance, and a negative score indicates that the alignment was found less often than by chance.
+
+=== PMB ===
+PMB (Probability Matrix from Blocks) of 2004 uses the additivity of evolutionary distances to improve on BLOSUM's analysis of the BLOCKS database. The up-to-date 2001 version of BLOCKS was used to generate a new set of BLOSUM matrices. The "observed substitution frequencies" found in these BLOSUM matrices are used to estimate actual substitution frequencies (with higher evolutionary distance, i.e. lower r, some later replacement can mask earlier replacements). PMB thus defines a true evolutionary model like PAM and JTT do. It is not a symmetric matrix.
+
+=== RBLOSUM ===
+The original code written by Henikoff and Henikoff does not exactly act according to their paper's description of the algorithm. The BLOSUM62 from that program has been used for many years as standard. Surprisingly, the miscalculated BLOSUM62 improves search performance compared to the 2008 corrected version of the same relative entropy (RBLOSUM64).
+A 2018 article claims that RBLOSUM is better than BLOSUM and CorBLOSUM.
+
+=== CorBLOSUM ===
+A 2016 paper finds further errors in the original code not addressed by the 2008 RBLOSUM correction. The corrected version from this paper, CorBLOSUM, manages to be more effective than BLOSUM at similarity search in about 75% of cases.
+
+== Some uses in bioinformatics ==
+
+=== Research applications ===
+BLOSUM scores were used to predict and understand the surface gene variants among hepatitis B virus carriers and T-cell epitopes.
+
+==== Surface gene variants among hepatitis B virus carriers ====
+DNA sequences of HBsAg were obtained from 180 patients, of whom 51 were chronic HBV carriers and 129 were newly diagnosed, and compared with consensus sequences built with 168 HBV sequences imported from GenBank. Literature review and BLOSUM scores were used to define potentially altered antigenicity.
--- a/data/en.wikipedia.org/wiki/BLOSUM-2.md
+++ b/data/en.wikipedia.org/wiki/BLOSUM-2.md
@ -0,0 +1,47 @@
+---
+title: "BLOSUM"
+chunk: 3/3
+source: "https://en.wikipedia.org/wiki/BLOSUM"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:55.384613+00:00"
+instance: "kb-cron"
+---
+
+==== Reliable prediction of T-cell epitopes ====
+A novel input representation has been developed consisting of a combination of sparse encoding, BLOSUM encoding, and input derived from hidden Markov models. This method predicts T-cell epitopes for the genome of hepatitis C virus, and its possible applications in guiding rational vaccine design have been discussed.
+
+=== Use in BLAST ===
+BLOSUM matrices are also used as a scoring matrix when comparing DNA sequences or protein sequences to judge the quality of the alignment. This form of scoring system is utilized by a wide range of alignment software including BLAST.
+
+==== Comparing PAM and BLOSUM ====
+In addition to BLOSUM matrices, a previously developed scoring matrix, known as a PAM, can be used. The two provide the same kind of scoring information but use differing methodologies. BLOSUM looks directly at mutations in motifs of related sequences, while PAMs extrapolate evolutionary information based on closely related sequences.
+Since PAM and BLOSUM are different methods for showing the same scoring information, the two can be compared; but because the scores are obtained so differently, a PAM100 does not equal a BLOSUM100.
+
+===== The relationship between PAM and BLOSUM =====
+
+===== The differences between PAM and BLOSUM =====
+
+=== Availability ===
+The "reference" version of BLOSUM is found in the NCBI toolkits. Both the older (deprecated) NCBI C Toolkit and the current NCBI C++ Toolkit provide the BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and BLOSUM90 matrices. Both also offer APIs for making use of the matrices.
+The original source code for calculating BLOSUM is also found on the NCBI website, at https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/. This archive "blosum.tar.Z" represents the original miscalculated version with improved search performance from 1992. The archive also contains pre-calculated BLOSUM outputs at the following similarity levels: "-2" (blosumn), 30, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 95, and 100.
+
+==== Software Packages ====
+There are several software packages in different programming languages that allow easy use of BLOSUM matrices. Besides the aforementioned NCBI Toolkits, there are:
+
+blosum module for Python
+BioJava library for Java
+... and many more.
+
+== See also ==
+Sequence alignment
+Point accepted mutation
+
+== External links ==
+Sean R. Eddy (2004). "Where did the BLOSUM62 alignment score matrix come from?". Nature Biotechnology. 22 (8): 1035–6. doi:10.1038/nbt0804-1035. PMID 15286655. S2CID 205269887.
+BLOCKS WWW server
+Scoring systems for BLAST at NCBI
+Data files of matrices including BLOSUM30–100 on the NCBI FTP server.
+Interactive BLOSUM Network Visualization Archived 30 January 2017 at the Wayback Machine
--- a/data/en.wikipedia.org/wiki/Benjamin_Franklin_Award_(Bioinformatics)-0.md
+++ b/data/en.wikipedia.org/wiki/Benjamin_Franklin_Award_(Bioinformatics)-0.md
@ -0,0 +1,44 @@
+---
+title: "Benjamin Franklin Award (Bioinformatics)"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Benjamin_Franklin_Award_(Bioinformatics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:27.481267+00:00"
+instance: "kb-cron"
+---
+
+The Benjamin Franklin Award is an annual award for Open Access in the Life Sciences presented by Bioinformatics.org to an individual who has, in his or her practice, promoted free and open access to the materials and methods used in the life sciences.
+
+
+== Laureates ==
+Source: bioinformatics.org
+
+2002 - Michael B. Eisen
+2003 - Jim Kent
+2004 - Lincoln D. Stein
+2005 - Ewan Birney
+2006 - Michael Ashburner
+2007 - Sean Eddy
+2008 - Robert Gentleman
+2009 - Philip E. Bourne
+2010 - Alex Bateman
+2011 - Jonathan Eisen
+2012 - Heng Li
+2013 - Steven Salzberg
+2014 - Helen M. Berman
+2015 - Owen White
+2016 - Benjamin Langmead
+2017 - Rafael Irizarry
+2018 - Desmond G. Higgins
+2019 - Eugene Koonin
+2020 - Xiaole Shirley Liu
+
+
+== See also ==
+Awards in Bioinformatics and Computational Biology
+List of biology awards
+Prizes named after people
--- a/data/en.wikipedia.org/wiki/Biclustering-0.md
+++ b/data/en.wikipedia.org/wiki/Biclustering-0.md
@ -0,0 +1,80 @@
+---
+title: "Biclustering"
+chunk: 1/3
+source: "https://en.wikipedia.org/wiki/Biclustering"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:29.881151+00:00"
+instance: "kb-cron"
+---
+
+Biclustering, block clustering, co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix.
+The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan.
+Given a set of $m$ samples represented by an $n$-dimensional feature vector, the entire dataset can be represented as $m$ rows in $n$ columns (i.e., an $m \times n$ matrix). The Biclustering algorithm generates Biclusters. A Bicluster is a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.
+
+== Development ==
+Biclustering was originally introduced by John A. Hartigan in 1972. The term "Biclustering" was then later used and refined by Boris G. Mirkin. This algorithm was not generalized until 2000, when Y. Cheng and George M. Church proposed a biclustering algorithm based on the mean squared residue score (MSR) and applied it to biological gene expression data.
+In 2001 and 2003, I. S. Dhillon published two algorithms applying biclustering to files and words. One version was based on bipartite spectral graph partitioning. The other was based on information theory. Dhillon assumed the loss of mutual information during biclustering was equal to the Kullback–Leibler-distance (KL-distance) between P and Q. P represents the distribution of files and feature words before Biclustering, while Q is the distribution after Biclustering. KL-distance is for measuring the difference between two random distributions. KL = 0 when the two distributions are the same and KL increases as the difference increases. Thus, the aim of the algorithm was to find the minimum KL-distance between P and Q. In 2004, Arindam Banerjee used a weighted-Bregman distance instead of KL-distance to design a Biclustering algorithm that was suitable for any kind of matrix, unlike the KL-distance algorithm.
+To cluster more than two types of objects, in 2005, Bekkerman expanded the mutual information in Dhillon's theorem from a single pair into multiple pairs.
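The KL-distance behaviour described above (zero for identical distributions, growing as they diverge) can be checked with a short Python sketch. The function name `kl_distance` and the example distributions are hypothetical; the sketch assumes discrete distributions given as equal-length lists, with Q positive wherever P is positive.

```python
import math


def kl_distance(p, q):
    """Kullback-Leibler distance sum(p_i * ln(p_i / q_i)) between two
    discrete distributions given as equal-length lists of probabilities.

    Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


before = [0.5, 0.5]                     # hypothetical distribution P
print(kl_distance(before, [0.5, 0.5]))  # prints 0.0
# A closer Q gives a smaller distance than a more distorted one.
print(kl_distance(before, [0.6, 0.4]) < kl_distance(before, [0.9, 0.1]))  # prints True
```

In the setting described above, P would be the joint distribution before Biclustering and Q the one after it, and the algorithm would look for the Biclustering that keeps this value as small as possible.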
+
+== Complexity ==
+The complexity of the Biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given Bicluster. However, the most interesting variants of this problem are NP-complete. Two cases can be distinguished. In the simple case, where every element a(i,j) of the binary matrix A is either 0 or 1, a Bicluster is equal to a biclique in the corresponding bipartite graph. The maximum size Bicluster is equivalent to the maximum edge biclique in the bipartite graph. In the complex case, the elements of matrix A are used to compute the quality of a given Bicluster and solve the more restricted version of the problem. It requires either large computational effort or the use of lossy heuristics to short-circuit the calculation.
+
+== Types of Biclusters ==
+Bicluster with constant values (a)
+When a Biclustering algorithm tries to find a constant-value Bicluster, it reorders the rows and columns of the matrix to group together similar rows and columns, eventually grouping Biclusters with similar values. This method is sufficient when the data is normalized. 
+A perfect constant Bicluster is a matrix(I,J) in which all values a(i,j) are equal to a given constant μ. In tangible data, these entries a(i,j) may be represented with the form n(i,j) + μ where n(i,j) denotes the noise. According to Hartigan's algorithm, by splitting the original data matrix into a set of Biclusters, variance is used to compute constant Biclusters. Hence, a perfect Bicluster may be equivalently defined as a matrix with a variance of zero. To prevent the data matrix from being partitioned into Biclusters with only one row and one column, Hartigan assumes that there are, for example, K Biclusters within the data matrix. When the data matrix is partitioned into K Biclusters, the algorithm ends.
+Bicluster with constant values on rows (b) or columns (c)
+Unlike the constant-value Biclusters, these types of Biclusters cannot be evaluated solely based on the variance of their values. To complete the identification, the columns and the rows should be normalized first. There are, however, other algorithms that use different approaches and find Biclusters with constant rows or columns without the normalization step.
+Bicluster with coherent values (d, e)
+For Biclusters with coherent values on rows and columns, an overall improvement over the algorithms for Biclusters with constant values on rows or on columns should be considered. This algorithm may contain analysis of variance between groups, using co-variance between both rows and columns. In Cheng and Church's theorem, a Bicluster is defined as a subset of rows and columns with almost the same score. The similarity score is used to measure the coherence of rows and columns.
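+The similarity score of Cheng and Church is commonly written as the mean squared residue. The sketch below uses notation introduced here for illustration: a_{iJ} is the mean of row i, a_{Ij} the mean of column j, and a_{IJ} the mean of the whole Bicluster.
+
+```latex
+H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J}
+         \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^2
+
+% Example: the additive 2x2 block with rows (1, 2) and (3, 4)
+%   row means    a_{1J} = 1.5, a_{2J} = 3.5
+%   column means a_{I1} = 2,   a_{I2} = 3
+%   overall mean a_{IJ} = 2.5
+%   residue of a_{11}: 1 - 1.5 - 2 + 2.5 = 0 (likewise for the other entries)
+%   hence H(I,J) = 0: the rows differ only by a constant shift,
+%   so the block is perfectly coherent even though its values are not constant.
+```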
+ 
+
+The relationship between these cluster models and other types of clustering such as correlation clustering is discussed in.
--- a/data/en.wikipedia.org/wiki/Biclustering-1.md
+++ b/data/en.wikipedia.org/wiki/Biclustering-1.md
@ -0,0 +1,41 @@
+---
+title: "Biclustering"
+chunk: 2/3
+source: "https://en.wikipedia.org/wiki/Biclustering"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:29.881151+00:00"
+instance: "kb-cron"
+---
+
+== Algorithms ==
+There are many Biclustering algorithms developed for bioinformatics, including: block clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving submatrixes), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis), Robust Biclustering Algorithm (RoBA), Crossing Minimization, cMonkey, PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm), BIMAX, ISA, FABIA (Factor analysis for Bicluster Acquisition), runibic,
+and the recently proposed hybrid method EBIC (evolutionary-based Biclustering), which was shown to detect multiple patterns with very high accuracy. More recently, IMMD-CC was proposed, based on the iterative complexity reduction concept. IMMD-CC is able to identify co-cluster centroids from a highly sparse transformation obtained by iterative multi-mode discretization.
+Biclustering algorithms have also been proposed and used in other application fields under the names co-clustering, bi-dimensional clustering, and subspace clustering.
+Given the known importance of discovering local patterns in time-series data, recent proposals have addressed the Biclustering problem in the specific case of time-series gene expression data. In this case, the interesting Biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the development of efficient exhaustive enumeration algorithms such as CCC-Biclustering and e-CCC-Biclustering. 
+The approximate patterns in CCC-Biclustering algorithms allow a given number of errors, per gene, relative to an expression profile representing the expression pattern in the Bicluster. The e-CCC-Biclustering algorithm uses approximate expressions to find and report all maximal CCC-Biclusters using a discretized matrix A and efficient string processing techniques.
+These algorithms find and report all maximal Biclusters with coherent and contiguous columns with perfect/approximate expression patterns, in time linear/polynomial in the size of the time-series gene expression matrix. This is achieved by manipulating a discretized version of the original expression matrix using efficient string processing techniques based on suffix trees. These algorithms are also applied to solve problems and sketch the analysis of computational complexity.
+Some recent algorithms have attempted to include additional support for Biclustering rectangular matrices in the form of other datatypes, including cMonkey.
+There is an ongoing debate about how to judge the results of these methods, as Biclustering allows overlap between clusters and some algorithms allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic and the analyst must pay attention to the degree to which results represent stable minima. Because this is an unsupervised classification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple Biclustering algorithms, with the majority or super-majority voting amongst them to decide the best result. Another way is to analyze the quality of shifting and scaling patterns in Biclusters. Biclustering has been used in the domain of text mining (or classification), where it is popularly known as co-clustering. Text corpora are represented in vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements Dij denote the occurrence of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words (columns).
+Text clustering can address the high-dimensional sparsity problem by clustering texts and words at the same time. When clustering texts, one needs to consider not only the word information but also the information of the word clusters composed of those words. Then, according to the similarity of feature words in the texts, the feature words are eventually clustered as well. This is called co-clustering. Co-clustering has two advantages. First, clustering the texts based on word clusters can greatly decrease the dimension of the clustering and provides an appropriate way to measure the distance between texts. Second, it mines more useful information, yielding the correspondence between text clusters and word clusters. This correspondence can be used to describe the types of texts and words; at the same time, the result of word clustering can also be used in text mining and information retrieval.
+Several approaches have been proposed based on the information contents of the resulting blocks: matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the regenerated matrices from the decomposition is minimized.  Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of groups d1 and d2.
+More recently, Bisson and Hussain proposed a new approach that uses the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) is based on finding document-document similarity and word-word similarity, and then using classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also the documents in which "similar" words occur. The idea here is that two documents about the same topic do not necessarily use the same set of words to describe it, but a subset of the words and other similar words that are characteristic of that topic. This approach of taking higher-order similarities takes the latent semantic structure of the whole corpus into consideration, with the result of generating a better clustering of the documents and words.
+In text databases, for a document collection defined by a document-by-term matrix D (of size m by n, where m is the number of documents and n is the number of terms), the cover-coefficient-based clustering methodology yields the same number of clusters for both documents and terms (words) using a double-stage probability experiment. According to the cover coefficient concept, the number of clusters can also be roughly estimated by the formula (m × n)/t, where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element.
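+As a worked illustration of the estimate (m × n)/t, consider a hypothetical collection; the sizes below are invented for the example and do not come from any cited dataset.
+
+```latex
+% Assumed sizes: m = 1000 documents, n = 5000 terms,
+% t = 50000 non-zero entries in D (about 50 distinct terms per document).
+\frac{m \times n}{t} = \frac{1000 \times 5000}{50000} = \frac{5\,000\,000}{50\,000} = 100
+
+% The estimate is therefore roughly 100 clusters.
+% A denser matrix (larger t) gives fewer clusters; a sparser one gives more.
+```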
+In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA utilizes well-understood model selection techniques, such as variational approaches, and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each Bicluster in order to separate spurious Biclusters from true Biclusters.
--- a/data/en.wikipedia.org/wiki/Biclustering-2.md
+++ b/data/en.wikipedia.org/wiki/Biclustering-2.md
@ -0,0 +1,27 @@
+---
+title: "Biclustering"
+chunk: 3/3
+source: "https://en.wikipedia.org/wiki/Biclustering"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:29.881151+00:00"
+instance: "kb-cron"
+---
+
+== See also ==
+Formal concept analysis
+Biclique
+Galois connection
+
+== References ==
+
+=== Others ===
+N.K. Verma, S. Bajpai, A. Singh, A. Nagrare, S. Meena, Yan Cui, "A Comparison of Biclustering Algorithms", in International Conference on Systems in Medicine and Biology (ICSMB 2010), IIT Kharagpur, India, pp. 90–97, Dec. 16–18.
+J. Gupta, S. Singh and N.K. Verma, "MTBA: MATLAB Toolbox for Biclustering Analysis", IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions, IIT Kanpur, India, pp. 148–152, Jul. 2013.
+A. Tanay, R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", in Handbook of Computational Molecular Biology, edited by Srinivas Aluru, Chapman (2004)
+Kluger Y, Basri R, Chang JT, Gerstein MB (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions". Genome Research. 13 (4): 703–716. doi:10.1101/gr.648603. PMC 430175. PMID 12671006.
+Adetayo Kasim, Ziv Shkedy, Sebastian Kaiser, Sepp Hochreiter, Willem Talloen (2016), Applied Biclustering Methods for Big and High-Dimensional Data Using R, Chapman & Hall/CRC Press
+Orzechowski, P., Sipper, M., Huang, X., & Moore, J. H. (2018). EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery. Bioinformatics.
+
+== External links ==
+FABIA: Factor Analysis for Bicluster Acquisition, an R package —software
--- a/data/en.wikipedia.org/wiki/BioCreative-0.md
+++ b/data/en.wikipedia.org/wiki/BioCreative-0.md
@ -0,0 +1,47 @@
+---
+title: "BioCreative"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/BioCreative"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:32.321686+00:00"
+instance: "kb-cron"
+---
+
+BioCreAtIvE (A critical assessment of text mining methods in molecular biology) is a community-wide effort for evaluating information extraction and text mining developments in the biological domain. 
+It was preceded by the Knowledge Discovery and Data Mining (KDD) Challenge Cup for detection of gene mentions.
+
+
+== Community Challenges ==
+
+
+=== First edition (2004-2005) ===
+Three main tasks were posed at the first BioCreAtIvE challenge: the entity extraction task, the gene name normalization task, and the functional annotation of gene products task. The data sets produced by this contest serve as a Gold Standard training and test set to evaluate and train Bio-NER tools and annotation extraction tools.
+
+
+=== Second edition (2006-2007) ===
+The second BioCreAtIvE challenge (2006-2007) also had three tasks: detection of gene mentions, extraction of unique identifiers for genes, and extraction of information related to physical protein-protein interactions. It drew participation from 44 teams from 13 countries.
+
+
+=== Third edition (2011-2012) ===
+The third edition of BioCreative included for the first time the InterActive Task (IAT), designed to evaluate the practical usability of text mining tools in real-world biocuration tasks.
+
+
+=== Fifth edition (2016) ===
+BioCreative V had 5 different tracks, including an interactive task (IAT) for usability of text mining systems and a track using the BioC format for curating information for BioGRID.
+
+
+== See also ==
+
+Biocuration
+
+
+== References ==
+
+
+== External links ==
+BioCreAtIve, 2007-2015
+BioCreAtIve 2, 2006-2007
+First BioCreAtIvE workshop, 2004
+BMC Bioinformatics special issue : BioCreAtIvE
+First BioCreAtIvE data download request
--- a/data/en.wikipedia.org/wiki/BioMOBY-0.md
+++ b/data/en.wikipedia.org/wiki/BioMOBY-0.md
@ -0,0 +1,63 @@
+---
+title: "BioMOBY"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/BioMOBY"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:45.666985+00:00"
+instance: "kb-cron"
+---
+
+BioMOBY is a registry of web services used in bioinformatics. It allows interoperability between biological data hosts and analytical services by annotating services with terms taken from standard ontologies. BioMOBY is released under the Artistic License.
+
+
+== The BioMOBY project ==
+The BioMoby project began at the Model Organism Bring Your own Database Interface Conference (MOBY-DIC), held in Emma Lake, Saskatchewan on September 21, 2001.  It stemmed from a conversation between Mark D Wilkinson and Suzanna Lewis during a Gene Ontology developers meeting at the Carnegie Institute, Stanford, where the functionalities of the Genquire and Apollo genome annotation tools were being discussed and compared.  The lack of a simple standard that would allow these tools to interact with the myriad of data-sources required to accurately annotate a genome was a critical need of both systems.
+Funding for the BioMOBY project was subsequently provided by Genome Prairie (2002-2005) and Genome Alberta (2005-date), in part through Genome Canada, a not-for-profit institution leading the Canadian X-omic initiatives.
+There are two main branches of the BioMOBY project.  One is a web-service-based approach, while the other utilizes Semantic Web technologies.  This article will refer only to the Web Service specifications.  The other branch of the project, Semantic Moby, is described in a separate entry.
+
+
+== Moby ==
+The Moby project defines three Ontologies that describe biological data-types, biological data-formats, and bioinformatics analysis types.  Most of the interoperable behaviours seen in Moby are achieved through the Object (data-format) and Namespace (data-type) ontologies.
+The MOBY Namespace Ontology is derived from the Cross-Reference Abbreviations List of the Gene Ontology project.  It is simply a list of abbreviations for the different types of identifiers that are used in bioinformatics.  For example, Genbank has "gi" identifiers that are used to enumerate all of their sequence records - this is defined as "NCBI_gi" in the Namespace Ontology.
+The MOBY Object Ontology is an ontology consisting of IS-A, HAS-A, and HAS relationships between data formats.  For example, a DNASequence IS-A GenericSequence and HAS-A String representing the text of the sequence.  All data in Moby must be represented as some type of MOBY Object.  An XML serialization of this ontology is defined in the Moby API such that any given ontology node has a predictable XML structure.
+Thus, between these two ontologies, a service provider and/or a client program can receive a piece of Moby XML, and immediately know both its structure, and its "intent" (semantics).
+The final core component of Moby is the MOBY Central web service registry.  MOBY Central is aware of the Object, Namespace and Service ontologies, and thus can match consumers who have in-hand Moby data, with service providers who claim to consume that data-type (or some compatible ontological data-type) or to perform a particular operation on it.  This "semantic matching" helps ensure that only relevant service providers are identified in a registry query, and moreover, ensures that the in-hand data can be passed to that service provider verbatim.  As such, the interaction between a consumer and a service provider can be partially or fully automated, as shown in the Gbrowse Moby and Ahab clients respectively.
+
+
+== BioMOBY and RDF/OWL ==
+BioMOBY does not, for its core operations, utilize the RDF or OWL standards from the W3C.  This is in part because neither of these standards was stable in 2001, when the project began, and in part because the library support for these standards was not "commodity" in any of the most common languages (i.e. Perl and Java) at that time.
+Nevertheless, the BioMOBY system exhibits what can only be described as Semantic Web-like behaviours.  The BioMOBY Object Ontology controls the valid data structures in exactly the same way as an OWL ontology defines an RDF data instance.  BioMOBY Web Services consume and generate BioMOBY XML, the structure of which is defined by the BioMOBY Object Ontology.  As such, BioMOBY Web Services have been acting as prototypical Semantic Web Services since 2001, despite not using the eventual RDF/OWL standards.
+However, BioMOBY does utilize the RDF/OWL standards, as of 2006, for the description of its Objects, Namespaces, Service, and Registry.  Increasingly these ontologies are being used to govern the behaviour of all BioMOBY functions using DL reasoners.
+
+
+== BioMOBY clients ==
+There are several client applications that can search and browse the BioMOBY registry of services. One of the most popular is the Taverna workbench built as part of the MyGrid project.  The first BioMOBY client was Gbrowse Moby, written in 2001 to allow access to the prototype version of BioMoby Services.  Gbrowse Moby, in addition to being a BioMoby browser, now works in tandem with the Taverna workbench to create SCUFL workflows reflecting the Gbrowse Moby browsing session that can then be run in a high-throughput environment.  The Seahawk applet also provides the ability to export a session history as a Taverna workflow, which constitutes a programming-by-example functionality.
+The Ahab client is a fully automated data mining tool. Given a starting point, it will discover, and execute, every possible BioMOBY service and provide the results in a clickable interface.
+
+
+== See also ==
+Open Bioinformatics Foundation
+SADI the Semantic Automated Discovery and Integration Framework
+
+
+== References ==
+
+
+== Further reading ==
+Gordon, Paul MK; Sensen, Christoph W. (18 June 2007). "Seahawk: moving beyond HTML in Web-based bioinformatics analysis". BMC Bioinformatics. 8 (1): 208. doi:10.1186/1471-2105-8-208. ISSN 1471-2105. PMC 1906838. PMID 17577405.
+
+
+== External links ==
+Official BioMOBY website
+Publications about BioMOBY tagged using Connotea
+Emma Lake
+Gene Ontology
+Mark D Wilkinson
+Genome Alberta
+Genome Canada
+Genome Prairie
+Namespaces
+Objects
+Services
+Namespace Ontology
--- a/data/en.wikipedia.org/wiki/BioPAX-0.md
+++ b/data/en.wikipedia.org/wiki/BioPAX-0.md
@ -0,0 +1,69 @@
+---
+title: "BioPAX"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/BioPAX"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:46.894369+00:00"
+instance: "kb-cron"
+---
+
+BioPAX (Biological Pathway Exchange) is an RDF/OWL-based
+standard language to represent biological pathways at the molecular and cellular level. Its major use is to facilitate the exchange of pathway data. Pathway data captures our understanding of biological processes, but
+its rapid growth necessitates development of databases and computational tools to aid interpretation. However, the current fragmentation of pathway information across many
+databases with incompatible formats presents barriers to its effective use. BioPAX solves this
+problem by making pathway data substantially easier to collect, index, interpret and share.
+BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and
+gene regulation networks. BioPAX was created through a community process. Through BioPAX, millions of interactions organized into thousands of pathways across many organisms, from a
+growing number of sources, are available. Thus, large amounts of pathway data are available in a
+computable form to support visualization, analysis and biological discovery.
+It is supported by a variety of online databases (e.g. Reactome) and tools. The latest released version is BioPAX Level 3. There is also an effort to create a version of BioPAX as part of OBO.
+
+
+== Governance and development ==
+The next version of BioPAX, Level 4, is being developed by a community of researchers. Development is coordinated by the board of editors and facilitated by various BioPAX work groups.
+Systems Biology Pathway Exchange (SBPAX) is an extension for Level 3 and proposal for Level 4 to add quantitative data and systems biology terms (such as Systems Biology Ontology). SBPAX export has been implemented by the pathway databases Signaling Gateway Molecule Pages, and the SABIO-Reaction Kinetics Database. SBPAX import has been implemented by the cellular modeling framework Virtual Cell.
+Other proposals for Level 4 include improved support for Semantic Web, validation and visualization.
+
+
+== Databases with BioPAX Export ==
+Online databases offering BioPAX export include:
+
+Signaling Gateway Molecule Pages (SGMP)
+Reactome
+BioCyc
+INOH
+BioModels
+Nature/NCI Pathway Interaction Database
+Cancer Cell Map
+Pathway Commons
+Netpath - A curated resource of signal transduction pathways in humans
+ConsensusPathDB - A database integrating human functional interaction networks
+PANTHER (List of Pathways)
+WikiPathways
+PharmGKB/PharmGKB*
+
+
+== Software ==
+Software supporting BioPAX includes:
+
+Paxtools, a Java API for handling BioPAX files
+Systems Biology Linker (Sybil), an application for visualizing BioPAX and converting BioPAX to SBML, as part of the Virtual Cell.
+ChiBE (Chisio BioPAX Editor), an application for visualizing and editing BioPAX.
+BioPAX Validator - syntax and semantic rules and best practices (project wiki)
+Cytoscape includes a BioPAX reader and other extensions, such as the PathwayCommons plugin and the CyPath2 app.
+BiNoM, a Cytoscape plugin for network analysis, with functions to import and export BioPAX Level 3 files.
+BioPAX-pattern, a Java API for defining and searching graph patterns in BioPAX files.
+
+
+== See also ==
+SBML
+Linked Open Vocabularies
+
+
+== References ==
+
+
+== External links ==
+BioPAX homepage
+BioPAX Sourceforge Wiki
--- a/data/en.wikipedia.org/wiki/BioSimGrid-0.md
+++ b/data/en.wikipedia.org/wiki/BioSimGrid-0.md
@ -0,0 +1,23 @@
+---
+title: "BioSimGrid"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/BioSimGrid"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:50.560245+00:00"
+instance: "kb-cron"
+---
+
+BioSimGrid was a project to make data sets from computer simulations in the field of modelling biological systems, particularly biomolecular structures, more accessible to researchers. The project began in 2004 and halted by 2009.
+In 2004 designers presented the concept of the project as a "protein data bank extended in time". Other developers presented a web portal for accessing data in the project. Other designers described the project as efficient.
+A review in 2006 described how BioSimGrid was a model project for making data from research more open.
+BioSimGrid contributors accepted a grant from the National Institutes of Health in 2007.
+
+
+== References ==
+
+
+== External links ==
+Official website
+biosimgrid.org is now defunct and Archived 17 July 2009 at the Wayback Machine
+file archive at SourceForge
--- a/data/en.wikipedia.org/wiki/Biochip-0.md
+++ b/data/en.wikipedia.org/wiki/Biochip-0.md
@ -0,0 +1,27 @@
+---
+title: "Biochip"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Biochip"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:31.098536+00:00"
+instance: "kb-cron"
+---
+
+In molecular biology, biochips are engineered substrates ("miniaturized laboratories") that can host large numbers of simultaneous biochemical reactions. One of the goals of biochip technology is to efficiently screen large numbers of biological analytes, with potential applications ranging from disease diagnosis to detection of bioterrorism agents. For example, digital microfluidic biochips are under investigation for applications in biomedical fields. In a digital microfluidic biochip, a group of (adjacent) cells in the microfluidic array can be configured to act as storage, to perform functional operations, or to transport fluid droplets dynamically.
+
+== History ==
+The development started with early work on the underlying sensor technology. One of the first portable, chemistry-based sensors was the glass pH electrode, invented in 1922 by Hughes.  The basic concept of using exchange sites to create permselective membranes was used to develop other ion sensors in subsequent years. For example, a K+ sensor was produced by incorporating valinomycin into a thin membrane.
+In 1953, Watson and Crick announced their discovery of the now familiar double helix structure of DNA molecules and set the stage for genetics research that continues to the present day. The development of sequencing techniques in 1977 by Gilbert and Sanger (working separately) enabled researchers to directly read the genetic codes that provide instructions for protein synthesis. This research showed how hybridization of complementary single oligonucleotide strands could be used as a basis for DNA sensing. Two additional developments enabled the technology used in modern DNA-based sensing. First, in 1983 Kary Mullis invented the polymerase chain reaction (PCR) technique, a method for amplifying DNA concentrations. This discovery made possible the detection of extremely small quantities of DNA in samples. Second, in 1986 Hood and co-workers devised a method to label DNA molecules with fluorescent tags instead of radiolabels, thus enabling hybridization experiments to be observed optically.
+
+Figure 1 shows the makeup of a typical biochip platform. The actual sensing component (or "chip") is just one piece of a complete analysis system. Transduction must be done to translate the actual sensing event (DNA binding, oxidation/reduction, etc.) into a format understandable by a computer (voltage, light intensity, mass, etc.), which then enables additional analysis and processing to produce a final, human-readable output. The multiple technologies needed to make a successful biochip, from sensing chemistry to microarraying to signal processing, require a true multidisciplinary approach, making the barrier to entry steep. One of the first commercial biochips was introduced by Affymetrix. Their "GeneChip" products contain thousands of individual DNA sensors for use in sensing defects, or single nucleotide polymorphisms (SNPs), in genes such as p53 (a tumor suppressor) and BRCA1 and BRCA2 (related to breast cancer). The chips are produced by using microlithography techniques traditionally used to fabricate integrated circuits (see below).
+
+== Microarray fabrication ==
+
+The microarray—the dense, two-dimensional grid of biosensors—is the critical component of a biochip platform. Typically, the sensors are deposited on a flat substrate, which may either be passive (e.g. silicon or glass) or active, the latter
+consisting of integrated electronics or micromechanical devices that perform or assist signal transduction. Surface chemistry is used to covalently bind the sensor molecules to the substrate medium. The fabrication of microarrays is non-trivial and is a major economic and technological hurdle that may ultimately decide the success of future biochip platforms. The primary manufacturing challenge is the process of placing each sensor at a specific position (typically on a Cartesian grid) on the substrate. Various means exist to achieve the placement, but typically robotic micro-pipetting or micro-printing systems are used to place tiny spots of sensor material on the chip surface. Because each sensor is unique, only a few spots can be placed at a time. The low-throughput nature of this
+process results in high manufacturing costs.
+Fodor and colleagues developed a unique fabrication process (later used by Affymetrix) in which a series of microlithography steps is used to combinatorially synthesize hundreds of thousands of unique, single-stranded DNA sensors on a substrate one nucleotide at a time.  One lithography step is needed per base type; thus, a total of four steps is required per nucleotide level. Although this technique is very powerful in that many sensors can be created simultaneously, it is currently only feasible for creating short DNA strands (15–25 nucleotides). Reliability and cost factors limit the number of photolithography steps that can be done. Furthermore, light-directed combinatorial synthesis techniques are not currently possible for proteins or other sensing molecules.
+As noted above, most microarrays consist of a Cartesian grid of sensors. This approach is used chiefly to map or "encode" the coordinate of each sensor to its function. Sensors in these arrays typically use a universal signalling technique (e.g. fluorescence), thus making coordinates their only identifying feature. These arrays must be made using a serial process (i.e. requiring multiple, sequential steps) to ensure that each sensor is placed at the correct position.
+"Random" fabrication, in which the sensors are placed at arbitrary positions on the chip, is an alternative to the serial method. The tedious and expensive positioning process is
+not required, enabling the use of parallelized self-assembly techniques. In this approach, large batches of identical sensors can be produced; sensors from each batch are then combined and assembled into an array. A non-coordinate based encoding scheme must be used to identify each sensor. As the figure shows, such a design was first demonstrated (and later commercialized by Illumina) using functionalized beads placed randomly in the wells of an etched fiber optic cable. Each bead was uniquely encoded with a fluorescent signature. However, this encoding scheme is limited in the number of unique dye combinations that can be used and successfully differentiated.
--- a/data/en.wikipedia.org/wiki/Biochip-1.md
+++ b/data/en.wikipedia.org/wiki/Biochip-1.md
@ -0,0 +1,55 @@
+---
+title: "Biochip"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Biochip"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:31.098536+00:00"
+instance: "kb-cron"
+---
+
+== Protein biochip array and other microarray technologies ==
+Microarrays are not limited to DNA analysis; protein microarrays, antibody microarrays, and chemical compound microarrays can also be produced using biochips.  Randox Laboratories Ltd. launched Evidence, the first protein Biochip Array Technology analyzer, in 2003.  In protein Biochip Array Technology, the biochip replaces the ELISA plate or cuvette as the reaction platform.  The biochip is used to simultaneously analyze a panel of related tests in a single sample, producing a patient profile.  The patient profile can be used in disease screening, diagnosis, monitoring disease progression or monitoring treatment.  Performing multiple analyses simultaneously, described as multiplexing, allows a significant reduction in processing time and the amount of patient sample required.  Biochip Array Technology is a novel application of a familiar methodology, using sandwich, competitive and antibody-capture immunoassays. The difference from conventional immunoassays is that the capture ligands are covalently attached to the surface of the biochip in an ordered array rather than in solution.
+In sandwich assays an enzyme-labelled antibody is used; in competitive assays an enzyme-labelled antigen is used.  On antibody-antigen binding, a chemiluminescence reaction produces light.  Detection is by a charge-coupled device (CCD) camera.  The CCD camera is a sensitive and high-resolution sensor able to accurately detect and quantify very low levels of light.  The test regions are located using a grid pattern, and the chemiluminescence signals are then analysed by imaging software to rapidly and simultaneously quantify the individual analytes.
+Biochips are also used in the field of microphysiometry e.g. in skin-on-a-chip applications.
+For details about other array technologies, see Antibody microarray.
+
+== Types ==
+There are several types of biotechnology chips, each designed for specific applications. 
+
+=== DNA microarrays ===
+DNA microarrays are perhaps the most widely used biotechnology chips. They consist of glass slides, silicon substrates, or polymer-based supports that can bind to specific DNA sequences. Researchers use DNA microarrays to detect gene expression, analyze genetic variation, and explore gene function.
+
+=== Protein chips ===
+Protein chips (also known as proteomics chips) are designed to detect and analyze proteins. These chips contain arrays of immobilized proteins or antibodies, which can be used for profiling protein interactions, identification, and quantification.
+
+=== Lab-on-a-chip (LOC) ===
+Lab-on-a-chip (LOC) devices integrate multiple laboratory functions into a single chip. These chips incorporate sample preparation, reaction, analysis, and detection into one compact platform. LOC devices are used for clinical diagnostics, environmental monitoring, and chemical analysis.
+
+=== Cell chips ===
+Cell chips are designed to grow and analyze living cells. They provide a platform for studying cellular behavior, drug interactions, and cell signaling. Cell-based chips are capable of simulating various physiological and pathological environments, providing high-throughput screening capabilities. This allows for the rapid evaluation of potential drug effects on cells, including drug toxicity, efficacy, and their impact on cellular signaling pathways.
+
+=== Microfluidic chips ===
+Microfluidic chips manipulate tiny amounts of liquids and gases in channels with micro-scale dimensions. Commonly fabricated from materials such as polydimethylsiloxane (PDMS), glass, and thermoplastic polymers, these chips demonstrate outstanding biocompatibility and optical properties. These chips are used for a variety of biological applications, including PCR amplification, cell sorting, and DNA sequencing.
+
+== Applications ==
+Biotechnology chips have a wide range of applications across many fields.
+
+=== Medical diagnostics ===
+Biotechnology chips are widely used in medical diagnostics for detecting diseases such as cancer, infections, and genetic disorders, owing to their advantages of high throughput, high sensitivity, rapid detection, and low sample consumption. These chips can rapidly analyze samples of blood, saliva, or tissue to detect genetic mutations, infectious agents, and biomarkers.
+
+=== Drug development and personalized medicine ===
+Biotechnology chips play a key role in the development of new drugs. They enable high-throughput screening of potential drug compounds and help identify biomarkers for personalized treatment plans. Additionally, the use of biotechnology chips allows for more efficient testing of drug efficacy and safety.
+
+=== Genomics and proteomics ===
+Biotechnology chips are used in genomics and proteomics research. DNA microarrays and protein chips enable scientists to analyze large amounts of genetic and protein data simultaneously. These chips facilitate the study of gene expression, genetic variation, and protein interactions, advancing the understanding of complex biological systems.
+
+=== Environmental monitoring ===
+Lab-on-a-chip and microfluidic devices are also used for environmental monitoring. These chips can test water, air, and soil samples for contaminants, pathogens, and toxins. Their portable nature allows for on-site analysis, making them valuable for environmental research and disaster response.
+
+=== Agricultural biotechnology ===
+Biotechnology chips are used in agricultural research to analyze crops, monitor soil health, and study plant pathogens. They enable more efficient and precise breeding programs, improving crop yield, pest resistance, and disease prevention.
+
--- a/data/en.wikipedia.org/wiki/Bioimage_informatics-0.md
+++ b/data/en.wikipedia.org/wiki/Bioimage_informatics-0.md
@ -0,0 +1,46 @@
+---
+title: "Bioimage informatics"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Bioimage_informatics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:33.460591+00:00"
+instance: "kb-cron"
+---
+
+Bioimage informatics is a subfield of bioinformatics and computational biology. It focuses on the use of computational techniques to analyze bioimages, especially cellular and molecular images, at large scale and high throughput. The goal is to obtain useful knowledge from complicated and heterogeneous images and related metadata.
+Automated microscopes are able to collect large numbers of images with minimal intervention. This has led to a data explosion that requires automatic processing. Moreover, for several of these tasks there is evidence that automated systems can perform better than humans. Automated systems are also unbiased, unlike human-based analysis, whose evaluation may (even unconsciously) be influenced by the desired outcome.
+There has been an increasing focus on developing novel image processing, computer vision, data mining, database and visualization techniques to extract, compare, search and manage the biological knowledge in these data-intensive problems.
+
+== Data Modalities ==
+Several data collection systems and platforms are used, which require different methods to be handled optimally.
+
+=== Fluorescent Microscopy ===
+
+Fluorescent microscopy allows the direct visualization of molecules at the subcellular level, in both live and fixed cells. Molecules of interest are marked with either green fluorescent protein (GFP), another fluorescent protein, or a fluorescently labeled antibody. Several types of microscope are regularly used: widefield, confocal, or two-photon. Most microscopy systems also support the collection of time series (movies).
+In general, filters are used so that each dye is imaged separately (for example, a blue filter is used to image Hoechst, then rapidly switched to a green filter to image GFP). For viewing, the images are often displayed in false color, with each channel shown in a different color; these colors may not even be related to the original wavelengths used. In some cases, the original image could even have been acquired in non-visible wavelengths (infrared is common).
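The false-color display described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `false_color`, the tiny 2×2 images, and the choice of blue for Hoechst and green for GFP are all hypothetical, and intensities are assumed to be already scaled to the range 0 to 1.

```python
def false_color(channels, colors):
    """Combine grayscale channels into one RGB image.

    channels: list of 2D lists with intensities in [0, 1]
    colors:   list of (r, g, b) display colors, one per channel
    """
    height, width = len(channels[0]), len(channels[0][0])
    out = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for chan, (r, g, b) in zip(channels, colors):
        for y in range(height):
            for x in range(width):
                v = chan[y][x]
                pr, pg, pb = out[y][x]
                # Add this channel's contribution, clipping at full brightness.
                out[y][x] = (
                    min(1.0, pr + v * r),
                    min(1.0, pg + v * g),
                    min(1.0, pb + v * b),
                )
    return out


# Hypothetical 2x2 acquisitions: one nuclear stain, one GFP channel.
hoechst = [[1.0, 0.0], [0.0, 0.0]]
gfp = [[0.0, 0.5], [0.0, 0.0]]
composite = false_color([hoechst, gfp], [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)])
```

The display colors are arbitrary arguments, which mirrors the point in the text that they need not correspond to the acquisition wavelengths.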
+The choices at the image acquisition stage will influence the analysis and often require special processing. Confocal stacks will require 3D processing and widefield pseudo-stacks will often benefit from digital deconvolution to remove the out-of-focus light.
+The advent of automated microscopes that can acquire many images automatically is one of the reasons why analysis cannot be done by eye (otherwise, annotation would rapidly become the research bottleneck). Using automated microscopes means that some images might be out of focus (automated focus-finding systems may sometimes be incorrect), contain a small number of cells, or be filled with debris. Therefore, the images generated will be harder to analyse than images acquired by an operator, who would have chosen other locations to image and focused correctly. On the other hand, the operator might introduce an unconscious bias in their selection by choosing only the cells whose phenotype is most like the one expected before the experiment.
+
+=== Histology ===
+
+Histology is a microscopy application where tissue slices are stained and observed under the microscope (typically a light microscope, but electron microscopy is also used).
+When using a light microscope, unlike the case of fluorescent imaging, images are typically acquired using standard color camera-systems. This reflects partially the history of the field, where humans were often interpreting the images, but also the fact that the sample can be illuminated with white light and all light collected rather than having to excite fluorophores. When more than one dye is used, a necessary preprocessing step is to unmix the channels and recover an estimate of the pure dye-specific intensities.
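As a rough illustration of the unmixing step, the sketch below assumes the simplest possible model: two measured channels, two dyes, and a known linear mixing matrix. The function name `unmix_two_dyes` and the matrix values are hypothetical; real stain separation typically works on three color channels and on optical densities rather than raw intensities.

```python
def unmix_two_dyes(pixel, mixing):
    """Recover per-dye intensities from a two-channel measurement.

    pixel:  (c1, c2) measured channel values
    mixing: 2x2 matrix; mixing[i][j] is the contribution of dye j to channel i
    """
    (a, b), (c, d) = mixing
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("dye signatures are not separable")
    c1, c2 = pixel
    # Apply the inverse of the 2x2 mixing matrix.
    dye1 = (d * c1 - b * c2) / det
    dye2 = (a * c2 - c * c1) / det
    return dye1, dye2


# Hypothetical dye signatures: dye 1 shows mostly in channel 1, dye 2 in channel 2.
mixing = [[0.9, 0.2], [0.1, 0.8]]
measured = (0.9 * 2.0 + 0.2 * 1.0, 0.1 * 2.0 + 0.8 * 1.0)  # true amounts: 2.0 and 1.0
dye1, dye2 = unmix_two_dyes(measured, mixing)
```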
+It has been shown that the subcellular location of stained proteins can be identified from histology images.
+If the goal is a medical diagnostic, then histology applications will often fall into the realm of digital pathology or automated tissue image analysis, which are sister fields of bioimage informatics. The same computational techniques are often applicable, but the goals are medically- rather than research-oriented.
+
+== Important Problems ==
+
+=== Subcellular Location Analysis ===
+
+Subcellular location analysis was one of the initial problems in this field. In its supervised mode, the problem is to learn a classifier that can recognize the major cell organelles from images.
+The methods used are based on machine learning: a discriminative classifier is built from numeric features computed from the image. Features are either generic features from computer vision, such as Haralick texture features, or features specially designed to capture biological factors (co-localization with a nuclear marker being a typical example).
+For the basic problem of identifying organelles, very high accuracy values can be obtained, including better-than-human results. These methods are useful in basic cell biology research, but have also been applied to the discovery of proteins whose location changes in cancer cells.
+However, classification into organelles is a limited form of the problem as many proteins will localize to multiple locations simultaneously (mixed patterns) and many patterns can be distinguished even though they are not different membrane-bound components. There are several unsolved problems in this area and research is ongoing.
+
+=== High-Content Screening ===
+
+High-throughput screens using automated imaging technology (sometimes called high-content screening) have become a standard method for both drug discovery and basic biological research. Using multi-well plates, robotics, and automated microscopy, the same assay can be applied very rapidly to a large library of possible reagents (typically either small molecules or RNAi), obtaining thousands of images in a short amount of time. Due to the high volume of data generated, automatic image analysis is a necessity.
+When positive and negative controls are available, the problem can be approached as a classification problem and the same techniques of feature computation and classification that are used for subcellular location analysis can be applied.
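The control-based classification setup can be sketched with a nearest-centroid rule. This is a minimal stand-in, not the method used in any particular screen: the feature vectors are invented, the function names are hypothetical, and practical pipelines use many more features and stronger classifiers.

```python
import math


def centroid(vectors):
    """Mean feature vector of a group of wells."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def classify(features, positive_controls, negative_controls):
    """Label a well by whichever control centroid is closer in feature space."""
    d_pos = math.dist(features, centroid(positive_controls))
    d_neg = math.dist(features, centroid(negative_controls))
    return "positive-like" if d_pos < d_neg else "negative-like"


# Hypothetical two-feature measurements per well (e.g. two texture statistics).
positive = [[0.9, 0.8], [1.0, 0.9]]
negative = [[0.1, 0.2], [0.2, 0.1]]
label_a = classify([0.8, 0.9], positive, negative)
label_b = classify([0.15, 0.1], positive, negative)
```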
+
+=== Segmentation ===
--- a/data/en.wikipedia.org/wiki/Bioimage_informatics-1.md
+++ b/data/en.wikipedia.org/wiki/Bioimage_informatics-1.md
@ -0,0 +1,43 @@
+---
+title: "Bioimage informatics"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Bioimage_informatics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:33.460591+00:00"
+instance: "kb-cron"
+---
+
+Segmentation of cells is an important sub-problem in many of the fields below (and sometimes useful on its own if the goal is only to obtain a cell count in a viability assay). The goal is to identify the boundaries of cells in a multi-cell image. This allows for processing each cell individually to measure parameters. In 3D data, segmentation must be performed in 3D space.
+As the imaging of a nuclear marker is common across many images, a widely used protocol is to segment the nuclei. This can be useful by itself if nuclear measurements are needed or it can serve to seed a watershed which extends the segmentation to the whole image.
+All major segmentation methods have been reported on cell images, from simple thresholding to level set methods. Because there are multiple image modalities and different cell types, each of which implies different tradeoffs, there is no single accepted solution for this problem.
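The simplest end of that spectrum, thresholding a nuclear channel and counting connected regions, can be sketched as follows. The function name `label_nuclei`, the threshold, and the toy image are assumptions made for illustration; the labelled regions could in principle serve as seeds for a watershed, which is not shown here.

```python
from collections import deque


def label_nuclei(image, threshold):
    """Threshold a 2D image and label 4-connected foreground regions.

    Returns (labels, count), where labels is 0 for background and
    1..count for each detected region.
    """
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] > threshold and labels[y][x] == 0:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                # Breadth-first flood fill over neighbouring foreground pixels.
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] > threshold
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Hypothetical nuclear-marker image with two bright blobs.
image = [
    [9, 9, 0, 0, 0],
    [9, 0, 0, 8, 8],
    [0, 0, 0, 8, 0],
]
labels, n_nuclei = label_nuclei(image, threshold=5)
```

A single global threshold fails when nuclei touch or when illumination is uneven, which is one reason no single method is accepted for all modalities.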
+Cell image segmentation is often used to study the gene expression and colocalization relationships of individual cells. In such single-cell analyses it is often necessary to uniquely determine the identities of cells while segmenting them. Such a recognition task is often computationally non-trivial. For model organisms such as C. elegans that have well-defined cell lineages, it is possible to explicitly recognize the cell identities via image analysis, by combining image segmentation and pattern recognition methods. Simultaneous segmentation and recognition of cells has also been proposed as a more accurate solution for this problem when an "atlas" or other prior information about the cells is available. Since gene expression at single-cell resolution can be obtained using these imaging-based approaches, it is possible to combine these methods with other single-cell gene expression quantification methods such as RNAseq.
+
+=== Tracking ===
+Tracking is another traditional image processing problem which appears in bioimage informatics. The problem is to relate objects that appear in subsequent frames of a film. As with segmentation, the problem can be posed in both two- and three-dimensional forms.
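One simple way to relate objects across two frames is greedy nearest-neighbour linking, sketched below. This is an illustrative baseline only; the function name `link_frames`, the coordinates, and the distance cutoff are hypothetical, and it ignores cell division, merging, and motion models.

```python
import math


def link_frames(prev_points, next_points, max_distance):
    """Greedily pair objects in consecutive frames by increasing distance.

    Returns a dict mapping an index in prev_points to an index in next_points.
    Objects farther apart than max_distance are left unlinked.
    """
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(prev_points)
        for j, q in enumerate(next_points)
    )
    links, used_prev, used_next = {}, set(), set()
    for dist, i, j in candidates:
        if dist > max_distance:
            break  # all remaining candidate pairs are even farther apart
        if i not in used_prev and j not in used_next:
            links[i] = j
            used_prev.add(i)
            used_next.add(j)
    return links


# Hypothetical object centroids (x, y) in two consecutive frames.
frame_a = [(0.0, 0.0), (10.0, 10.0)]
frame_b = [(10.5, 9.5), (0.5, 0.2), (30.0, 30.0)]
links = link_frames(frame_a, frame_b, max_distance=2.0)
```

The same code works unchanged for three-dimensional coordinates, since `math.dist` accepts points of any matching dimension.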
+In the case of fluorescent imaging, tracking must often be performed on very low contrast images. Obtaining high contrast requires shining more light on the sample, which damages the sample and destroys the dye, so illumination is kept at a minimum. It is often useful to think of a photon budget: the number of photons that can be used for imaging before the damage to the sample is so great that the data can no longer be trusted. Therefore, if high contrast images are to be obtained, only a few frames can be acquired, while for long movies each frame will be of very low contrast.
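The photon-budget trade-off is simple arithmetic, shown here with invented numbers; the budget and per-frame photon counts are purely hypothetical and do not describe any real sample or instrument.

```python
def frames_within_budget(photon_budget, photons_per_frame):
    """Number of whole frames that can be acquired before the budget is spent."""
    return photon_budget // photons_per_frame


budget = 1_000_000  # hypothetical total photons the sample tolerates
bright_movie = frames_within_budget(budget, 100_000)  # high contrast per frame
dim_movie = frames_within_budget(budget, 1_000)       # low contrast per frame
```

With these assumed values, the high-contrast setting allows only 10 frames, while the low-contrast setting allows 1,000.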
+
+=== Registration ===
+
+When image data of different natures are considered, such as images obtained with different labeling methods, from different individuals, or at different time points, the images often need to be registered for better comparison. For example, when time-course data is collected, images in subsequent frames must often be registered so that minor shifts in the camera position can be corrected. Similarly, when many images of a model animal (e.g. C. elegans, a Drosophila brain, or a mouse brain) are collected, there is often a substantial need to register these images to compare their patterns (e.g. whether they correspond to the same or different neuron populations, or whether they share or differ in gene expression).
+Medical image registration software packages were the first tools applied to microscopic image registration. However, because image files are often much larger and experiments involve many more specimens, new 3D image registration software often has to be developed. BrainAligner is software that has been used to automate the 3D deformable and nonlinear registration process using a reliable-landmark-matching strategy. It has been primarily used to generate more than 50,000 3D standardized fruitfly brain images at Janelia Farm of HHMI, with other applications including dragonfly and mice.
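The time-course case mentioned above, correcting a small camera shift between frames, can be sketched as a brute-force search for the translation that minimizes the mean squared difference. This is a 2D, translation-only toy; it is not the deformable landmark-based approach of BrainAligner, and the function names and the synthetic `scene` pattern are assumptions made for the example.

```python
def estimate_shift(reference, moving, max_shift):
    """Find (dy, dx) such that moving[y + dy][x + dx] best matches reference[y][x]."""
    height, width = len(reference), len(reference[0])
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err, n = 0.0, 0
            for y in range(height):
                for x in range(width):
                    my, mx = y + dy, x + dx
                    if 0 <= my < height and 0 <= mx < width:
                        err += (reference[y][x] - moving[my][mx]) ** 2
                        n += 1
            # Compare mean error over the overlapping region only.
            if n and err / n < best_err:
                best, best_err = (dy, dx), err / n
    return best


def scene(y, x):
    """Hypothetical non-repeating intensity pattern used as test data."""
    return 3 * y * y + 5 * x * x + y * x


size = 6
reference = [[scene(y, x) for x in range(size)] for y in range(size)]
# The same scene after the camera moved by one row and two columns.
moving = [[scene(y - 1, x - 2) for x in range(size)] for y in range(size)]
shift = estimate_shift(reference, moving, max_shift=2)
```

The exhaustive search scales poorly with image size and shift range, which hints at why large 3D datasets call for dedicated registration software.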
+
+== Important Venues ==
+A consortium of scientists from universities and research institutes has organized annual meetings on bioimage informatics since 2005. The ISMB conference has had a Bioimaging & Data Visualization track since 2010. The journal Bioinformatics also introduced a Bioimage Informatics track in 2012. The open-access journal BMC Bioinformatics has a section devoted to bioimage analysis, visualization and related applications. Other computational biology and bioinformatics journals also regularly publish bioimage informatics work. A European Union COST Action called NEUBIAS (Network of European BioImage Analysts) has been organizing annual conferences as well as bioimage analyst training schools and taggathons since 2017.
+
+== Software ==
+There are several packages that make bioimage informatics methods available through a graphical user interface, such as ImageJ, FIJI, CellProfiler and Icy. Visualization and analysis platforms such as Vaa3D have appeared in recent years and have been used both in large-scale projects, especially in neuroscience, and in desktop applications.
+
+Other researchers develop their own methods, typically based on a programming language with good computer vision support such as Python, C++, or MATLAB. The Mahotas library for Python is one popular example. However, researcher-developed methods also exist in programming languages with less computer vision support, such as R (e.g. trackdem).
+
+== See also ==
+Focus stacking: the technique of combining multiple images with different focus distances into one.
+High-content screening
+Digital pathology
+Medical imaging
+
+== External links ==
+Vaa3D: High-performance multi-dimensional image visualization and analysis
+Bioformats: the image file I/O engine that supports dozens of formats
+
--- a/data/en.wikipedia.org/wiki/Bioinformatics_Institute_(Singapore)-0.md
+++ b/data/en.wikipedia.org/wiki/Bioinformatics_Institute_(Singapore)-0.md
@ -4,7 +4,7 @@ chunk: 1/1
 source: "https://en.wikipedia.org/wiki/Bioinformatics_Institute_(Singapore)"
 category: "reference"
 tags: "science, encyclopedia"
-date_saved: "2026-05-05T10:35:10.060975+00:00"
+date_saved: "2026-05-05T14:01:35.961204+00:00"
 instance: "kb-cron"
 ---

--- a/data/en.wikipedia.org/wiki/Bioinformatics_Open_Source_Conference-0.md
+++ b/data/en.wikipedia.org/wiki/Bioinformatics_Open_Source_Conference-0.md
@ -0,0 +1,57 @@
+---
+title: "Bioinformatics Open Source Conference"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Bioinformatics_Open_Source_Conference"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:37.154707+00:00"
+instance: "kb-cron"
+---
+
+The Bioinformatics Open Source Conference (BOSC) is an academic conference on open-source programming and other open science practices in bioinformatics, organised by the Open Bioinformatics Foundation. The conference has been held annually since 2000 and is run as a two-day meeting, either within the Intelligent Systems for Molecular Biology (ISMB) conference or as a joint conference with the Galaxy community.
+
+
+== Program ==
+The conference is held as a single track consisting of presentations, poster sessions and two keynote talks by people of influence in open-source bioinformatics.
+Since 2010, an informal two-day "CollaborationFest" (formerly Codefest) has been held directly preceding the conference.
+
+
+== History ==
+National Institutes of Health Associate Director for Data Science Philip Bourne and C. Titus Brown gave keynote talks at BOSC 2014.
+BOSC 2016 was organized in Orlando, Florida from July 8–9 before the main ISMB conference.
+In 2018 and 2020, BOSC partnered with Galaxy to organize two joint conferences, called GCCBOSC and the Bioinformatics Community Conference (BCC) respectively. The 2018 event was held in Portland, Oregon. The BCC in 2020 took place online, with two time schedules for eastern and western time zones.
+Since 2021, BOSC has been taking place within the ISMB conferences again. In 2023, BOSC took place in Lyon, France, on July 24–28 as part of the ISMB/ECCB conference.
+
+
+== BOSC 2024 ==
+The BOSC 2024 conference was part of the 2024 Intelligent Systems for Molecular Biology conference. The 2024 event, which took place in Montreal, Canada, also marked the 25th anniversary of the conference.
+The conference was held in a hybrid setting, with around 200 in-person attendees and the rest watching live remotely. The conference covered a wide variety of topics, with the main theme focusing on approaches to using artificial intelligence (AI) and machine learning (ML) in bioinformatics.
+
+
+=== Keynote Speakers ===
+The conference featured two keynote speakers.
+One of them, Dr. Mélanie Courtot, gave a presentation titled "The Data Shows We Need Better Data" on day one of the conference. During her speech, she introduced the TRUE principles for preparing data for AI tools.
+The second keynote speaker, Andrew Su, presented on day two with a talk titled "Open Data, Knowledge Graphs, and Large Language Models". This presentation discussed how to verify the accuracy of answers produced by large language models (LLMs). One solution he presented was retrieval-augmented generation (RAG).
+
+
+=== Other Presentations ===
+Besides the keynotes, a total of 36 talks and 23 posters were selected for presentation at the conference over the course of multiple sessions. One session, Data Analysis, covered approaches to analyzing biomedical data, different types of data that are freely available for use, and some of the research that has been done using these open-source tools and data. Another, the Open Data Session, included presentations about some of the freely available databases, open data portals, and platforms that are being used by researchers around the world. The Visualization session included presentations about new additions to older biological databases.
+The next session was titled “Standards and Frameworks for Open Science” and covered how to create consistent, reusable, and long-lasting open-source software. The final session, “Open Approaches to AI/ML”, was about how to use machine learning to solve biological problems.
+
+
+==== Open Panel Discussion ====
+
+The events of day two concluded with an open panel discussion titled “Open Source AI/ML: A Game Changer for Bioinformatics?”. 
+The researchers on the panel included Lawrence Hunter, Thomas Hervé Mboa Nkoudou, Mélanie Courtot, and Andrew Su. The moderator of the panel was Monica Munoz-Torres. This discussion explored the benefits and drawbacks of using artificial intelligence and machine learning in bioinformatics research.
+
+
+== External links ==
+BOSC 2024 conference page
+Full recorded BOSC 2024 open panel discussion
+Other BOSC 2024 presentations
+Dr. Mélanie Courtot BOSC 2024 keynote speech
+Andrew Su BOSC 2024 keynote speech
+ISMB 2024 conference page
+
+
--- a/data/en.wikipedia.org/wiki/Bioinformatics_discovery_of_non-coding_RNAs-0.md
+++ b/data/en.wikipedia.org/wiki/Bioinformatics_discovery_of_non-coding_RNAs-0.md
@ -0,0 +1,41 @@
+---
+title: "Bioinformatics discovery of non-coding RNAs"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Bioinformatics_discovery_of_non-coding_RNAs"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:34.656882+00:00"
+instance: "kb-cron"
+---
+
+Non-coding RNAs have been discovered using both experimental and bioinformatic approaches.  Bioinformatic approaches can be divided into three main categories.  The first involves homology search, although these techniques are by definition unable to find new classes of ncRNAs.  The second category includes algorithms designed to discover specific types of ncRNAs that have similar properties.  Finally, some discovery methods are based on very general properties of RNA, and are thus able to discover entirely new kinds of ncRNAs.
+
+
+== Discovery by homology search ==
+Homology search refers to the process of searching a sequence database for RNAs that are similar to already known RNA sequences.  Any algorithm that is designed for homology search of nucleic acid sequences can be used, e.g., BLAST.  However, such algorithms typically are not as sensitive or accurate as algorithms specifically designed for RNA.
+Of particular importance for RNA is its conservation of a secondary structure, which can be modeled to achieve additional accuracy in searches. For example, covariance models can be viewed as an extension of a profile hidden Markov model that also reflects conserved secondary structure. Covariance models are implemented in the Infernal software package.
+
+
+== Discovery of specific types of ncRNAs ==
+Some types of RNAs have shared properties that algorithms can exploit. For example, tRNAscan-SE is specialized for finding tRNAs. The heart of this program is a tRNA homology search based on covariance models, but other tRNA-specific search programs are used to accelerate searches.
+The properties of snoRNAs have enabled the development of programs to detect new examples of snoRNAs, including those that might be only distantly related to previously known examples.  Computer programs implementing such approaches include snoscan and snoReport.
+Similarly, several algorithms have been developed to detect microRNAs.  Examples include miRNAFold and miRNAminer.
+
+
+== Discovery by general properties ==
+Some properties are shared by multiple unrelated classes of ncRNA, and these properties can be targeted to discover new classes. Chief among them is the conservation of an RNA secondary structure. To measure conservation of secondary structure, it is necessary to first find homologous sequences that might exhibit a common structure. Strategies to do this have included using BLAST between two sequences or multiple sequences, exploiting synteny via orthologous genes, and using locality-sensitive hashing in combination with sequence and structural features.
+Mutations that change the nucleotide sequence but preserve secondary structure are called covariation, and can provide evidence of conservation. Other statistics and probabilistic models can be used to measure such conservation. The first ncRNA discovery method to use structural conservation was QRNA, which compared the probabilities of an alignment of two sequences based on either an RNA model or a model in which only the primary sequence is conserved. Work in this direction has allowed for more than two sequences and included phylogenetic models, e.g., with EvoFold. An approach taken in RNAz involved computing statistics on an input multiple-sequence alignment. Some of these statistics relate to structural conservation, while others measure general properties of the alignment that could affect the expected ranges of the structural statistics. These statistics were combined using a support vector machine.
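The idea of covariation can be illustrated with a very crude statistic over two alignment columns that are proposed to base-pair. This sketch is not the scoring used by QRNA, EvoFold, or RNAz; the function name `covariation_support`, the set of allowed pairs (Watson-Crick plus G-U wobble), and the three-sequence alignment are all assumptions for the example.

```python
# Base pairs treated as compatible with a helix: Watson-Crick plus G-U wobble.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covariation_support(alignment, i, j):
    """Summarize how columns i and j behave across aligned sequences.

    Returns (fraction of sequences whose bases at i and j can pair,
             number of distinct pairing base combinations observed).
    A high fraction with several distinct pair types suggests that the
    sequence changed while the pairing was preserved.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    paired = [p for p in pairs if p in CANONICAL]
    return len(paired) / len(pairs), len(set(paired))


# Hypothetical alignment of a small hairpin; columns 0/6 and 1/5 form the stem.
alignment = [
    "GCAAAGC",
    "ACAAAGU",
    "CCAAAGG",
]
outer = covariation_support(alignment, 0, 6)  # G-C, A-U, C-G
inner = covariation_support(alignment, 1, 5)  # C-G in every sequence
loop = covariation_support(alignment, 2, 4)   # A-A never pairs
```

In this toy alignment the outer stem pair shows covariation (three different pair types, all compatible), whereas the inner pair is merely conserved and therefore gives weaker evidence for structure.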
+Other properties include the appearance of a promoter to transcribe the RNA.  ncRNAs are also often followed by a Rho-independent transcription terminator.
+Using a combination of these approaches, multiple studies have enumerated candidate RNAs.
+Some studies have proceeded to manual analysis of the predictions to arrive at a detailed structural and functional prediction.
+
+
+== See also ==
+6A RNA motif
+AbiF RNA motif
+ARRPOF RNA motif
+CyVA-1 RNA motif
+List of RNA structure prediction software
+
+
--- a/data/en.wikipedia.org/wiki/Biological_data-0.md
+++ b/data/en.wikipedia.org/wiki/Biological_data-0.md
@ -0,0 +1,35 @@
+---
+title: "Biological data"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Biological_data"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:38.379909+00:00"
+instance: "kb-cron"
+---
+
+Biological data refers to a compound or information derived from living organisms and their products. A medicinal compound made from living organisms, such as a serum or a vaccine, could be characterized as biological data. Biological data is highly complex when compared with other forms of data. There are many forms of biological data, including text, sequence data, protein structure, genomic data, amino acids, and links, among others.
+
+== Biological data and bioinformatics ==
+Biological data is closely tied to bioinformatics, a recent discipline that addresses the need to analyze and interpret vast amounts of genomic data.
+In the past few decades, leaps in genomic research have led to massive amounts of biological data. As a result, bioinformatics was created as the convergence of genomics, biotechnology, and information technology, while concentrating on biological data.
+Biological data has also been difficult to define, as bioinformatics is a wide-encompassing field. Further, the question of what constitutes a living organism has been contentious, as "alive" is a nebulous term that encompasses molecular evolution, biological modeling, biophysics, and systems biology. Over the past decade, bioinformatics and the analysis of biological data have thrived as a result of leaps in the technology required to manage and interpret data. The field continues to thrive as society has become more focused on the acquisition, transfer, and exploitation of bioinformatics and biological data.
+
+== Types of biological data ==
+Biological data can be extracted for use in the domains of omics, bio-imaging, and medical imaging. Life scientists value biological data to provide molecular details in living organisms. Tools for DNA sequencing, gene expression (GE), bio-imaging, neuro-imaging, and brain-machine interfaces are all domains that utilize biological data, and model biological systems with high dimensionality.
+Moreover, raw biological sequence data usually refers to DNA, RNA, and amino acids.
+Biological data can also be described as data on biological entities. For instance, characteristics such as sequences, graphs, geometric information, scalar and vector fields, patterns, constraints, images, and spatial information may all be characterized as biological data, as they describe features of biological beings. In many instances, biological data are associated with several of these categories. For instance, as described in the National Institute of Health's report on Catalyzing Inquiry at the Interface of Computing and Biology, a protein structure may be associated with a one-dimensional sequence, a two-dimensional image, a three-dimensional structure, and so on.
+
+=== Biomedical databases ===
+Biomedical databases have often been referred to as the databases of Electronic Health Records (EHRs), genomic data in decentralized federal database systems, and biological data, including genomic data, collected from large-scale clinical studies.
+
+== Bio-hacking and privacy threats ==
+
+=== Bio-hacking ===
+Bio-computing attacks have become more common as recent studies have shown that common tools may allow an assailant to synthesize biological information which can be used to hijack information from DNA-analyses. The threat of biohacking has become more apparent as DNA-analysis increases in commonality in fields such as forensic science, clinical research, and genomics.
+Biohacking can be carried out by synthesizing malicious DNA and inserting it into biological samples. Researchers have established scenarios that demonstrate the threat of biohacking, such as a hacker reaching a biological sample by hiding malicious DNA on common surfaces, such as lab coats, benches, or rubber gloves, which would then contaminate the genetic data.
+However, the threat of biohacking may be mitigated by using techniques similar to those used to prevent conventional injection attacks. Clinicians and researchers may mitigate a bio-hack by extracting genetic information from biological samples and comparing the samples to identify unknown material. Studies have shown that comparing genetic information with biological samples to identify bio-hacking code has been up to 95% effective in detecting malicious DNA inserts in bio-hacking attacks.
+
+=== Genetic samples as personal data ===
+Privacy concerns in genomic research have arisen around the question of whether genomic samples contain personal data or should be regarded as physical matter. Moreover, concerns arise because some countries recognize genomic data as personal data (and apply data protection rules), while other countries regard the samples as physical matter and do not apply the same data protection laws to genomic samples. The General Data Protection Regulation (GDPR) has been cited as a potential legal instrument that may better enforce privacy regulations in bio-banking and genomic research.
+However, ambiguity surrounding the definition of "personal data" in the text of the GDPR, especially regarding biological data, has led to doubts about whether the regulation will be enforced for genetic samples. Article 4(1) states that personal data is defined as "Any information relating to an identified or identifiable natural person ('data subject')".
--- a/data/en.wikipedia.org/wiki/Biological_data-1.md
+++ b/data/en.wikipedia.org/wiki/Biological_data-1.md
@ -0,0 +1,41 @@
+---
+title: "Biological data"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Biological_data"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:38.379909+00:00"
+instance: "kb-cron"
+---
+
+== Applications of deep learning to biological data ==
+As a result of rapid advances in data science and computational power, life scientists have been able to apply data-intensive machine learning methods to biological data, such as deep learning (DL), reinforcement learning (RL), and their combination (deep RL). These methods, alongside increases in data storage and computing, have allowed life scientists to mine biological data and analyze data sets that were previously too large or complex. DL and RL have been used in the field of omics research (which includes genomics, proteomics, and metabolomics). Typically, raw biological sequence data (such as DNA, RNA, and amino acid sequences) is extracted and used to analyze features, functions, structures, and molecular dynamics. From that point onwards, different analyses may be performed, such as gene expression (GE) profiling, splicing junction prediction, and protein-protein interaction evaluation.
+Reinforcement learning, a term stemming from behavioral psychology, is a method of problem solving by learning things through trial and error. Reinforcement learning can be applied to biological data, in the field of omics, by using RL to predict bacterial genomes.
+Other studies have shown that reinforcement learning can be used to accurately predict biological sequence annotation.
+Deep learning (DL) architectures are also useful when trained on biological images. For instance, DL architectures that operate at the pixel level of biological images have been used to identify mitosis in histological images of the breast. DL architectures have also been used to identify nuclei in images of breast cancer cells.
+
+== Challenges to data mining in biomedical informatics ==
+
+=== Complexity ===
+The primary problem facing biomedical data models has typically been complexity, as life scientists in clinical settings and biomedical research face the possibility of information overload. However, information overload has often been a debated phenomenon in medical fields. Computational advances have allowed separate communities to form under different philosophies. For instance, data mining and machine learning researchers search for relevant patterns in biological data, and the architecture does not rely on human intervention. However, there are risks involved in modeling artifacts when human involvement, such as end-user comprehension and control, is lessened.
+Researchers have pointed out that with increasing health care costs and tremendous amounts of underutilized data, health information technologies may be the key to improving the efficiency and quality of healthcare.
+
+=== Database errors and abuses ===
+Electronic health records (EHR) can contain genomic data from millions of patients, and the creation of these databases has resulted in both praise and concern.
+Legal scholars have pointed towards three primary concerns for increasing litigation pertaining to biomedical databases. First, data contained in biomedical databases may be incorrect or incomplete. Second, systemic biases, which may arise from researcher biases or the nature of the biological data, may threaten the validity of research results. Third, the presence of data mining in biological databases can make it easier for individuals with political, social, or economic agendas to manipulate research findings to sway public opinion.
+An example of database misuse occurred in 2009, when the Journal of Psychiatric Research published a study that associated abortion with psychiatric disorders. The purpose of the study was to analyze associations between abortion history and psychiatric disorders, such as anxiety disorders (including panic disorder, PTSD, and agoraphobia) alongside substance abuse disorders and mood disorders.
+However, the study was discredited in 2012 when scientists scrutinized its methodology and found it severely faulty. The researchers had used "national data sets with reproductive history and mental health variables" to produce their findings. However, they had failed to compare women who had unplanned pregnancies and had abortions with women who did not have abortions, while focusing on psychiatric problems that occurred after the terminated pregnancies. As a result, the findings, which appeared to carry scientific credibility, led several states to enact legislation that required women to seek counseling before abortions, due to the potential for long-term mental health consequences.
+Another article, published in the New York Times, demonstrated how Electronic Health Records (EHR) systems could be manipulated by doctors to exaggerate the amount of care they provided for purposes of Medicare reimbursement.
+
+== Biomedical data sharing ==
+Sharing biomedical data has been touted as an effective way to enhance research reproducibility and scientific discovery.
+While researchers struggle with technological issues in sharing data, social issues are also a barrier to sharing biological data. For instance, clinicians and researchers face unique challenges to sharing biological or health data within their medical communities, such as privacy concerns and patient privacy laws such as HIPAA.
+
+=== Attitudes towards data sharing ===
+According to a 2015 study focusing on the attitudes and practices of clinicians and scientific research staff, a majority of the respondents reported data sharing as important to their work, but indicated that their expertise in the subject was low. Of the 190 respondents to the survey, 135 identified themselves as clinical or basic research scientists, and the population of the survey included clinical and basic research scientists in the Intramural Research Program at the National Institutes of Health. The study also found that, among the respondents, sharing data directly with other clinicians was a common practice, but the subjects of the study had little practice uploading data to a repository.
+Within the field of biomedical research, data sharing has been promoted as an important way for researchers to share and reuse data in order to fully capture the benefits towards personalized and precision medicine.
+
+=== Challenges to data sharing ===
+Data sharing in healthcare has remained a challenge for several reasons. Despite research advances in data sharing in healthcare, many healthcare organizations remain reluctant or unwilling to release medical data on account of privacy laws such as the Health Insurance Portability and Accountability Act (HIPAA). Moreover, sharing biological data between institutions requires protecting confidentiality for data that may span several organizations. Handling the syntactic and semantic heterogeneity of data while meeting diverse privacy requirements poses further barriers to data sharing.
+
--- a/data/en.wikipedia.org/wiki/Biological_data_visualization-0.md
+++ b/data/en.wikipedia.org/wiki/Biological_data_visualization-0.md
@ -0,0 +1,44 @@
+---
+title: "Biological data visualization"
+chunk: 1/5
+source: "https://en.wikipedia.org/wiki/Biological_data_visualization"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:39.614425+00:00"
+instance: "kb-cron"
+---
+
+Biological data visualization is a branch of bioinformatics concerned with the application of computer graphics, scientific visualization, and information visualization to different areas of the life sciences. This includes visualization of sequences, genomes, alignments, phylogenies, macromolecular structures, systems biology, microscopy, and magnetic resonance imaging data. Software tools used for visualizing biological data range from simple, standalone programs to complex, integrated systems.
+An emerging trend is the blurring of boundaries between the visualization of 3D structures at atomic resolution, the visualization of larger complexes by cryo-electron microscopy, and the visualization of the location of proteins and complexes within whole cells and tissues. There has also been an increase in the availability and importance of time-resolved data from systems biology, electron microscopy, and cell and tissue imaging.
+
+== Sequence alignment ==
+
+Sequence alignment visualization plays a crucial role in bioinformatics and genomics by enabling researchers to interpret and analyze complex genetic data effectively. Visualizing sequence alignments allows for the identification of similarities, differences, conserved regions, and evolutionary patterns within DNA or protein sequences, aiding in understanding genetic relationships, functional elements, and evolutionary processes. Sequence alignment visualization is essential for several reasons:
+Identifying conserved sequence: Visualization helps researchers identify conserved regions across sequences, which are indicative of functional importance or evolutionary relationships.
+Detecting mutations and variations: Visualization tools enable the detection of mutations, insertions, deletions, and other variations within sequences, providing insights into genetic diversity and disease-causing mutations.
+Understanding evolutionary relationships: By visualizing sequence alignments, researchers can infer evolutionary relationships, construct phylogenetic trees, and study the evolutionary history of species or genes.
+Predicting functional elements: Visualization aids in predicting functional elements such as protein domains, motifs, and regulatory regions within sequences, facilitating functional genomics studies.
+
+Comparing genomes: Comparative genomics relies on sequence alignment visualization to compare genomes, identify orthologous and paralogous genes, and study genome evolution across species.
+
+To visualize sequence alignments and their features, researchers often rely on popular bioinformatics software tools such as Clustal Omega, MUSCLE, T-Coffee, and MAFFT. These tools provide interactive platforms for aligning sequences, highlighting conserved regions, displaying sequence variations, and identifying sequence motifs. Additionally, visualization software like Jalview, BioEdit, and Geneious offers advanced features for visualizing and analyzing sequence alignments, making it easier for researchers to interpret and extract meaningful information from genetic data.
+Techniques
+Besides software tools such as Clustal Omega, MUSCLE, T-Coffee, and MAFFT, several popular techniques exist for genomic sequence alignment visualization, which plays a crucial role in helping researchers understand genetic relationships, functional elements, and evolutionary processes. Common techniques include:
+
+Sequence logo: Sequence logos are graphical representations of sequence alignments that display the conservation of residues at each position as well as the relative frequency of each amino acid or nucleotide. Sequence logos provide a compact and informative visualization of conserved sequence and variability.
+Multiple sequence alignment: Multiple sequence alignment viewers, such as Jalview and MEGA, provide interactive platforms for visualizing and analyzing multiple sequence alignments. These tools offer features for highlighting conserved sequence regions, identifying motifs, and exploring evolutionary relationships within sequences.
+
+Protein structure alignment tools: tools like PyMOL and UCSF Chimera enable the visualization of sequence alignments in the context of protein structures. By superimposing aligned sequences onto protein structures, researchers can analyze the spatial arrangement of conserved residues and functional domains.
+Phylogenetic tree visualization: Phylogenetic tree visualization tools, such as FigTree and iTOL, allow researchers to visualize evolutionary relationships inferred from sequence alignments. These tools provide interactive displays of phylogenetic trees, highlighting branch lengths, node support values, and evolutionary distances.
+Genome browser: Genome browsers like UCSC Genome Browser and Ensembl provide comprehensive platforms for visualizing sequence alignments across entire genomes. Researchers can explore DNA annotation, regulatory elements, and comparative genomics data within the context of genome sequences.
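+
+As a rough illustration of the sequence logo technique listed above, the sketch below computes the per-column information content (in bits) and the resulting letter heights for a small DNA alignment. The function names, the toy alignment, and the choice to ignore gaps and omit a small-sample correction are assumptions made for this example; it is not the algorithm of any particular logo tool.
+
```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Return (information content in bits, {symbol: letter height}) for one column."""
    symbols = [s for s in column if s != "-"]  # assumption: gaps are simply ignored
    total = len(symbols)
    if total == 0:
        return 0.0, {}
    freqs = {s: n / total for s, n in Counter(symbols).items()}
    # Shannon entropy of the observed symbol frequencies
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    # Information content: maximum possible entropy minus observed entropy
    info = math.log2(alphabet_size) - entropy
    # Each letter is drawn with a height proportional to its frequency
    heights = {s: p * info for s, p in freqs.items()}
    return info, heights


def logo_heights(alignment, alphabet_size=4):
    """Compute information content and letter heights for every alignment column."""
    return [column_information(col, alphabet_size) for col in zip(*alignment)]


# Hypothetical toy alignment of four DNA sequences
alignment = ["ACGT", "ACGA", "ACTT", "ACGC"]
for position, (info, heights) in enumerate(logo_heights(alignment), start=1):
    print(position, round(info, 3), heights)
```
+
+Fully conserved columns reach the maximum of 2 bits for DNA, while variable columns score lower, which is what makes conserved positions stand out visually in a logo.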
+
+Applications
+Genomic sequence alignment visualization plays a crucial role in many areas of genomics and bioinformatics, enabling researchers to analyze, interpret, and extract insights from genetic data. Some key applications include:
+Comparative genomics: Sequence alignment visualization is essential for comparative genomics studies, where researchers compare genetic sequences across different species to identify evolutionary relationships, conserved sequence regions, and functional elements. Visualization tools help in detecting similarities and differences between genomes, aiding in the study of evolutionary processes. 
+
+Variant analysis: In the field of genetics and personalized medicine, sequence alignment visualization is used for variant analysis to identify single nucleotide polymorphisms (SNPs), insertions, deletions, and other genetic variation. Visualization tools help researchers pinpoint specific variations in genomic sequences and assess their potential impact on phenotypic traits.
+Phylogenetic analysis: Phylogenetic studies rely on sequence alignment visualization to construct phylogenetic trees and analyze genetic relationships between species or populations. Visualization tools enable researchers to visualize sequence similarities, calculate evolutionary distances, and infer phylogenetic relationships based on sequence alignments.
+Functional genomics: In functional genomics research, sequence alignment visualization is employed to study gene expression, regulatory elements, and protein-protein interactions. By visualizing sequence alignments in the context of functional annotations and gene networks, researchers can elucidate the biological functions and regulatory mechanisms of genes.
+Structural bioinformatics: Sequence alignment visualization is integral to structural bioinformatics, where researchers analyze protein sequences and structures to understand their three-dimensional organization and functional properties. Visualization tools help in aligning protein sequences, predicting structural motifs, and exploring protein-protein interactions.
+
+== Macromolecular ==
+The visualization of macromolecules is critical for understanding the structures and functions that are fundamental to biological systems. Considerable progress has been made in the three-dimensional portrayal of macromolecules, including carbohydrates, proteins, nucleic acids, and their complexes. Recent advances in visualization methods offer greater clarity and detail, improving understanding of the mechanisms that govern the behavior and interactions of biological entities.
+Techniques
--- a/data/en.wikipedia.org/wiki/Biological_data_visualization-1.md
+++ b/data/en.wikipedia.org/wiki/Biological_data_visualization-1.md
@ -0,0 +1,45 @@
+---
+title: "Biological data visualization"
+chunk: 2/5
+source: "https://en.wikipedia.org/wiki/Biological_data_visualization"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:39.614425+00:00"
+instance: "kb-cron"
+---
+
+Segmentation enhances biological imaging interpretation, with automated tools improving data analysis. This has led to a rise in web-based visualization for 3D segmentations. Segmentation plays a vital role in deciphering biological imaging data. The advent of sophisticated automated segmentation technologies, along with their incorporation into public imaging data repositories, greatly enhances the interpretation process.
+Volume rendering reveals internal macromolecular structures without segmentation, providing a non-invasive view inside the molecules.
+Integrating experimental data into visualizations, like overlaying mutations or binding data, offers richer insights. This can be displayed as heat maps or gradients on the molecule, vital for managing the growing complexity of biomolecular data.
+Interactive 3D visualization offers hands-on engagement with macromolecules, allowing for manipulation such as rotation and zooming, which enhances comprehension.
+
+Virtual reality (VR) and augmented reality (AR) present immersive methods to engage with macromolecules, delivering a 3D perspective that screen-based tools cannot match. AR apps have also been designed to help students visualize and interact with 3D macromolecular structures, addressing the limitations of traditional 2D images in conveying spatial details and depth perception.
+Animation of molecular activities illustrates the dynamic behaviors of biomolecules, serving as a powerful educational and research tool. Utilizing Unity3D game engine technology, this approach democratizes the creation of interactive molecular visualization tools, resulting in a user-friendly platform that simplifies complex biological data depiction.
+High-performance computing visualization enables real-time rendering of massive, intricate datasets, a necessity for advanced macromolecular analysis. Software leveraging high-performance computing dynamically and efficiently analyzes drug-receptor interactions via molecular dynamics simulations, offering profound insights and predictions on drug efficacy, and facilitating visualization.
+Hybrid visualization techniques merge various methods to provide a multifaceted view of molecules, combining detailed atomic positions with a holistic understanding of structure and volume.
+Visualization of different types of macromolecules
+
+Carbohydrate visualization
+Visualizations of the Carbohydrate Binding Module (CBM) of cellulase examine its interactions with cellulose during hydrolysis from three angles: the adsorption of CBM to cellulose, its spatial occupation, and the accessibility of the cellulose surface to CBM.
+
+Protein visualization
+The RCSB Protein Data Bank (RCSB PDB), supported by major US scientific agencies, has been a pivotal resource for structural biologists globally and acts as the US data center within the Worldwide Protein Data Bank (wwPDB) partnership. As the designated Archive Keeper, RCSB PDB ensures the security of PDB data and serves tens of thousands of data depositors annually across all inhabited continents using various structural determination methods. The RCSB.org web portal provides unrestricted access to PDB data to millions globally. This article details the growth and evolution of the archive with advancing experimental techniques, the critical role of data standards and integration, and the introduction of new tools and features for 3D structural analysis and visualization over the past year.
+Nucleic acid visualization
+Researchers have developed a swift, straightforward, and precise method for detecting Infectious Bovine Rhinotracheitis Virus (IBRV) in cattle—a virus known for causing chronic infections and substantial economic impacts. This method integrates recombinant polymerase amplification (RPA) with a vertical flow visualization strip (VF) to form an RPA-VF assay that targets the thymidine kinase gene, ensuring fast detection, high specificity, and zero cross-reactivity with other pathogens.
+Large non-polymeric molecules
+The visualization of nanoscale materials is crucial for understanding their structure-function relationships, and it typically requires advanced microscopy and analytical techniques that provide high-resolution and high-magnification images.
+
+Nanoparticles are tiny particles that measure in the range of 1 to 100 nanometers. Due to their small size and high surface area to volume ratio, they exhibit unique chemical and physical properties. Visualization of nanoparticles is typically achieved using high-resolution techniques like Transmission Electron Microscopy (TEM), Scanning Electron Microscopy (SEM), Atomic Force Microscopy (AFM), and Dynamic Light Scattering (DLS) for size distribution analysis.
+
+Nanocomposites are materials that incorporate nanoparticles within a matrix of another material, such as polymers, ceramics, or metals. These composites often exhibit enhanced properties, such as increased strength or electrical conductivity. Visualization of the distribution and interaction of nanoparticles within the matrix can be carried out using techniques like TEM, SEM, and X-ray diffraction (XRD).
+
+Nanotubes, specifically carbon nanotubes (CNTs), are cylindrical structures with diameters as small as 1 nanometer. They have remarkable mechanical, electrical, and thermal properties and are used in various applications from materials science to nanotechnology. Visualization of nanotubes typically requires TEM, SEM, or AFM.
+
+Nanofibers are fibers with diameters in the nanometer scale. They are created through processes like electrospinning and have applications in areas such as filtration, textiles, and biomedicine. Nanofibers can be visualized using SEM, which provides detailed images of their morphology and distribution.
+Visualizing the interactions between macromolecules
+Protein-carbohydrate interactions have been visualized through the hydrogen atoms in a perdeuterated lectin-fucose complex.
+Computational docking plays a vital role in structural biology, with software providing a user-friendly web platform for modeling various macromolecular interactions, such as flexible complexes and membrane-associated assemblies. This enhances accessibility and enriches the user experience within the structural biology community.
+Tools
+PyMOL, Chimera, ChimeraX, Jmol, VMD, Swiss-PdbViewer, Coot, Biovia Discovery Studio, LightDock, and Schrödinger's Maestro are key tools in molecular visualization. Their capabilities range from high-quality 3D imaging and interactive analysis to support for virtual reality and large-scale simulations, and they serve needs in molecular modeling, publication, and education across both open-source and commercial platforms.
+
+== Systems biology ==
--- a/data/en.wikipedia.org/wiki/Biological_data_visualization-2.md
+++ b/data/en.wikipedia.org/wiki/Biological_data_visualization-2.md
@ -0,0 +1,31 @@
+---
+title: "Biological data visualization"
+chunk: 3/5
+source: "https://en.wikipedia.org/wiki/Biological_data_visualization"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:39.614425+00:00"
+instance: "kb-cron"
+---
+
+Systems biology is a branch of biological data visualization dedicated to analyzing and modeling complex biological systems. Popular computational models used in systems biology include process calculi, such as stochastic π-calculus, and constraint-based reconstruction and analysis (COBRA), a paradigm that considers physical, enzymatic, and topological constraints underlying a phenotype in a metabolic network.
+Most data visualization in systems biology is done using mathematically generated models. Researchers diagram all of the protein, gene, or metabolic pathways in a given biological system, then determine the speed of the reactions in that system using mass action kinetics or enzyme kinetics. These values are used as parameters to construct differential equations representing the system, which can then be used to determine the behavior of the components within that system. Alternative mathematical modeling solutions also exist; for instance, a COBRA method such as flux balance analysis could be used to analyze the flow of metabolites through a particular metabolic network.
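+
+A minimal sketch of the differential-equation approach described above, assuming a single hypothetical reaction A + B -> C governed by mass action kinetics and integrated with the forward Euler method. The rate constant, initial concentrations, and step size are illustrative values; real models usually involve many coupled reactions and a dedicated ODE solver.
+
```python
def simulate_mass_action(a0, b0, c0, k, dt, steps):
    """Integrate A + B -> C under mass action kinetics with forward Euler.

    Returns a list of (time, [A], [B], [C]) tuples.
    """
    a, b, c = a0, b0, c0
    trajectory = [(0.0, a, b, c)]
    for step in range(1, steps + 1):
        # Mass action: the reaction rate is proportional to the product
        # of the reactant concentrations.
        rate = k * a * b
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
        trajectory.append((step * dt, a, b, c))
    return trajectory


# Hypothetical parameters chosen only for illustration
trajectory = simulate_mass_action(a0=1.0, b0=0.5, c0=0.0, k=2.0, dt=0.01, steps=500)
for t, a, b, c in trajectory[::100]:
    print(f"t={t:.1f}  A={a:.4f}  B={b:.4f}  C={c:.4f}")
```
+
+The resulting time series is the kind of data that would then be plotted to visualize how the concentration of each component changes over time.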
+Another key imaging method in systems biology is mass spectrometry, which can be used to visualize the spatial distribution of compounds, biomarkers, metabolites, peptides, and/or proteins within the body. This is especially helpful in metabolomics, a branch of systems biology that uses mass spectrometry to measure metabolite distribution information, then uses the measured intensity to construct an image.
+Popular software tools used in systems biology modeling include massPy, Cytosim, and PySB. Further examples may be found at Wikipedia's list of systems biology modeling software.
+
+== Microscopy visualization ==
+Besides optical and electron microscopy, other techniques such as scanning probe, ultraviolet, infrared, digital holographic, laser, and amateur microscopy are also used for visualization.
+
+New approaches
+One study investigated the use of two-photon microscopy, a technique capable of imaging depths up to 800 μm through two-photon absorption, for visualizing microrobotic agents beneath biological tissue, demonstrating its potential for both in vitro and in vivo microrobotics applications.
+Researchers used bright-field light microscopy with high-intensity pulsing LED illumination to capture detailed 12-bit-per-channel images of live cells, addressing data distortions caused by optical path interactions and sensor anomalies with a comprehensive spectroscopic calibration approach, allowing for visualization with minimal information loss in 8-bit intensity depth.
+Researchers explored a community-driven initiative focused on improving the depiction of light microscopy data in scientific publications by adhering to the 'FAIR Data Principles', which aim to enhance data findability, accessibility, interoperability, and reusability. Despite persistent challenges related to data quality and communication, the initiative emphasizes the role of global scientific collaboration in advancing imaging standards and draws on historical insights to guide future advancements in biological imaging.
+
+== Magnetic resonance imaging ==
+
+Magnetic resonance imaging (MRI) is a common form of biological data visualization used to form pictures of internal biological processes. Different settings of radiofrequency pulses and gradients result in different image appearances; these combinations are known as MRI sequences. A particularly notable subset of MRI is magnetic resonance angiography, which is a group of techniques used to image arteries and veins. MRI's imaging utility is further expanded upon by diffusion MRI and functional MRI, which can be used to capture neuronal tracts and blood flow respectively.
+
+Diffusion MRI further relies on diffusion tensor imaging (DTI), which measures water molecule diffusion and directionality, and diffusion basis spectrum imaging (DBSI), which extracts multiple anisotropic and isotropic diffusion tensors. Functional MRI relies on blood-oxygen-level dependent (BOLD) contrast, which measures the proportion of oxygenated hemoglobin in specific areas of the brain; this allows it to measure and model brain activity based on blood flow. Further MRI techniques include saturation pulses (used to reduce motion artifacts), gradient echo (such as dynamic contrast enhancement), spin echo, and diffusion weighting (a signal contrast generation method based on differences in Brownian motion).
+
+To generate an observable image using MRI, the target is placed in a powerful magnetic field, such as that of an MRI machine. This causes the axes of the hydrogen protons inside the target, which are usually randomly aligned at equilibrium, to line up in the same direction, creating a magnetic vector oriented along the magnet's axis. This orientation also allows the hydrogen protons' spin, or frequency of rotation, to be measured. The alignment is then disrupted using radiofrequency (RF) pulses (RF being a type of non-ionizing electromagnetic radiation). When the RF pulse is switched off, the hydrogen protons return to their equilibrium states in a process known as relaxation, and in doing so they emit RF energy. Different tissues relax at different rates, which allows scientists to use specific RF pulse sequences to emphasize particular tissues or abnormalities.
+After a period of time following the RF pulse, the RF energy signals emitted by the protons are measured to obtain frequency information from each location in the imaged plane. Then Fourier transformation is used to convert this frequency information into intensity levels, which are displayed as shades of grey in the generated image.
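+
+The role of the Fourier transformation can be illustrated with a toy one-dimensional model. The sketch below assumes that each position along a line emits a signal at its own frequency, with an amplitude equal to a hypothetical proton density; a discrete Fourier transform then separates the summed signal back into per-position intensities, which are scaled to grey levels. This is a simplified illustration under those assumptions, not a description of a real scanner's reconstruction pipeline.
+
```python
import cmath


def simulate_signal(densities):
    """Sum the emissions of all positions; position x emits at frequency index x."""
    n = len(densities)
    return [
        sum(densities[x] * cmath.exp(2j * cmath.pi * x * t / n) for x in range(n))
        for t in range(n)
    ]


def dft(signal):
    """Normalized discrete Fourier transform: one complex amplitude per frequency."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
        for k in range(n)
    ]


# Hypothetical proton densities along one line of the imaged plane
densities = [0.0, 0.2, 1.0, 0.6, 0.0, 0.0, 0.8, 0.4]

signal = simulate_signal(densities)           # what the receiver would measure
recovered = [abs(v) for v in dft(signal)]     # intensity at each position
brightest = max(recovered)
grey_levels = [round(255 * v / brightest) for v in recovered]
print(grey_levels)
```
+
+Actual MRI reconstruction applies a two-dimensional (or three-dimensional) inverse Fourier transform to sampled frequency-domain data, but the principle of mapping frequency content to spatial intensity is the same.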
--- a/data/en.wikipedia.org/wiki/Biological_data_visualization-3.md
+++ b/data/en.wikipedia.org/wiki/Biological_data_visualization-3.md
@ -0,0 +1,29 @@
+---
+title: "Biological data visualization"
+chunk: 4/5
+source: "https://en.wikipedia.org/wiki/Biological_data_visualization"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:39.614425+00:00"
+instance: "kb-cron"
+---
+
+In general, two aspects of the relaxation process are measured: the time taken for the magnetic vector to return to its resting state (also known as T1 or spin–lattice relaxation), and the time taken for the axial spin of the hydrogen protons to return to its resting state (also known as T2 or spin–spin relaxation). To create a T1-weighted image, the MR signal is measured by changing the amount of time between RF pulses (also known as the time to repeat, or TR). To create a T2-weighted image, the MR signal is measured by changing the amount of time between delivering the RF pulse and receiving the RF energy signals from the hydrogen protons (also known as the time to echo, or TE). The dominant signal intensities of T1 image weighting are fluid (black due to low signal intensity), muscle (grey due to intermediate signal intensity), and fat (white due to high signal intensity). Fat suppression is applied to many T1-weighted sequences to reduce the brightness of the signal created by fat. The dominant signal intensities of T2 image weighting are fluid (white), muscle (grey), and fat (white). T2 signals are also often emphasized or suppressed depending on the goal of the imaging; notable examples include fat suppression, fluid attenuation, and susceptibility weighting.
+Also of note are proton density (PD) weighted images, which are generated using a long TR and a short TE. PD is useful for differentiating between fluid, hyaline cartilage and fibrocartilage, which makes it ideal for imaging joints. Outside of joint imaging it has largely been replaced by fluid attenuated inversion recovery (FLAIR), an inversion recovery sequence that removes the signal from cerebrospinal fluid.
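+
+The effect of TR and TE on image weighting can be sketched with the commonly used simplified spin-echo signal model, S = PD · (1 − exp(−TR/T1)) · exp(−TE/T2). In the example below, the tissue relaxation times, proton densities, and the TR/TE settings are illustrative assumptions chosen for the sketch, not measured or recommended values.
+
```python
import math

# Illustrative relaxation times (milliseconds) and relative proton densities.
# These numbers are assumptions made for this example, not measured values.
TISSUES = {
    "fluid":  {"t1": 4000.0, "t2": 2000.0, "pd": 1.0},
    "muscle": {"t1": 900.0,  "t2": 50.0,   "pd": 0.8},
    "fat":    {"t1": 260.0,  "t2": 80.0,   "pd": 0.9},
}


def spin_echo_signal(tissue, tr, te):
    """Simplified spin-echo signal: PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    recovery = 1.0 - math.exp(-tr / tissue["t1"])  # T1 recovery between RF pulses
    decay = math.exp(-te / tissue["t2"])           # T2 decay before the echo is read
    return tissue["pd"] * recovery * decay


def weighted_image(tr, te):
    """Relative signal of each tissue for a given TR/TE pair (both in ms)."""
    return {name: spin_echo_signal(t, tr, te) for name, t in TISSUES.items()}


t1_weighted = weighted_image(tr=500.0, te=15.0)    # short TR, short TE
t2_weighted = weighted_image(tr=4000.0, te=100.0)  # long TR, long TE
pd_weighted = weighted_image(tr=4000.0, te=15.0)   # long TR, short TE

for label, image in [("T1", t1_weighted), ("T2", t2_weighted), ("PD", pd_weighted)]:
    print(label, {name: round(value, 3) for name, value in image.items()})
```
+
+With these assumed values, fat is brightest and fluid darkest under the short TR and TE setting, while fluid becomes brightest under the long TR and TE setting, in line with the T1 and T2 weighting behavior described in this section.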
+
+== Tomography ==
+
+Computed tomography (CT) and positron emission tomography (PET) scans are similar to MRI, but rely on different imaging techniques (X-rays and ionizing radiation, respectively). A variation of CT known as contrast CT also requires the subject to take in a contrast medium called a radiocontrast (typically by oral consumption, enema, or injection). Positive radiocontrast agents such as barium sulfate increase the body's X-ray attenuation, causing the tissue containing them to appear whiter in the X-ray image. Meanwhile, negative agents such as carbon dioxide gas allow X-rays to pass through them easily, causing the tissues containing them to appear darker.
+Like magnetic resonance imaging, CT scans use numerous methods to display and measure data, including sequential CT (where the CT table steps from location to location), spiral CT (where the entire X-ray tube is spun around the subject), and electron beam tomography (where only the electron paths are spun using deflection coils). PET scanners don't have quite as much hardware variation and instead use different radiotracers depending on what the imaging target is. Note that radiotracers are distinct from radiocontrasts; the former relies on radioactive decay to trace its path while the latter is absorbed into specific tissue and affects that tissue's X-ray attenuation. Because these methods are not mutually exclusive, PET and CT can be performed simultaneously using PET-CT scanners, which are used for the majority of modern PET scans.
+Either or both of these methods can be used in conjunction with maximum intensity projection (MIP) to convert the scan data into a 3D image. This can be difficult to accomplish due to artifacts created by respiration and blood flow, which can appear as abnormalities to an untrained eye; however, these artifacts can be distinguished from real disease so long as careful attention is paid to them. When done well, CT and PET scans taken with MIP are excellent for identifying small abnormal tissue growths, especially in the lungs. Scans taken with MIP for this purpose tend to have higher significance than averaged images created with traditional CT.
+MIP imaging is also used with magnetic resonance angiography, and research has indicated that it could feasibly be used with MRI. At least one study has shown that MIP MRI actually significantly outperforms single-slice MRI when used by neural networks to classify lesions based on malignancy.
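+As a rough illustration of the projection step only, the sketch below keeps, for every pixel position, the brightest voxel found along one viewing axis of a small volume. The function name and the nested-list layout (slice, row, column) are assumptions made for this example; clinical software works on calibrated image data and typically repeats the projection from many angles.
+
```python
def max_intensity_projection(volume):
    """Project a 3D volume (list of 2D slices) onto a single 2D image.

    For each (row, column) position, keep the maximum voxel value
    found across all slices, i.e. along the slice axis.
    """
    rows = len(volume[0])
    cols = len(volume[0][0])
    return [
        [max(slice_[r][c] for slice_ in volume) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2-slice, 2x2 volume with arbitrary intensity values.
volume = [
    [[1, 5], [0, 2]],
    [[3, 4], [7, 1]],
]
projection = max_intensity_projection(volume)
```
+
+In this toy volume each output pixel is simply the larger of the two stacked values at that position, which is why small bright structures stand out against darker surroundings.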
+
+== Alignment ==
+A sequence alignment is a way of arranging the sequences of protein, RNA or DNA, to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. In its simplest form, the so-called pairwise alignment, only two such sequences are compared.
+Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.
+Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all the sequences in each query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.
+Purposes of Alignment Visualization:
+
+Aid general understanding of large-scale DNA or protein alignments. When analyzing data, visualizing it makes clear patterns or relations easier to spot.
+Visualize alignments for figures and publication. A visualization summarizes the multiple sequence alignment in an easy-to-digest form.
+Manually edit and curate automatically generated alignments. Even though there are efficient algorithms, none is perfect and visualization tools provide a way to edit small discrepancies.
--- a/data/en.wikipedia.org/wiki/Biological_data_visualization-4.md
+++ b/data/en.wikipedia.org/wiki/Biological_data_visualization-4.md
@ -0,0 +1,46 @@
+---
+title: "Biological data visualization"
+chunk: 5/5
+source: "https://en.wikipedia.org/wiki/Biological_data_visualization"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:39.614425+00:00"
+instance: "kb-cron"
+---
+
+Regular multiple sequence alignment – Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments color is often used to indicate amino acid properties to aid in judging the conservation of a given amino acid substitution. 
+For multiple sequences the last row in each column is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.
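+The consensus row described above can be sketched as a column-wise majority vote over the aligned rows. This is a minimal illustration, not the method of any particular alignment program: the function name is made up for the example, gaps (`-`) are counted like any other character, and ties are resolved by whichever character appears first in the column.
+
```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent character of each alignment column."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    result = []
    # zip(*aligned) iterates over the columns of the alignment matrix.
    for column in zip(*aligned):
        residue, _count = Counter(column).most_common(1)[0]
        result.append(residue)
    return "".join(result)


# Hypothetical three-row DNA alignment with one gap.
alignment = ["ACG-T", "ACGAT", "TCGAT"]
consensus_row = consensus(alignment)
```
+
+A sequence logo goes one step further by scaling each letter according to how conserved its column is, rather than reporting only the winning character.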
+
+ 
+Circular multiple sequence alignment – A common assumption of multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be totally arbitrary. For instance, when linearizing a circular molecular structure, the start of the sequence is selected randomly. This is relevant, for instance, in the process of multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes, which have a circular molecular structure. 
+
+Spiral multiple sequence alignment – Color is used to display information about the properties of the individual sequence elements. There can also be gaps that make the sequences fit better among themselves. In summary, the topology of the spiral sequence alignment is equivalent to a standard linear matrix, with the advantage that it summarizes very long sequences in a practical way. That means that each individual spiral represents one of the sequences being aligned.
+
+3D visualization – A common, one-dimensional, representation of a protein sequence is a list of the amino acids that form it. However, 3-dimensional alignment displays the way sequences may match each other. The 1D-3D Group Alignment Viewer, from the RCSB Protein Data Bank, supports exploration of multiple sequence alignments (MSA) at sequence and structure levels for PDB experimental structures and Computed Structure Models (CSMs). It is possible to select proteins and/or residue regions from the MSA to view their 3D structures aligned.
+RCSB.org clusters protein entities (PDB experimental structures and CSMs) by sequence identity threshold and UniProt accession. For each cluster, the MSA is calculated using Clustal Omega and displayed in the 1D-3D Group Alignment Viewer using specific color schemes. PDB protein sequence positions are represented in blue if the residue was experimentally determined, and in gray if not. CSMs are colored according to their local pLDDT scores.
+
+== Phylogenies ==
+A phylogenetic tree is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. It is a visual representation that shows the evolutionary history between a set of species or taxa during a specific time.
+Two things are implicitly occurring along the branches of a phylogenetic tree. The first is the passage of time. Deeper nodes are older than the shallower nodes to which they are connected. Thus, deeper nodes indicate both more distant relationships among the terminal taxa that they connect, and a greater age for the most recent common ancestor of those taxa. The second thing is evolutionary modification, or the accumulation of hereditary genetic and/or structural changes along these branches. The term "branch length" typically refers to the number of these changes. If the "branch lengths" of the tree measure these changes, we also call the tree a phylogram.
+Regular phylogenetic tree – Generally called a dendrogram, it is a diagram with straight lines representing a tree. It would show a column of nodes representing individual taxa, and the remaining nodes represent the clusters to which the data belong, with the arrows representing the distance: a way to measure how different they are (dissimilarity). The distance between merged clusters is monotone, increasing with the level of the merger: the height of each node in the plot is proportional to the value of the intergroup dissimilarity between its two branches.
+
+Cladogram – It is also a diagram with straight lines representing a tree. The difference between a cladogram and an evolutionary tree is that the cladogram does not show how ancestors are related to descendants, nor does it show how much they have changed. This means that more than one evolutionary tree may correspond to the same cladogram.
+
+ 
+Circular phylogenetic tree – Circular trees are often used to illustrate relationships among members of major groups of extant organisms, and these trees may have many terminal taxa. It might seem counterintuitive, but a circular phylogenetic tree conveys the same information as a regular phylogenetic tree. The topology of the structure remains the same; only the shape changes, to fit more information into less space.
+
+3D Visualization – In a phylogram, the evolutionary distance is represented on one of the axes and the genes on the other. For it to be possible to visualize the paralogs, a third axis can be added. In standard (2D) phylogeny layout it is not always easy to distinguish gene duplication events (paralogs) from speciation branching (species), because only one spatial axis (genes) is available to show the mix of these two kinds of information. By contrast, they can be easily distinguished in 3DPE, because it projects them onto two orthogonal axes: species (X) vs. paralogs (Z). For instance, the evolution of many paralogs is visually obvious in the 3DPE view (in the three eukaryote species, on the right), but this pattern is less clear in the 2D representation.
+
+== External links ==
+
+=== Related conferences ===
+BioVis: Symposium on Biological Data Visualization
+Applications of Information Visualization in Bioinformatics
+CIBDV: Computational Intelligence for Biological Data Visualization
+IVBI: Information Visualization in Biomedical Informatics Symposium
+VMLS: Visualization in Medicine & Life Sciences
+VIZBI: Workshop on Visualizing Biological Data
--- a/data/en.wikipedia.org/wiki/Biological_network-0.md
+++ b/data/en.wikipedia.org/wiki/Biological_network-0.md
@ -0,0 +1,36 @@
+---
+title: "Biological network"
+chunk: 1/5
+source: "https://en.wikipedia.org/wiki/Biological_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:40.853976+00:00"
+instance: "kb-cron"
+---
+
+A biological network is a method of representing systems as complex sets of binary interactions or relations between various biological entities. In general, networks or graphs are used to capture relationships between entities or objects. A network can be represented as an N×N matrix, where N is the number of nodes and whose entries indicate whether two nodes share an edge. Typically, this matrix has entries of 0 or 1, with 1 denoting an edge. Optionally, a weighted graph can be used, in which a weight assigned to each edge shows how relevant the connection between two nodes is.
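+The matrix representation can be sketched in a few lines. The example below is illustrative only: the node labels, edge list, and function name are assumptions, and the graph is taken to be undirected, so the matrix is symmetric.
+
```python
def adjacency_matrix(nodes, edges, weighted=False):
    """Build an N x N matrix for an undirected graph.

    edges holds (u, v) pairs, or (u, v, weight) triples when weighted=True.
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for edge in edges:
        i, j = index[edge[0]], index[edge[1]]
        value = edge[2] if weighted else 1
        # Undirected: record the edge in both directions.
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix


nodes = ["A", "B", "C"]
binary = adjacency_matrix(nodes, [("A", "B"), ("B", "C")])
weights = adjacency_matrix(
    nodes, [("A", "B", 0.9), ("B", "C", 0.2)], weighted=True
)
```
+
+Here `binary` contains only 0s and 1s, while `weights` stores the hypothetical relevance of each connection in place of the 1s.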
+
+== History of networks ==
+
+As early as 1736 Leonhard Euler analyzed a real-world issue known as the Seven Bridges of Königsberg, which established the foundation of graph theory. From the 1930s to the 1950s, the study of random graphs was developed. During the mid-1990s, it was discovered that many different types of "real" networks have structural properties quite different from random networks. In the late 2000s, scale-free and small-world networks began shaping the emergence of systems biology, network biology, and network medicine. In 2014, graph theoretical methods were used by Frank Emmert-Streib to analyze biological networks.
+In the 1980s, researchers started viewing DNA or genomes as the dynamic storage of a language system with precise computable finite states represented as a finite-state machine. Recent complex systems research has also suggested some far-reaching commonality in the organization of information in problems from biology, computer science, and physics.
+
+== Networks in biology ==
+
+=== Protein–protein interaction networks ===
+
+Protein-protein interaction networks (PINs) represent the physical relationship among proteins present in a cell, where proteins are nodes, and their interactions are undirected edges. Due to their undirected nature, it is difficult to identify all the proteins involved in an interaction. Protein–protein interactions (PPIs) are essential to the cellular processes and also the most intensely analyzed networks in biology. PPIs could be discovered by various experimental techniques, among which the yeast two-hybrid system is a commonly used technique for the study of binary interactions. Recently, high-throughput studies using mass spectrometry have identified large sets of protein interactions.
+Many international efforts have resulted in databases that catalog experimentally determined protein-protein interactions. Some of them are the Human Protein Reference Database, Database of Interacting Proteins, the Molecular Interaction Database (MINT), IntAct, and BioGRID. At the same time, multiple computational approaches have been proposed to predict interactions. FunCoup and STRING are examples of such databases, where protein-protein interactions inferred from multiple lines of evidence are gathered and made available for public use.
+Recent studies have indicated the conservation of molecular networks through deep evolutionary time. Moreover, it has been discovered that proteins with high degrees of connectedness are more likely to be essential for survival than proteins with lesser degrees. This observation suggests that the overall composition of the network (not simply interactions between protein pairs) is vital for an organism's overall functioning.
+
+=== Gene regulatory networks (DNA–protein interaction networks) ===
+
+The genome encodes thousands of genes whose products (mRNAs, proteins) are crucial to the various processes of life, such as cell differentiation, cell survival, and metabolism. Genes produce such products through a process called transcription, which is regulated by a class of proteins called transcription factors. For instance, the human genome encodes almost 1,500 DNA-binding transcription factors that regulate the expression of more than 20,000 human genes. The complete set of gene products and the interactions among them constitutes gene regulatory networks (GRN). GRNs regulate the levels of gene products within the cell and, in turn, the cellular processes.
+GRNs are represented with genes and transcriptional factors as nodes and the relationship between them as edges. These edges are directional, representing the regulatory relationship between the two ends of the edge. For example, the directed edge from gene A to gene B indicates that A regulates the expression of B. Thus, these directional edges can not only represent the promotion of gene regulation but also its inhibition.
+GRNs are usually constructed by utilizing the gene regulation knowledge available from databases such as Reactome and KEGG. High-throughput measurement technologies, such as microarray, RNA-Seq, ChIP-chip, and ChIP-seq, enabled the accumulation of large-scale transcriptomics data, which could help in understanding the complex gene regulation patterns.
+
+=== Gene co-expression networks (transcript–transcript association networks) ===
+
+Gene co-expression networks can be perceived as association networks between variables that measure transcript abundances. These networks have been used to provide a systems-biology analysis of DNA microarray data, RNA-seq data, miRNA data, etc. Weighted gene co-expression network analysis is extensively used to identify co-expression modules and intramodular hub genes. Co-expression modules may correspond to cell types or pathways, while highly connected intramodular hubs can be interpreted as representatives of their respective modules.
+
+=== DNA-DNA chromatin networks ===
--- a/data/en.wikipedia.org/wiki/Biological_network-1.md
+++ b/data/en.wikipedia.org/wiki/Biological_network-1.md
@ -0,0 +1,33 @@
+---
+title: "Biological network"
+chunk: 2/5
+source: "https://en.wikipedia.org/wiki/Biological_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:40.853976+00:00"
+instance: "kb-cron"
+---
+
+Within a nucleus, DNA is constantly in motion. Perpetual actions such as genome folding and Cohesin extrusion morph the shape of a genome in real time. The spatial location of strands of chromatin relative to each other plays an important role in the activation or suppression of certain genes. DNA-DNA Chromatin Networks help biologists to understand these interactions by analyzing commonalities amongst different loci. The size of a network can vary significantly, from a few genes to several thousand and thus network analysis can provide vital support in understanding relationships among different areas of the genome. As an example, analysis of spatially similar loci within the organization in a nucleus with Genome Architecture Mapping (GAM) can be used to construct a network of loci with edges representing highly linked genomic regions.
+In such networks, edge weights often correspond to the frequency or strength of interaction between loci, while network construction may involve filtering or thresholding to retain only strong interactions. Examples include filtering out certain gene locations, filtering based on quartile of closeness, or filtering by expression. Such filtering can reduce noise and highlight biologically meaningful relationships for interpretation.
+The first graphic portrays the layout of the Hist1 region of the mm9 mouse genome, a large cluster of genes that encode replication-dependent histones. The organization of the histone genes in this cluster has been found to be practically identical to that of the human Hist1 region. The data used to develop this network graph were obtained through GAM. Each node on the graph represents a genomic locus within the mouse genome. The edges between the nodes represent a linkage disequilibrium between the connected nodes greater than the average across all 81 genomic windows. The initial locations of the nodes within the graphic were randomly selected, but the methodology of choosing edges shaped the graph into a rudimentary graphical representation of the placement of genomic loci throughout the Hist1 region.
+Highly connected nodes in such chromatin interaction networks can be interpreted as hubs, and may be used to define communities of loci that interact more frequently with one another. These community structures reflect the modular organization commonly observed in biological and regulatory networks. In hub-based approaches, nodes are assigned to the community of the hub with which they share the strongest interaction, often with constraints to ensure that each node belongs to only one community.
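+One way to picture the hub-based assignment is the sketch below, which places every non-hub node in the community of the hub it interacts with most strongly. The function name, the dict-of-dicts weight layout, and the toy weights are assumptions for illustration; real analyses may choose hubs and break ties differently.
+
```python
def assign_to_hubs(weights, hubs):
    """Assign each non-hub node to exactly one hub community.

    weights[u][v] is the interaction strength between loci u and v;
    a missing entry is treated as no interaction.
    """
    communities = {hub: [hub] for hub in hubs}
    for node in weights:
        if node in communities:
            continue  # hubs seed their own community
        # Pick the hub with the strongest interaction (first hub wins ties).
        best = max(hubs, key=lambda hub: weights[node].get(hub, 0.0))
        if weights[node].get(best, 0.0) > 0:
            communities[best].append(node)
    return communities


# Hypothetical interaction strengths between two hubs and three loci.
weights = {
    "h1": {"a": 0.9, "b": 0.2},
    "h2": {"b": 0.7, "c": 0.5},
    "a": {"h1": 0.9},
    "b": {"h1": 0.2, "h2": 0.7},
    "c": {"h2": 0.5},
}
communities = assign_to_hubs(weights, ["h1", "h2"])
```
+
+Because each node is appended to a single list, the one-community-per-node constraint holds by construction; locus `b` interacts with both hubs but lands only with `h2`, its stronger partner.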
+Such network representations are closely related to heat map visualizations, where interaction data are displayed as a matrix (adjacency matrix) in which each cell represents the interaction strength between two loci. Patterns observed in heat maps, such as dense blocks of high interaction, often correspond to communities identified in the network representation. These approaches enable combination of graph-based and matrix-based analyses of chromatin organization. This type of comparison can be seen in the graphics below where the heat map and network visualizations can be compared in such a manner.
+
+=== Metabolic networks ===
+
+Cells break down the food and nutrients into small molecules necessary for cellular processing through a series of biochemical reactions. These biochemical reactions are catalyzed by enzymes. The complete set of all these biochemical reactions in all the pathways represents the metabolic network. Within the metabolic network, the small molecules take the roles of nodes, and they could be either carbohydrates, lipids, or amino acids. The reactions which convert these small molecules from one form to another are represented as edges. It is possible to use network analyses to infer how selection acts on metabolic pathways.
+
+=== Signaling networks ===
+
+Signals are transduced within cells or between cells and thus form complex signaling networks, which play a key role in tissue structure. For instance, the MAPK/ERK pathway is transduced from the cell surface to the cell nucleus by a series of protein-protein interactions, phosphorylation reactions, and other events. Signaling networks typically integrate protein–protein interaction networks, gene regulatory networks, and metabolic networks. Single-cell sequencing technologies allow the extraction of intercellular signaling; an example is NicheNet, which models intercellular communication by linking ligands to target genes.
+
+=== Neuronal networks ===
+
+The complex interactions in the brain make it a perfect candidate to apply network theory. Neurons in the brain are deeply connected with one another, and this results in complex networks being present in the structural and functional aspects of the brain. For instance, small-world network properties have been demonstrated in connections between cortical regions of the primate brain or during swallowing in humans. This suggests that cortical areas of the brain are not directly interacting with each other, but most areas can be reached from all others through only a few interactions.
+
+=== Food webs ===
+
+All organisms are connected through feeding interactions. If a species eats or is eaten by another species, they are connected in an intricate food web of predator and prey interactions. The stability of these interactions has been a long-standing question in ecology. That is to say if certain individuals are removed, what happens to the network (i.e., does it collapse or adapt)? Network analysis can be used to explore food web stability and determine if certain network properties result in more stable networks. Moreover, network analysis can be used to determine how selective removals of species will influence the food web as a whole. This is especially important considering the potential species loss due to global climate change.
+
+=== Network Medicine ===
--- a/data/en.wikipedia.org/wiki/Biological_network-2.md
+++ b/data/en.wikipedia.org/wiki/Biological_network-2.md
@ -0,0 +1,14 @@
+---
+title: "Biological network"
+chunk: 3/5
+source: "https://en.wikipedia.org/wiki/Biological_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:40.853976+00:00"
+instance: "kb-cron"
+---
+
+Network medicine is an emerging field that applies network principles to understand the molecular basis of human disease. Instead of focusing on single genes or proteins, network medicine examines how diseases arise from small changes in complex biological networks, including protein-protein interaction networks, gene regulatory networks, and metabolic pathways. Within this framework, diseases are associated with specific “disease modules,” defined as groups of interconnected components whose collective dysfunction contributes to a pathological state. This network-based perspective enables the identification of disease-associated genes, the analysis of relationships between different diseases, and the development of therapeutic strategies that target multiple components of a biological system rather than a single molecule. Network medicine also supports approaches such as drug repurposing and the integration of large-scale omics data, providing a systems-level complement to traditional reductionist methods and contributing to advances in precision medicine.
+
+=== Between-species interaction networks ===
+In biology, pairwise interactions have historically been the focus of intense study. With the recent advances in network science, it has become possible to scale up pairwise interactions to include individuals of many species involved in many sets of interactions to understand the structure and function of larger ecological networks. The use of network analysis can allow for both the discovery and understanding of how these complex interactions link together within the system's network, a property that has previously been overlooked. This powerful tool allows for the study of various types of interactions (from competitive to cooperative) using the same general framework. For example, plant-pollinator interactions are mutually beneficial and often involve many different species of pollinators as well as many different species of plants. These interactions are critical to plant reproduction and thus the accumulation of resources at the base of the food chain for primary consumers, yet these interaction networks are threatened by anthropogenic change. The use of network analysis can illuminate how pollination networks work and may, in turn, inform conservation efforts. Within pollination networks, nestedness (i.e., specialists interact with a subset of species that generalists interact with), redundancy (i.e., most plants are pollinated by many pollinators), and modularity play a large role in network stability. These network properties may actually work to slow the spread of disturbance effects through the system and potentially buffer the pollination network from anthropogenic changes somewhat. More generally, the structure of species interactions within an ecological network can tell us something about the diversity, richness, and robustness of the network. Researchers can even compare current constructions of species interactions networks with historical reconstructions of ancient networks to determine how networks have changed over time. 
+Much research into these complex species interactions networks is highly concerned with understanding what factors (e.g., species richness, connectance, nature of the physical environment) lead to network stability.
--- a/data/en.wikipedia.org/wiki/Biological_network-3.md
+++ b/data/en.wikipedia.org/wiki/Biological_network-3.md
@ -0,0 +1,30 @@
+---
+title: "Biological network"
+chunk: 4/5
+source: "https://en.wikipedia.org/wiki/Biological_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:40.853976+00:00"
+instance: "kb-cron"
+---
+
+=== Within-species interaction networks ===
+Network analysis provides the ability to quantify associations between individuals, which makes it possible to infer details about the network as a whole at the species and/or population level. One of the most attractive features of the network paradigm would be that it provides a single conceptual framework in which the social organization of animals at all levels (individual, dyad, group, population) and for all types of interaction (aggressive, cooperative, sexual, etc.) can be studied.
+Researchers interested in ethology across many taxa, from insects to primates, are starting to incorporate network analysis into their research. Researchers interested in social insects (e.g., ants and bees) have used network analyses to better understand the division of labor, task allocation, and foraging optimization within colonies. Other researchers are interested in how specific network properties at the group and/or population level can explain individual-level behaviors. Studies have demonstrated how animal social network structure can be influenced by factors ranging from characteristics of the environment to characteristics of the individual, such as developmental experience and personality. At the level of the individual, the patterning of social connections can be an important determinant of fitness, predicting both survival and reproductive success. At the population level, network structure can influence the patterning of ecological and evolutionary processes, such as frequency-dependent selection and disease and information transmission. For instance, a study on wire-tailed manakins (a small passerine bird) found that a male's degree in the network largely predicted the ability of the male to rise in the social hierarchy (i.e., eventually obtain a territory and matings). In bottlenose dolphin groups, an individual's degree and betweenness centrality values may predict whether or not that individual will exhibit certain behaviors, like the use of side flopping and upside-down lobtailing to lead group traveling efforts; individuals with high betweenness values are more connected and can obtain more information, and thus are better suited to lead group travel and therefore tend to exhibit these signaling behaviors more than other group members.
+Social network analysis can also be used to describe the social organization within a species more generally, which frequently reveals important proximate mechanisms promoting the use of certain behavioral strategies. These descriptions are frequently linked to ecological properties (e.g., resource distribution). For example, network analyses revealed subtle differences in the group dynamics of two related equid fission-fusion species, Grevy's zebra and onagers, living in variable environments; Grevy's zebras show distinct preferences in their association choices when they fission into smaller groups, whereas onagers do not. Similarly, researchers interested in primates have also utilized network analyses to compare social organizations across the diverse primate order, suggesting that using network measures (such as centrality, assortativity, modularity, and betweenness) may be useful in terms of explaining the types of social behaviors we see within certain groups and not others.
+Finally, social network analysis can also reveal important fluctuations in animal behaviors across changing environments. For example, network analyses in female chacma baboons (Papio hamadryas ursinus) revealed important dynamic changes across seasons that were previously unknown; instead of creating stable, long-lasting social bonds with friends, baboons were found to exhibit more variable relationships which were dependent on short-term contingencies related to group-level dynamics as well as environmental variability. Changes in an individual's social network environment can also influence characteristics such as 'personality': for example, social spiders that huddle with bolder neighbors also tend to increase in boldness. This is a very small set of broad examples of how researchers can use network analysis to study animal behavior. Research in this area is currently expanding very rapidly, especially since animal-borne tags and computer vision can now be used to automate the collection of social associations. Social network analysis is a valuable tool for studying animal behavior across all animal species and has the potential to uncover new information about animal behavior and social ecology that was previously poorly understood.
+
+== Modelling biological networks ==
+
+=== Introduction ===
+To draw useful information from a biological network, an understanding of the statistical and mathematical techniques for identifying relationships within a network is vital. Procedures to identify association, communities, and centrality among the nodes of a biological network can provide insight into the relationships between whatever the nodes represent, whether they are genes, species, or other entities. Formulation of these methods transcends disciplines and relies heavily on graph theory, computer science, and bioinformatics.
+
+=== Association ===
+
+There are many different ways to measure the relationships between nodes when analyzing a network. In many cases, the measure used to find nodes that share similarity within a network is specific to the application for which it is being used. One type of measure that biologists utilize is correlation, which specifically centers on the linear relationship between two variables. As an example, weighted gene co-expression network analysis uses Pearson correlation to analyze linked gene expression and understand genetics at a systems level. Another measure of correlation is linkage disequilibrium, which describes the non-random association of genetic sequences among loci in a given chromosome. An example of its use is in detecting relationships in GAM data across genomic intervals based upon detection frequencies of certain loci.
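As a minimal sketch of correlation-based association, the following Python computes the Pearson correlation between expression profiles and links genes whose absolute correlation reaches a cutoff. The gene names, expression values, and the 0.8 cutoff are invented for illustration; weighted gene co-expression network analysis itself applies further steps (such as soft thresholding) that are not shown here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coexpression_edges(expression, threshold=0.8):
    """Return (gene, gene, r) for every pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical expression levels of three genes across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises and falls together with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to the others
}
edges = coexpression_edges(expression)
```

With these made-up values only the geneA and geneB pair passes the cutoff, so the resulting network has a single edge.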
+
+=== Centrality ===
+The concept of centrality can be extremely useful when analyzing biological network structures. There are many different methods to measure centrality, such as degree, betweenness, closeness, eigenvector, and Katz centrality. Each type of centrality can provide different insights into the nodes of a particular network; however, they all measure the prominence of a node in a network.
+In 2005, researchers at Harvard Medical School applied centrality measures to the yeast protein interaction network. They found that proteins with high betweenness centrality were more essential, and that betweenness corresponded closely to a given protein's evolutionary age.
+These differing centrality measures reflect distinct structural roles of nodes within biological networks, including protein–protein interaction networks, gene regulatory networks, and metabolic networks.
+Degree centrality measures how many direct connections a node has. It is defined as $C_D(i) = \frac{k_i}{n - 1}$,
--- a/data/en.wikipedia.org/wiki/Biological_network-4.md
+++ b/data/en.wikipedia.org/wiki/Biological_network-4.md
@ -0,0 +1,59 @@
+---
+title: "Biological network"
+chunk: 5/5
+source: "https://en.wikipedia.org/wiki/Biological_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:40.853976+00:00"
+instance: "kb-cron"
+---
+
+where ki is the number of nodes directly connected to node i, and n is the total number of nodes in the network. In biological networks, nodes with high degree can be referred to as hubs and are associated with proteins or genes that participate in many interactions, contributing to core cellular functions.
+Betweenness centrality measures the extent to which a node lies on shortest paths between other nodes. It is defined as
+$$C_B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}$$
+where σst is the total number of shortest paths between nodes s and t, and σst(i) is the number of those paths that pass through node i. Nodes with high betweenness centrality can connect different regions of a network and facilitate interactions between them.
+Closeness centrality is based on the average shortest path distance from a node to all other nodes in the network. It is defined as
+$$C_C(i) = \frac{n - 1}{\sum_{j \neq i} d(i,j)}$$
+where d(i,j) is the shortest path distance between nodes i and j, and n is the total number of nodes. Nodes with high closeness centrality occupy central positions within the network and can interact with other nodes through relatively short paths, allowing efficient communication across the network.
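A minimal sketch of degree, closeness, and betweenness centrality for a small unweighted, undirected, connected graph. It assumes common normalisations (degree divided by n - 1, closeness as n - 1 over the summed shortest-path distances) and counts each unordered pair of endpoints once for betweenness; the node names are invented and do not refer to real proteins.

```python
from collections import deque
from itertools import combinations

def bfs(graph, source):
    """Shortest-path distances and shortest-path counts from source."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                paths[v] += paths[u]
    return dist, paths

def centralities(graph):
    """Degree, closeness and betweenness for a connected undirected graph."""
    n = len(graph)
    info = {v: bfs(graph, v) for v in graph}
    degree = {v: len(graph[v]) / (n - 1) for v in graph}
    closeness = {v: (n - 1) / sum(info[v][0].values()) for v in graph}
    betweenness = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        for i in graph:
            if i in (s, t):
                continue
            dist_i, paths_i = info[i]
            # i lies on a shortest s-t path when the distances add up.
            if dist_s[i] + dist_i[t] == dist_s[t]:
                betweenness[i] += paths_s[i] * paths_i[t] / paths_s[t]
    return degree, closeness, betweenness

# Hypothetical interaction network with a hub "B".
graph = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["D"],
}
degree, closeness, betweenness = centralities(graph)
```

In this toy graph the hub "B" scores highest on all three measures, while "D" has a lower degree but still a non-zero betweenness because it is the only route to "E".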
+Eigenvector centrality assigns scores to nodes based on the centrality of their neighbors. It is defined as
+$$C_E(i) = \frac{1}{\lambda} \sum_{j} A_{ij} \, C_E(j)$$
+where Aij is the adjacency matrix (1 if nodes i and j are connected, 0 otherwise), CE(j) is the centrality of neighbor j, and λ is a constant (the largest eigenvalue of the adjacency matrix). In biological networks, this measure identifies nodes that are connected to other highly connected or influential nodes and is used to detect key regulators within complex systems.
+Katz centrality extends eigenvector centrality by incorporating both direct and indirect connections, with reduced influence assigned to longer paths. It is defined as
+$$C_K(i) = \alpha \sum_{j} A_{ij} \, C_K(j) + \beta$$
+where Aij indicates whether node i is connected to node j (1 if connected, 0 otherwise), CK(j) is the Katz centrality of node j, α is an attenuation factor that controls how much influence more distant nodes have (smaller values give less influence), and β is a small baseline score given to every node. This measure accounts for the cumulative influence of a node across multiple steps in a network, which is relevant in multi-step biological processes.
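Eigenvector and Katz centrality are both defined recursively, so one simple way to compute them is by repeated substitution. The sketch below is an illustration under stated assumptions rather than a reference implementation: the adjacency matrix is invented, the eigenvector scores are rescaled so the largest is 1, the power iteration is applied to A + I (same eigenvectors, but it avoids oscillation on bipartite graphs), and the Katz iteration only converges when α is smaller than the reciprocal of the largest eigenvalue of A.

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I), rescaled so the largest score is 1."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        nxt = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(nxt)
        x = [value / norm for value in nxt]
    return x

def katz_centrality(adj, alpha=0.1, beta=1.0, iterations=200):
    """Iterate C_K(i) = alpha * sum_j A_ij * C_K(j) + beta to a fixed point.

    Assumes alpha < 1 / (largest eigenvalue of A); otherwise it diverges.
    """
    n = len(adj)
    x = [beta] * n
    for _ in range(iterations):
        x = [alpha * sum(adj[i][j] * x[j] for j in range(n)) + beta
             for i in range(n)]
    return x

# Hypothetical undirected network: a triangle 0-1-2 with node 3 attached to 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
eig = eigenvector_centrality(adj)
katz = katz_centrality(adj, alpha=0.1, beta=1.0)
```

Both measures rank node 0 first and the pendant node 3 last here, but Katz centrality still gives node 3 a score above β because of its baseline term and its single link to the hub.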
+These centrality measures provide complementary approaches for analyzing the structure and organization of biological networks.
+
+=== Communities ===
+
+Studying the community structure of a network by subdividing its nodes into regions of similar nodes can be an integral tool for bioinformatics when exploring data as a network. A food web of the Secaucus High School Marsh exemplifies the benefits of grouping, as the relationships between nodes are far easier to analyze with well-made communities. While the first graphic is hard to interpret, the second provides a better view of the pockets of highly connected feeding relationships that would be expected in a food web. Community detection is still an active research problem. Scientists and graph theorists continuously discover new ways of partitioning networks, and thus a plethora of different algorithms exist for creating these groupings. Like many other tools that biologists utilize to understand data with network models, every algorithm can provide its own unique insight, and algorithms may vary widely in aspects such as accuracy or time complexity.
+In 2002, a food web of marine mammals in the Chesapeake Bay was divided into communities by biologists using a community detection algorithm based on neighbors of nodes with high degree centrality. The resulting communities displayed a sizable split between pelagic and benthic organisms. Two very common community detection algorithms for biological networks are the Louvain method and the Leiden algorithm.
+The Louvain method is a greedy algorithm that attempts to maximize modularity, a measure that favors heavy edges within communities and sparse edges between them. The algorithm starts with each node in its own community and iteratively moves nodes into the neighboring community that yields a higher modularity. Once no modularity increase can occur by joining nodes to a community, a new weighted network is constructed with communities as nodes, edges representing between-community edges, and loops representing edges within a community. The process continues until no increase in modularity occurs. While the Louvain method provides good community detection, it is limited in a few ways. By focusing mainly on maximizing a given measure of modularity, it may produce badly connected communities, degrading the model for the sake of the modularity metric; however, the Louvain method performs reasonably well and is easy to understand compared to many other community detection algorithms.
+The Leiden algorithm expands on the Louvain method with a number of improvements. When joining nodes to a community, only neighborhoods that have recently changed are considered, which greatly improves the speed of merging nodes. Another improvement is in the refinement phase, in which the algorithm randomly chooses which community, from a set of candidate communities, a node merges with. This allows greater depth in choosing communities, as the Louvain method focuses solely on maximizing the chosen modularity measure. The Leiden algorithm, while more complex than the Louvain method, performs faster with better community detection and can be a valuable tool for identifying groups.
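Both algorithms are driven by a modularity score, so it can help to see how that score is evaluated for one fixed partition. The sketch below is not an implementation of Louvain or Leiden; it assumes the standard Newman-Girvan modularity for an unweighted, undirected graph, and the edge list and community labels are invented.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of L_c / m - (d_c / (2 m)) ** 2, where m is
    the number of edges, L_c the number of edges inside c, and d_c the
    summed degree of the nodes in c.
    """
    m = len(edges)
    inside = {}       # community -> number of internal edges
    degree_sum = {}   # community -> total degree of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Hypothetical network: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}
singletons = {node: node for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
q_single = modularity(edges, singletons)
```

Splitting the graph into its two triangles scores higher than either placing every node in one community (which scores zero) or leaving every node alone (which scores below zero), which is the kind of comparison a greedy modularity optimiser makes at each step.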
+
+=== Network Motifs ===
+Network motifs, or statistically significant recurring interaction patterns within a network, are a commonly used tool to understand biological networks. A major use case of network motifs is in neurophysiology, where motif analysis is commonly used to understand interconnected neuronal functions at varying scales. As an example, in 2017, researchers at Beijing Normal University analyzed highly represented two- and three-node network motifs in directed functional brain networks constructed from resting-state fMRI data to study the basic mechanisms of brain information flow.
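The counting step of a three-node motif analysis can be sketched as follows. This illustration only counts two well-known directed patterns (feed-forward loops and three-node cycles) by brute force; deciding whether a pattern is statistically significant additionally requires comparing the counts with those in randomized networks, which is not shown. The edge list is invented.

```python
from itertools import permutations

def count_three_node_motifs(edges):
    """Count feed-forward loops and directed 3-cycles in a directed graph."""
    edge_set = set(edges)
    nodes = sorted({node for edge in edges for node in edge})
    feed_forward = 0
    cycles = 0
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edge_set and (b, c) in edge_set:
            if (a, c) in edge_set:
                feed_forward += 1   # a -> b -> c plus the shortcut a -> c
            if (c, a) in edge_set:
                cycles += 1         # a -> b -> c -> a, found once per rotation
    # Each 3-cycle is discovered from three starting nodes.
    return {"feed_forward_loop": feed_forward, "three_cycle": cycles // 3}

# Hypothetical directed network with one feed-forward loop and one cycle.
edges = [
    ("x", "y"), ("y", "z"), ("x", "z"),   # feed-forward loop
    ("p", "q"), ("q", "r"), ("r", "p"),   # three-node cycle
    ("z", "p"),                           # link between the two patterns
]
counts = count_three_node_motifs(edges)
```

Brute-force enumeration of all ordered triples is only practical for small graphs; dedicated motif-detection tools use more efficient enumeration or sampling.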
+
+== See also ==
+List of omics topics in biology
+Biological network inference
+Biostatistics
+Cellular model
+Computational biology
+Systems biology
+Weighted correlation network analysis
+Interactome
+Network medicine
+Ecological network
+
+== External links ==
+Networkbio.org, The site of the series of Integrative Network Biology (INB) meetings. For the 2012 event also see www.networkbio.org
+Network Tools and Applications in Biology (NETTAB) workshops.
+Networkbiology.org, NetworkBiology wiki site.
+Linding Lab, Technical University of Denmark (DTU) studies Network Biology and Cellular Information Processing, and is also organizing the Denmark branch of the annual "Integrative Network Biology and Cancer" symposium series.
+NRNB.org, The National Resource for Network Biology. A US National Institute of Health (NIH) Biomedical Technology Research Center dedicated to the study of biological networks.
+Network Repository The first interactive data and network data repository with real-time visual analytics.
+Animal Social Network Repository (ASNR) The first multi-taxonomic repository that collates 790 social networks from more than 45 species, including those of mammals, reptiles, fish, birds, and insects
--- a/data/en.wikipedia.org/wiki/Biological_network_inference-0.md
+++ b/data/en.wikipedia.org/wiki/Biological_network_inference-0.md
@ -0,0 +1,49 @@
+---
+title: "Biological network inference"
+chunk: 1/3
+source: "https://en.wikipedia.org/wiki/Biological_network_inference"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:42.011989+00:00"
+instance: "kb-cron"
+---
+
+Biological network inference is the process of making inferences and predictions about biological networks. By using these networks to analyze patterns in biological systems, such as food-webs, we can visualize the nature and strength of these interactions between species, DNA, proteins, and more.
+The analysis of biological networks with respect to diseases has led to the development of the field of network medicine. Recent examples of application of network theory in biology include applications to understanding the cell cycle as well as a quantitative framework for developmental processes. Good network inference requires proper planning and execution of an experiment, thereby ensuring quality data acquisition. Optimal experimental design in principle refers to the use of statistical and/or mathematical concepts to plan for data acquisition. This must be done in such a way that the data information content is enriched, and a sufficient amount of data is collected with enough technical and biological replicates where necessary.
+
+== Steps ==
+The general cycle for modeling biological networks is as follows:
+
+Prior knowledge
+Involves a thorough literature and database search or seeking an expert's opinion.
+Model selection
+A formalism to model the system, usually an ordinary differential equation, a Boolean network, a linear regression model (e.g. least-angle regression), a Bayesian network, or an approach based on information theory. It can also be done by the application of a correlation-based inference algorithm, as will be discussed below, an approach which is having increased success as the size of the available microarray sets keeps increasing.
+Hypothesis/assumptions
+Experimental design
+Data acquisition
+Ensure that high quality data is collected with all the required variables being measured
+Network inference
+This process is mathematically rigorous and computationally costly.
+Model refinement
+Cross-check how well the results meet the expectations. The process is terminated upon obtaining a good model fit to the data; otherwise, the model needs to be re-adjusted.
+
+== Biological networks ==
+A network is a set of nodes and a set of directed or undirected edges between the nodes. Many types of biological networks exist, including transcriptional, signalling and metabolic. Few such networks are known in anything approaching their complete structure, even in the simplest bacteria. Still less is known about the parameters governing the behavior of such networks over time, how the networks at different levels in a cell interact, and how to predict the complete state description of a eukaryotic cell or bacterial organism at a given point in the future. Systems biology, in this sense, is still in its infancy.
+There is great interest in network medicine for the modelling of biological systems. This article focuses on inference of biological network structure using the growing sets of high-throughput expression data for genes, proteins, and metabolites. Briefly, methods using high-throughput data for inference of regulatory networks rely on searching for patterns of partial correlation or conditional probabilities that indicate causal influence. Such patterns of partial correlations found in the high-throughput data, possibly combined with other supplemental data on the genes or proteins in the proposed networks, or combined with other information on the organism, form the basis upon which such algorithms work. Such algorithms can be of use in inferring the topology of any network where the change in state of one node can affect the state of other nodes.
+
+== Transcriptional regulatory networks ==
+Genes are the nodes and the edges are directed. A gene serves as the source of a direct regulatory edge to a target gene by producing an RNA or protein molecule that functions as a transcriptional activator or inhibitor of the target gene. If the gene is an activator, then it is the source of a positive regulatory connection; if an inhibitor, then it is the source of a negative regulatory connection. Computational algorithms take as primary input data measurements of mRNA expression levels of the genes under consideration for inclusion in the network, returning an estimate of the network topology. Such algorithms are typically based on linearity, independence or normality assumptions, which must be verified on a case-by-case basis. Clustering or some form of statistical classification is typically employed to perform an initial organization of the high-throughput mRNA expression values derived from microarray experiments, in particular to select sets of genes as candidates for network nodes. The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification – for example, to classify subtypes of cancer, or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, that is, to more precisely define the influence of each gene on the others, the scientist typically attempts to reconstruct the transcriptional regulatory network.
+
+== Gene co-expression networks ==
+
+A gene co-expression network is an undirected graph, where each node corresponds to a gene, and a pair of nodes is connected with an edge if there is a significant co-expression relationship between them.
+
+== Signal transduction ==
+
+Signal transduction networks use proteins for the nodes and directed edges to represent interactions in which the biochemical conformation of the child is modified by the action of the parent (e.g. mediated by phosphorylation, ubiquitylation, methylation, etc.). Primary input into the inference algorithm would be data from a set of experiments measuring protein activation / inactivation (e.g., phosphorylation / dephosphorylation) across a set of proteins. Inference for such signalling networks is complicated by the fact that total concentrations of signalling proteins will fluctuate over time due to transcriptional and translational regulation. Such variation can lead to statistical confounding. Accordingly, more sophisticated statistical techniques must be applied to analyse such datasets (this is very important in the biology of cancer).
+
+== Metabolic network ==
+
+Metabolite networks use nodes to represent chemical reactions and directed edges for the metabolic pathways and regulatory interactions that guide these reactions. Primary input into an algorithm would be data from a set of experiments measuring metabolite levels.
+
+== Protein-protein interaction networks ==
--- a/data/en.wikipedia.org/wiki/Biological_network_inference-1.md
+++ b/data/en.wikipedia.org/wiki/Biological_network_inference-1.md
@ -0,0 +1,63 @@
+---
+title: "Biological network inference"
+chunk: 2/3
+source: "https://en.wikipedia.org/wiki/Biological_network_inference"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:42.011989+00:00"
+instance: "kb-cron"
+---
+
+One of the most intensely studied networks in biology, protein-protein interaction networks (PINs) visualize the physical relationships between proteins inside a cell. In a PIN, proteins are the nodes and their interactions are the undirected edges. PINs can be discovered with a variety of methods, including two-hybrid screening, in vitro methods such as co-immunoprecipitation and blue native gel electrophoresis, and more.
+
+== Neuronal network ==
+
+A neuronal network represents neurons as nodes and synapses as edges, which are typically weighted and directed. The weights of edges are usually adjusted by the activation of connected nodes. The network is usually organized into input layers, hidden layers, and output layers.
+
+== Food webs ==
+
+A food web is an interconnected directed graph of what eats what in an ecosystem. The members of the ecosystem are the nodes, and if a member eats another member then there is a directed edge between those two nodes.
+
+== Within species and between species interaction networks ==
+These networks are defined by a set of pairwise interactions between and within a species that is used to understand the structure and function of larger ecological networks. By using network analysis we can discover and understand how these interactions link together within the system's network. It also allows us to quantify associations between individuals, which makes it possible to infer details about the network as a whole at the species and/or population level.
+
+== DNA-DNA chromatin networks ==
+
+DNA-DNA chromatin networks are used to clarify the activation or suppression of genes via the relative location of strands of chromatin. These interactions can be understood by analyzing commonalities amongst different loci (a locus is a fixed position on a chromosome where a particular gene or genetic marker is located). Network analysis can provide vital support in understanding relationships among different areas of the genome.
+
+== Gene regulatory networks ==
+
+A gene regulatory network is a set of molecular regulators that interact with each other and with other substances in the cell. The regulators can be DNA, RNA, proteins, and complexes of these. Gene regulatory networks can be modeled in numerous ways, including coupled ordinary differential equations, Boolean networks, continuous networks, and stochastic gene networks.
+
+== Network attributes ==
+
+== Data sources ==
+The initial data used to make the inference can have a huge impact on the accuracy of the final inference. Network data is inherently noisy and incomplete, sometimes because evidence from multiple sources does not overlap or is contradictory. Data can be sourced in multiple ways, including manual curation of scientific literature into databases, high-throughput datasets, computational predictions, and text mining of old scholarly articles from before the digital era.
+
+== Network diameter ==
+A network's diameter is the maximum number of steps separating any two nodes, measured along shortest paths. It can be used to determine how connected a graph is, and it is used in topology analysis and clustering analysis.
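For an unweighted, connected graph, the diameter can be found by running a breadth-first search from every node and taking the largest distance seen. The sketch below assumes an adjacency-list dictionary and raises an error for disconnected graphs, where the diameter is not finite; the example graph is invented.

```python
from collections import deque

def eccentricity(graph, source):
    """Largest shortest-path distance from source to any other node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(graph):
        raise ValueError("graph is not connected")
    return max(dist.values())

def diameter(graph):
    """Maximum eccentricity over all nodes of a connected graph."""
    return max(eccentricity(graph, node) for node in graph)

# Hypothetical network: a path a-b-c-d with an extra node e attached to b.
graph = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}
network_diameter = diameter(graph)
```

Here the two most distant pairs (a and d, e and d) are three steps apart, so the diameter is 3, while the central node b is at most two steps from everything.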
+
+== Transitivity ==
+The transitivity or clustering coefficient of a network is a measure of the tendency of the nodes to cluster together. High transitivity means that the network contains communities or groups of nodes that are densely connected internally. In biological networks, finding these communities is very important, because they can reflect functional modules and protein complexes.
+The uncertainty about the connectivity may distort the results and should be taken into account when the transitivity and other topological descriptors are computed for inferred networks.
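One common way to quantify transitivity, assumed here, is the fraction of connected triples (two edges sharing a middle node) whose end nodes are also linked, which equals three times the number of triangles divided by the number of connected triples. The sketch below uses an invented undirected graph stored as a dictionary of neighbour sets.

```python
from itertools import combinations

def transitivity(graph):
    """Fraction of connected triples whose two end nodes are also linked."""
    triples = 0
    closed = 0
    for center, neighbours in graph.items():
        # Every pair of neighbours forms a triple centred on this node.
        for u, v in combinations(sorted(neighbours), 2):
            triples += 1
            if v in graph[u]:
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical network: a triangle a-b-c with a fourth node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
score = transitivity(graph)
```

Three of the five connected triples in this graph are closed, giving 0.6; a complete triangle on its own would give 1.0.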
+
+== Network confidence ==
+Network confidence is a way to measure how sure one can be that the network represents a real biological interaction. This can be done using contextual biological information, by counting the number of times an interaction is reported in the literature, or by grouping different strategies into a single score. The MIscore method for assessing the reliability of protein-protein interaction data is based on the use of standards. MIscore gives an estimation of confidence weighting on all available evidence for an interacting pair of proteins. The method allows weighting of evidence provided by different sources, provided the data is represented following the standards created by the IMEx consortium. The weights are the number of publications, the detection method, and the interaction evidence type.
+
+== Closeness ==
+
+Closeness, a.k.a. closeness centrality, is a measure of centrality in a network and is calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph. This measure can be used to make inferences in all graph types and analysis methods.
+
+== Betweenness ==
+
+Betweenness, a.k.a. betweenness centrality, is a measure of centrality in a graph based on shortest paths. The betweenness for each node is the number of these shortest paths that pass through the node.
+
+== Network analysis methods ==
+
+For our purposes, network analysis is closely related to graph theory. By measuring the attributes in the previous section we can utilize many different techniques to create accurate inferences based on biological data.
+
+== Topology analysis ==
+Topology analysis examines the topology of a network to identify relevant participants and substructures that may be of biological significance. The term encompasses an entire class of techniques such as network motif search, centrality analysis, topological clustering, and shortest paths. These are but a few examples; each of these techniques uses the general idea of focusing on the topology of a network to make inferences.
+
+=== Network Motif Search ===
+A motif is defined as a frequent and unique sub-graph. By counting all the possible instances, listing all patterns, and testing isomorphisms, we can derive crucial information about a network. Motifs are suggested to be the basic building blocks of complex biological networks. Computational research has focused on improving existing motif detection tools to assist biological investigations and allow larger networks to be analyzed. Several different algorithms have been provided so far, which are elaborated in the next section.
--- a/data/en.wikipedia.org/wiki/Biological_network_inference-2.md
+++ b/data/en.wikipedia.org/wiki/Biological_network_inference-2.md
@ -0,0 +1,35 @@
+---
+title: "Biological network inference"
+chunk: 3/3
+source: "https://en.wikipedia.org/wiki/Biological_network_inference"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:42.011989+00:00"
+instance: "kb-cron"
+---
+
+=== Centrality Analysis ===
+Centrality gives an estimation of how important a node or edge is for the connectivity or the information flow of the network. It is a useful parameter in signalling networks and is often used when trying to find drug targets. It is most commonly used in PINs to determine important proteins and their functions. Centrality can be measured in different ways depending on the graph and the question that needs answering; these include the degree of a node (the number of edges connected to it), global centrality measures, and random walks, which are used by the Google PageRank algorithm to assign a weight to each webpage.
+Centrality measures may be affected by errors due to measurement noise and other causes. Therefore, the topological descriptors should be defined as random variables, with the associated probability distribution encoding the uncertainty in their value.
+
+=== Topological Clustering ===
+Topological clustering, or topological data analysis (TDA), provides a general framework to analyze high-dimensional, incomplete, and noisy data in a way that reduces dimensionality and provides robustness to noise. The idea is that the shape of data sets contains relevant information. When this information is a homology group, there is a mathematical interpretation that assumes that features that persist for a wide range of parameters are "true" features and features persisting for only a narrow range of parameters are noise, although the theoretical justification for this is unclear. This technique has been used for progression analysis of disease, viral evolution, propagation of contagions on networks, bacteria classification using molecular spectroscopy, and much more in and outside of biology.
+
+=== Shortest paths ===
+The shortest path problem is a common problem in graph theory that tries to find the path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized. This method can be used to determine the network diameter or redundancy in a network. There are many algorithms for this, including Dijkstra's algorithm, the Bellman–Ford algorithm, and the Floyd–Warshall algorithm.
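As an illustration of one of the algorithms named above, here is a minimal sketch of Dijkstra's algorithm using a binary heap. It assumes non-negative edge weights and a directed graph given as a dictionary of (neighbour, weight) lists; the node labels and weights are invented.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; edge weights must be non-negative."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter route to u was already found
        for v, weight in graph[u]:
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

# Hypothetical weighted, directed network.
graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
distances = dijkstra(graph, "A")
```

The direct edge from A to C costs 4, but the route through B costs only 3, so the algorithm reports the cheaper indirect path. Nodes that cannot be reached from the source are simply absent from the result.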
+
+== Clustering analysis ==
+Cluster analysis groups objects (nodes) such that objects in the same cluster are more similar to each other than to those in other clusters. This can be used to perform pattern recognition, image analysis, information retrieval, statistical data analysis, and much more. It has applications in plant and animal ecology, sequence analysis, antimicrobial activity analysis, and many other fields. Cluster analysis algorithms also come in many forms, such as hierarchical clustering, k-means clustering, distribution-based clustering, density-based clustering, and grid-based clustering.
+
+== Annotation enrichment analysis ==
+Gene annotation databases are commonly used to evaluate the functional properties of experimentally derived gene sets. Annotation Enrichment Analysis (AEA) is used to overcome biases from overlap statistical methods used to assess these associations. It does this by using gene/protein annotations to infer which annotations are over-represented in a list of genes/proteins taken from a network.
+
+== Network analysis tools ==
+
+== See also ==
+Cellular model
+Cytoscape tool
+Bayesian probability
+Network medicine
+
--- a/data/en.wikipedia.org/wiki/Biomedical_text_mining-0.md
+++ b/data/en.wikipedia.org/wiki/Biomedical_text_mining-0.md
@ -0,0 +1,54 @@
+---
+title: "Biomedical text mining"
+chunk: 1/3
+source: "https://en.wikipedia.org/wiki/Biomedical_text_mining"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:43.230057+00:00"
+instance: "kb-cron"
+---
+
+Biomedical text mining (including biomedical natural language processing or BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.
+In recent years, the scientific literature has shifted to electronic publishing but the volume of information available can be overwhelming. This revolution of publishing has caused a high demand for text mining techniques. Text mining offers information retrieval (IR) and entity recognition (ER). IR allows the retrieval of relevant papers according to the topic of interest, e.g. through PubMed. ER is practiced when certain biological terms are recognized (e.g. proteins or genes) for further processing.
+
+== Considerations ==
+Applying text mining approaches to biomedical text requires specific considerations common to the domain.
+
+=== Availability of annotated text data ===
+
+Large annotated corpora used in the development and training of general purpose text mining methods (e.g., sets of movie dialogue, product reviews, or Wikipedia article text) are not specific for biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora. Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges and biomedical informatics researchers. Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH).
+Machine learning-based methods often require very large data sets as training data to build useful models. Manual annotation of large text corpora is not realistically possible. Training data may therefore be products of weak supervision or purely statistical methods.
+
+=== Data structure variation ===
+Like other text documents, biomedical documents contain unstructured data. Research publications follow different formats, contain different types of information, and are interspersed with figures, tables, and other non-text content. Both unstructured text and semi-structured document elements, such as tables, may contain important information that should be text mined. Clinical documents may vary in structure and language between departments and locations. Other types of biomedical text, such as drug labels, may follow general structural guidelines but lack further details.
+
+=== Uncertainty ===
+Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.
+
+=== Supporting clinical needs ===
+Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians. This is a concern in environments where clinical decision support is expected to be informative and accurate. A comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases is available in the literature.
+
+=== Interoperability with clinical systems ===
+New text mining systems must work with existing standards, electronic medical records, and databases. Methods for interfacing with clinical systems such as LOINC have been developed but require extensive organizational effort to implement and maintain.
+
+=== Patient privacy ===
+Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.
+
+== Processes ==
+Specific subtasks are of particular concern when processing biomedical text.
+
+=== Named entity recognition ===
+Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes, chemical compounds and drugs, and disease names have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.
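+
+A minimal sketch of the vocabulary-supported approach described above, using dictionary lookup with regular expressions. The vocabulary entries, the entity labels, and the function name find_entities are illustrative assumptions rather than part of any specific system; practical tools typically add tokenization, normalization, and disambiguation on top of this.
+
```python
import re

# Hypothetical toy vocabulary: surface form -> entity type (illustrative only).
VOCABULARY = {
    "tp53": "GENE",
    "brca1": "GENE",
    "aspirin": "CHEMICAL",
    "breast cancer": "DISEASE",
}


def find_entities(text, vocabulary=VOCABULARY):
    """Return (start, end, matched_text, entity_type) tuples sorted by position."""
    matches = []
    for term, label in vocabulary.items():
        # Word boundaries keep a term from matching inside a longer token.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0), label))
    return sorted(matches)


if __name__ == "__main__":
    sentence = "Mutations in BRCA1 and TP53 are linked to breast cancer."
    for start, end, surface, label in find_entities(sentence):
        print(f"{start:>3}-{end:<3} {label:<8} {surface}")
```
+
+A lookup of this kind only finds names already in the vocabulary, which is one reason learned models are also used for biomedical NER.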
+
+=== Document classification and clustering ===
+Biomedical documents may be classified or clustered based on their contents and topics. In classification, document categories are specified manually, while in clustering, documents form algorithm-dependent, distinct groups. These two tasks are representative of supervised and unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features. Methods for biomedical document clustering have relied upon k-means clustering.
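+
+The clustering side can be sketched with a small k-means over bag-of-words vectors. The four toy documents, the choice of two clusters, and the fixed seed documents are assumptions made for illustration; a practical system would normally use weighted features and a library implementation.
+
```python
from collections import Counter

# Toy documents (hypothetical): two about signaling, two about clinical notes.
DOCS = [
    "kinase phosphorylation protein signaling",
    "protein kinase signaling pathway",
    "patient fever cough diagnosis",
    "patient cough treatment diagnosis",
]


def bag_of_words(doc, vocab):
    """Represent a document as term counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [float(counts[word]) for word in vocab]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, initial_centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        for i, vector in enumerate(vectors):
            distances = [squared_distance(vector, c) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_documents(docs, seed_indices):
    """Cluster docs, seeding the centroids from the documents at seed_indices."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = [bag_of_words(doc, vocab) for doc in docs]
    seeds = [vectors[i] for i in seed_indices]
    return k_means(vectors, seeds)


if __name__ == "__main__":
    for doc, label in zip(DOCS, cluster_documents(DOCS, (0, 2))):
        print(label, doc)
```
+
+No category names are supplied here; the groups emerge from the term counts alone, which is the sense in which clustering is unsupervised.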
+
+=== Relationship discovery ===
+Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time (i.e., temporal relationships), or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.
+
+=== Hedge cue detection ===
+The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.
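+
+As a rough illustration, hedge cues can be flagged with a simple lexicon lookup. The cue list and function names below are assumptions chosen for the example and are not drawn from a published system; a lookup like this will also produce false positives (for example, "May" as a month).
+
```python
import re

# Hypothetical cue lexicon; a real system would use a curated or learned list.
HEDGE_CUES = (
    "may",
    "might",
    "could",
    "suggest",
    "suggests",
    "possibly",
    "appears to",
    "is thought to",
)


def find_hedge_cues(sentence, cues=HEDGE_CUES):
    """Return the cues from the lexicon that occur as whole words in the sentence."""
    found = []
    for cue in cues:
        pattern = r"\b" + re.escape(cue) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(cue)
    return found


def is_hedged(sentence):
    """Treat a sentence as hedged if it contains at least one cue."""
    return bool(find_hedge_cues(sentence))


if __name__ == "__main__":
    for s in (
        "These results suggest that gene X may regulate gene Y.",
        "Gene X regulates gene Y.",
    ):
        print(is_hedged(s), find_hedge_cues(s))
```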
+
+=== Claim detection ===
+Multiple researchers have developed methods to identify specific scientific claims from literature. In practice, this process involves both isolating phrases and sentences denoting the core arguments made by the authors of a document (a process known as argument mining, employing tools used in fields such as political science) and comparing claims to find potential contradictions between them.
--- a/data/en.wikipedia.org/wiki/Biomedical_text_mining-1.md
+++ b/data/en.wikipedia.org/wiki/Biomedical_text_mining-1.md
@ -0,0 +1,48 @@
+---
+title: "Biomedical text mining"
+chunk: 2/3
+source: "https://en.wikipedia.org/wiki/Biomedical_text_mining"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:43.230057+00:00"
+instance: "kb-cron"
+---
+
+=== Information extraction ===
+Information extraction, or IE, is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or knowledge base. In the biomedical domain, IE is used to generate links between concepts described in text, such as gene A inhibits gene B and gene C is involved in disease G. Biomedical knowledge bases containing this type of information are generally products of extensive manual curation, so replacement of manual efforts with automated methods remains a compelling area of research.
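+
+The "gene A inhibits gene B" example can be sketched as a pattern-based extractor that turns sentences into subject-relation-object triples and collects them in a small knowledge base. The relation verbs, the regular expression, and the placeholder gene names are assumptions for illustration; this is a deliberately naive stand-in for the NER and relation discovery steps a full pipeline would use.
+
```python
import re
from collections import defaultdict

# Hypothetical set of relation verbs to look for.
RELATION_VERBS = ("inhibits", "activates", "binds")

# Matches "<name> <verb> <name>" where names are single alphanumeric tokens.
RELATION_PATTERN = re.compile(
    r"\b(?P<subject>[A-Za-z0-9-]+) "
    r"(?P<relation>" + "|".join(RELATION_VERBS) + r") "
    r"(?P<object>[A-Za-z0-9-]+)"
)


def extract_triples(text):
    """Return (subject, relation, object) triples found in free text."""
    return [
        (m.group("subject"), m.group("relation"), m.group("object"))
        for m in RELATION_PATTERN.finditer(text)
    ]


def build_knowledge_base(triples):
    """Group triples by subject: subject -> list of (relation, object)."""
    kb = defaultdict(list)
    for subject, relation, obj in triples:
        kb[subject].append((relation, obj))
    return dict(kb)


if __name__ == "__main__":
    text = "geneA inhibits geneB, and geneC activates geneD."
    triples = extract_triples(text)
    print(triples)
    print(build_knowledge_base(triples))
```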
+
+=== Information retrieval and question answering ===
+Biomedical text mining supports applications for identifying documents and concepts matching search queries. Search engines such as PubMed search allow users to query literature databases with words or phrases present in document contents, metadata, or indices such as MeSH. Similar approaches may be used for medical literature retrieval. For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.
+On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset (CORD-19) to enable text mining of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the Allen Institute for AI. Other participants include Google, Microsoft Research, the Center for Security and Emerging Technology, and the Chan Zuckerberg Initiative.
+
+== Resources ==
+
+=== Corpora ===
+The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as MeSH. Items marked "Yes" under "Freely Available" can be downloaded from a publicly accessible location. 
+
+=== Word embeddings ===
+Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as word vectors or word embeddings. Sources of pre-trained embeddings specific for biomedical vocabulary are listed in the table below. The majority are results of the word2vec model developed by Mikolov et al. or variants of word2vec.
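+
+To illustrate what "vocabulary mapped to vectors of real numbers" makes possible, the sketch below compares words by cosine similarity. The three-dimensional vectors are invented for the example (real pre-trained embeddings typically have hundreds of dimensions), and the function names are hypothetical.
+
```python
import math

# Made-up 3-dimensional vectors standing in for pre-trained embeddings.
EMBEDDINGS = {
    "protein": [0.9, 0.1, 0.0],
    "enzyme": [0.8, 0.2, 0.1],
    "fever": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings=EMBEDDINGS):
    """Return the other vocabulary word whose vector is closest to `word`."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


if __name__ == "__main__":
    print(most_similar("protein"))
    print(round(cosine_similarity(EMBEDDINGS["protein"], EMBEDDINGS["fever"]), 3))
```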
+
+== Applications ==
+
+Text mining applications in the biomedical field include computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations. Text mining techniques have several advantages over traditional manual curation for identifying associations. Text mining algorithms can identify and extract information from a vast body of literature more efficiently than manual curation can. This includes the integration of data from different sources, including literature, databases, and experimental results. These algorithms have transformed the process of identifying and prioritizing novel genes and gene-disease associations that have previously been overlooked.
+
+These methods provide a foundation for systematic searches of overlooked scientific and biomedical literature, which could reveal significant associations between research findings. Combining information, especially through the integration of datasets, can lead to new discoveries and hypotheses. The quality of a database is as important as its size. Promising text mining methods such as iProLINK (integrated Protein Literature Information and Knowledge) have been developed to curate data sources that can aid text mining research in areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. Curated databases such as UniProt can accelerate access to targeted information not only for genetic sequences, but also for literature and phylogeny.
+
+=== Gene cluster identification ===
+Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.
+
+=== Protein interactions ===
+Automatic extraction of protein interactions and associations of proteins to functional concepts (e.g. gene ontology terms) has been explored. The search engine PIE was developed to identify and return protein-protein interaction mentions from MEDLINE-indexed articles. The extraction of kinetic parameters from text or the subcellular location of proteins have also been addressed by information extraction and text mining technology.
+
+=== Gene-disease associations ===
+Computational gene prioritization is an essential step in understanding the genetic basis of diseases, particularly within genetic linkage analysis. Text mining and other computational tools extract relevant information, including gene-disease associations, among others, from numerous data sources, then apply different ranking algorithms to prioritize the genes based on their relevance to the specific disease. Text mining and gene prioritization allow researchers to focus their efforts on the most promising candidates for further research.
+Computational tools for gene prioritization continue to be developed and analyzed. One group studied the performance of various text-mining techniques for disease gene prioritization. They investigated different domain vocabularies, text representation schemes, and ranking algorithms in order to find the best approach for identifying disease-causing genes to establish a benchmark.
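+
+One simple way to picture the prioritization step is to rank candidate genes by how often they are mentioned in abstracts that also mention the disease. This co-occurrence count is a stand-in chosen for illustration, not one of the ranking algorithms evaluated in the studies above; the abstracts, gene names, and function name are all hypothetical.
+
```python
def prioritize_genes(abstracts, candidate_genes, disease_term):
    """Rank genes by the number of abstracts mentioning both gene and disease.

    Matching is plain case-insensitive substring search, so one gene name
    that is a prefix of another (e.g. GENE1 and GENE10) would be confused.
    """
    scores = {gene: 0 for gene in candidate_genes}
    disease = disease_term.lower()
    for abstract in abstracts:
        text = abstract.lower()
        if disease not in text:
            continue  # Only abstracts about the disease contribute evidence.
        for gene in candidate_genes:
            if gene.lower() in text:
                scores[gene] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    abstracts = [
        "GENE1 variants are associated with disease G in a cohort study.",
        "Expression of GENE1 and GENE2 is altered in disease G.",
        "GENE3 regulates cell growth.",
    ]
    for gene, score in prioritize_genes(abstracts, ["GENE1", "GENE2", "GENE3"], "disease G"):
        print(gene, score)
```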
+
+=== Gene-trait associations ===
+An agricultural genomics group identified genes related to bovine reproductive traits using text mining, among other approaches.
+
+==== Applications of phrase mining to disease associations ====
+A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), then semantically scored all 709 proteins according to their Integrity, Popularity, and Distinctiveness using the CaseOLAP pipeline. The text mining study validated existing relationships and informed previously unrecognized biological processes in cardiovascular pathophysiology.
+
+== Software tools ==
--- a/data/en.wikipedia.org/wiki/Biomedical_text_mining-2.md
+++ b/data/en.wikipedia.org/wiki/Biomedical_text_mining-2.md
@ -0,0 +1,40 @@
+---
+title: "Biomedical text mining"
+chunk: 3/3
+source: "https://en.wikipedia.org/wiki/Biomedical_text_mining"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:43.230057+00:00"
+instance: "kb-cron"
+---
+
+=== Search engines ===
+Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific for research literature include PubMed search, Europe PubMed Central search, GeneView, and APSE. Similarly, search engines and indexing systems specific for biomedical data have been developed, including DataMed and OmicsDI.
+Some search engines, such as Essie, OncoSearch, PubGene, and GoPubMed were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.
+
+=== Medical record analysis systems ===
+Electronic medical records (EMRs) and electronic health records (EHRs) are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text and difficult to search, leading to challenges with patient care. Numerous complete systems and tools have been developed to analyse these free-text portions. The MedLEE system was originally developed for analysis of chest radiology reports but later extended to other report topics. The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts. The CLAMP system offers similar functionality with a user-friendly interface.
+
+=== Frameworks ===
+Computational frameworks have been developed to rapidly build tools for biomedical text mining tasks. SwellShark is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision (e.g., UMLS semantic types). The SparkText framework uses Apache Spark data streaming, a NoSQL database, and basic machine learning methods to build predictive models from scientific articles.
+
+=== APIs ===
+Some biomedical text mining and natural language processing tools are available through application programming interfaces, or APIs. NOBLE Coder performs concept recognition through an API.
+
+== Conferences ==
+The following academic conferences and workshops host discussions and presentations in biomedical text mining advances. Most publish proceedings. 
+
+== Journals ==
+
+A variety of academic journals publishing manuscripts on biology and medicine include topics in text mining and natural language processing software. Some journals, including the Journal of the American Medical Informatics Association (JAMIA) and the Journal of Biomedical Informatics, are popular publications for these topics.
+
+== References ==
+
+== Further reading ==
+
+== External links ==
+Bio-NLP resources, systems and application database collection Archived 2009-05-04 at the Wayback Machine
+The BioNLP mailing list archives
+Corpora for biomedical text mining Archived 2011-07-24 at the Wayback Machine
+The BioCreative evaluations of biomedical text mining technologies
+Directory of people involved in BioNLP Archived 2011-08-09 at the Wayback Machine
--- a/data/en.wikipedia.org/wiki/Biomimetics-0.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-0.md
@ -0,0 +1,31 @@
+---
+title: "Biomimetics"
+chunk: 1/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+Biomimetics or biomimicry is the replication of the models, systems, and elements of nature for the purpose of solving complex human problems. The terms "biomimetics" and "biomimicry" are derived from Ancient Greek: βίος (bios), life, and μίμησις (mīmēsis), imitation, from μιμεῖσθαι (mīmeisthai), to imitate, from μῖμος (mimos), actor. Combined, the words mean imitating life. A closely related field is bionics.
+Evolution has been a feature of biological systems for over 3.8 billion years, according to estimates of when life first appeared. In theory, it has produced species that achieve high performance using commonly found materials. Surfaces of solids interact with other surfaces and the environment and derive the properties of materials. Biological materials are highly organized from the molecular to the nano-, micro-, and macroscales, often in a hierarchical manner with intricate nanoarchitecture that ultimately makes up a myriad of different functional elements. Properties of materials and surfaces result from a complex interplay between surface structure and morphology and physical and chemical properties. Many materials, surfaces, and objects in general provide multifunctionality.
+Various materials, structures, and devices have been fabricated for commercial interest by engineers, material scientists, chemists, and biologists, and for beauty, structure, and design by artists and architects. Nature has solved engineering problems such as self-healing abilities, environmental exposure tolerance and resistance, hydrophobicity, self-assembly, and harnessing solar energy. Economic impact of bioinspired materials and surfaces is significant, on the order of several hundred billion dollars per year worldwide. 
+
+== History ==
+One of the early examples of biomimicry was the study of birds and bats to enable human flight. Although never successful in creating a "flying machine", Leonardo da Vinci (1452–1519) was a keen observer of the anatomy and flight of birds and mammals, and made numerous notes and sketches on his observations as well as sketches of "flying machines". The Wright Brothers, who succeeded in flying the first heavier-than-air aircraft in 1903, allegedly derived inspiration from observations of pigeons in flight.
+
+During the 1950s, the American biophysicist and polymath Otto Schmitt developed the concept of "biomimetics". During his doctoral research, he developed the Schmitt trigger by studying the nerves in squid, attempting to engineer a device that replicated the biological system of nerve propagation. He continued to focus on devices that mimic natural systems and by 1957 he had perceived a converse to the standard view of biophysics at that time, a view he would come to call biomimetics.
+
+Biophysics is not so much a subject matter as it is a point of view. It is an approach to problems of biological science utilizing the theory and technology of the physical sciences. Conversely, biophysics is also a biologist's approach to problems of physical science and engineering, although this aspect has largely been neglected.
+In 1960, Jack E. Steele coined a similar term, bionics, at Wright-Patterson Air Force Base in Dayton, Ohio, where Otto Schmitt also worked. Steele defined bionics as "the science of systems which have some function copied from nature, or which represent characteristics of natural systems or their analogues". During a later meeting in 1963, Schmitt stated:
+
+Let us consider what bionics has come to mean operationally and what it or some word like it (I prefer biomimetics) ought to mean in order to make good use of the technical skills of scientists specializing, or rather, I should say, despecializing into this area of research.
+In 1969, Schmitt used the term "biomimetic" in the title of one of his papers, and by 1974 it had found its way into Webster's Dictionary. Bionics entered the same dictionary earlier in 1960 as "a science concerned with the application of data about the functioning of biological systems to the solution of engineering problems". Bionic took on a different connotation when Martin Caidin referenced Jack Steele and his work in the novel Cyborg, which later resulted in the 1974 television series The Six Million Dollar Man and its spin-offs. The term bionic then became associated with "the use of electronically operated artificial body parts" and "having ordinary human powers increased by or as if by the aid of such devices". Because the term bionic took on the implication of supernatural strength, the scientific community in English-speaking countries largely abandoned it.
+The term biomimicry appeared as early as 1982. Biomimicry was popularized by scientist and author Janine Benyus in her 1997 book Biomimicry: Innovation Inspired by Nature. Biomimicry is defined in the book as a "new science that studies nature's models and then imitates or takes inspiration from these designs and processes to solve human problems". Benyus suggests looking to Nature as a "Model, Measure, and Mentor" and emphasizes sustainability as an objective of biomimicry.
+The potential long-term impacts of biomimicry were quantified in a 2013 Fermanian Business & Economic Institute Report commissioned by the San Diego Zoo. The findings demonstrated the potential economic and environmental benefits of biomimicry, which can be further seen in Johannes-Paul Fladerer and Ernst Kurzmann's "managemANT" approach. This term (a combination of the words "management" and "ant") describes the usage of behavioural strategies of ants in economic and management strategies.
+
+== Bio-inspired technologies ==
+Biomimetics could in principle be applied in many fields. Because of the diversity and complexity of biological systems, the number of features that might be imitated is large. Biomimetic applications are at various stages of development from technologies that might become commercially usable to prototypes. Murray's law, which in conventional form determined the optimum diameter of blood vessels, has been re-derived to provide simple equations for the pipe or tube diameter which gives a minimum mass engineering system.
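+
+In its conventional form, Murray's law states that the cube of the parent vessel's radius equals the sum of the cubes of the daughter vessels' radii. The sketch below covers only that classical relation; the re-derived minimum-mass engineering equations mentioned above are not given in the text and are not reproduced here. The function names are illustrative.
+
```python
def parent_radius(daughter_radii):
    """Parent radius implied by the classical cube law: r_p**3 = sum(r_d**3)."""
    return sum(r ** 3 for r in daughter_radii) ** (1.0 / 3.0)


def symmetric_daughter_radius(parent, branches=2):
    """Radius of each daughter when a parent splits into equal branches."""
    return parent / branches ** (1.0 / 3.0)


if __name__ == "__main__":
    # Two equal daughters of radius 1 imply a parent of radius 2**(1/3).
    print(round(parent_radius([1.0, 1.0]), 4))        # 1.2599
    # A parent of radius 1 splitting in two gives daughters of about 0.79.
    print(round(symmetric_daughter_radius(1.0), 4))   # 0.7937
```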
+
+=== Locomotion ===
--- a/data/en.wikipedia.org/wiki/Biomimetics-1.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-1.md
@ -0,0 +1,28 @@
+---
+title: "Biomimetics"
+chunk: 2/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+Aircraft wing design and flight techniques are being inspired by birds and bats. The streamlined aerodynamic design of the improved Japanese high-speed train, the Shinkansen 500 Series, was modelled after the beak of the kingfisher.
+Biorobots based on the physiology and methods of locomotion of animals include BionicKangaroo, which moves like a kangaroo, saving energy from one jump and transferring it to its next jump; Kamigami Robots, a children's toy that mimics cockroach locomotion to run quickly and efficiently over indoor and outdoor surfaces; and Pleobot, a shrimp-inspired robot used to study metachronal swimming and the ecological impacts of this propulsive gait on the environment.
+
+=== Biomimetic flying robots (BFRs) ===
+
+BFRs take inspiration from flying mammals, birds, or insects. BFRs can have flapping wings, which generate the lift and thrust, or they can be propeller actuated. BFRs with flapping wings have increased stroke efficiencies, increased maneuverability, and reduced energy consumption in comparison to propeller actuated BFRs. Mammal and bird inspired BFRs share similar flight characteristics and design considerations. For instance, both mammal and bird inspired BFRs minimize edge fluttering and pressure-induced wingtip curl by increasing the rigidity of the wing edge and wingtips. Mammal and insect inspired BFRs can be impact resistant, making them useful in cluttered environments.
+Mammal inspired BFRs typically take inspiration from bats, but the flying squirrel has also inspired a prototype. Examples of bat inspired BFRs include Bat Bot and the DALER. Mammal inspired BFRs can be designed to be multi-modal; therefore, they're capable of both flight and terrestrial movement. To reduce the impact of landing, shock absorbers can be implemented along the wings. Alternatively, the BFR can pitch up and increase the amount of drag it experiences. By increasing the drag force, the BFR will decelerate and minimize the impact upon grounding. Different land gait patterns can also be implemented.
+
+Bird inspired BFRs can take inspiration from raptors, gulls, and everything in-between. Bird inspired BFRs can be feathered to increase the angle of attack range over which the prototype can operate before stalling. The wings of bird inspired BFRs allow for in-plane deformation, and the in-plane wing deformation can be adjusted to maximize flight efficiency depending on the flight gait. An example of a raptor inspired BFR is the prototype by Savastano et al. The prototype has fully deformable flapping wings and is capable of carrying a payload of up to 0.8 kg while performing a parabolic climb, steep descent, and rapid recovery. The gull inspired prototype by Grant et al. accurately mimics the elbow and wrist rotation of gulls, and they find that lift generation is maximized when the elbow and wrist deformations are opposite but equal.
+Insect inspired BFRs typically take inspiration from beetles or dragonflies. An example of a beetle inspired BFR is the prototype by Phan and Park, and a dragonfly inspired BFR is the prototype by Hu et al. The flapping frequency of insect inspired BFRs is much higher than that of other BFRs; this is because of the aerodynamics of insect flight. Insect inspired BFRs are much smaller than those inspired by mammals or birds, so they are more suitable for dense environments. The prototype by Phan and Park took inspiration from the rhinoceros beetle, so it can successfully continue flight even after a collision by deforming its hindwings.
+
+=== Biomimetic architecture ===
+Living beings have adapted to a constantly changing environment during evolution through mutation, recombination, and selection. The core idea of the biomimetic philosophy is that nature's inhabitants including animals, plants, and microbes have the most experience in solving problems and have already found the most appropriate ways to last on planet Earth. Similarly, biomimetic architecture seeks solutions for building sustainability present in nature. While nature serves as a model, there are few examples of biomimetic architecture that aim to be nature positive.
+The 21st century has seen a ubiquitous waste of energy due to inefficient building designs, in addition to the over-utilization of energy during the operational phase of its life cycle. In parallel, recent advancements in fabrication techniques, computational imaging, and simulation tools have opened up new possibilities to mimic nature across different architectural scales. As a result, there has been a rapid growth in devising innovative design approaches and solutions to counter energy problems. Biomimetic architecture is one of these multi-disciplinary approaches to sustainable design that follows a set of principles rather than stylistic codes, going beyond using nature as inspiration for the aesthetic components of built form but instead seeking to use nature to solve problems of the building's functioning and saving energy.
+
+==== Characteristics ====
+The term biomimetic architecture refers to the study and application of construction principles which are found in natural environments and species, and are translated into the design of sustainable solutions for architecture. Biomimetic architecture uses nature as a model, measure and mentor for providing architectural solutions across scales, which are inspired by natural organisms that have solved similar problems in nature. Using nature as a measure refers to using an ecological standard of measuring sustainability, and efficiency of man-made innovations, while the term mentor refers to learning from natural principles and using biology as an inspirational source.
+Biomorphic architecture, also referred to as bio-decoration, on the other hand, refers to the use of formal and geometric elements found in nature, as a source of inspiration for aesthetic properties in designed architecture, and may not necessarily have non-physical, or economic functions. A historic example of biomorphic architecture dates back to Egyptian, Greek and Roman cultures, using tree and plant forms in the ornamentation of structural columns.
--- a/data/en.wikipedia.org/wiki/Biomimetics-2.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-2.md
@ -0,0 +1,28 @@
+---
+title: "Biomimetics"
+chunk: 3/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+==== Procedures ====
+Within biomimetic architecture, two basic procedures can be identified, namely, the bottom-up approach (biology push) and top-down approach (technology pull). The boundary between the two approaches is blurry with the possibility of transition between the two, depending on each individual case. Biomimetic architecture is typically carried out in interdisciplinary teams in which biologists and other natural scientists work in collaboration with engineers, material scientists, architects, designers, mathematicians and computer scientists.
+In the bottom-up approach, the starting point is a new result from basic biological research that is promising for biomimetic implementation. An example is the development of a biomimetic material system following quantitative analysis of the mechanical, physical, and chemical properties of a biological system.
+In the top-down approach, biomimetic innovations are sought for already existing developments that have been successfully established on the market. The cooperation focuses on the improvement or further development of an existing product.
+
+==== Examples ====
+Researchers studied the termite's ability to maintain virtually constant temperature and humidity in their termite mounds in Africa despite outside temperatures that vary from 1.5 to 40 °C (34.7 to 104.0 °F). Researchers initially scanned a termite mound and created 3-D images of the mound structure, which revealed construction that could influence human building design. The Eastgate Centre, a mid-rise office complex in Harare, Zimbabwe, stays cool via a passive cooling architecture that uses only 10% of the energy of a conventional building of the same size.
+
+Researchers at the Sapienza University of Rome were inspired by the natural ventilation in termite mounds and designed a double façade that significantly cuts down over-lit areas in a building. Scientists have imitated the porous nature of mound walls by designing a façade with double panels that was able to reduce heat gained by radiation and increase heat loss by convection in the cavity between the two panels. The overall cooling load on the building's energy consumption was reduced by 15%.
+
+A similar inspiration was drawn from the porous walls of termite mounds to design a naturally ventilated façade with a small ventilation gap. This façade design is able to induce air flow due to the Venturi effect and continuously circulates rising air in the ventilation slot. Significant transfer of heat between the building's external wall surface and the air flowing over it was observed. The design is coupled with greening of the façade. The green wall facilitates additional natural cooling via evaporation, respiration and transpiration in plants. The damp plant substrate further supports the cooling effect.
+Scientists at Shanghai University were able to replicate the complex microstructure of the clay-made conduit network in the mound to mimic the excellent humidity control in mounds. They proposed a porous humidity control material (HCM) using sepiolite and calcium chloride with water vapor adsorption-desorption content at 550 grams per meter squared. Calcium chloride is a desiccant and improves the water vapor adsorption-desorption property of the bio-HCM. The proposed bio-HCM has a regime of interfiber mesopores which acts as a mini reservoir. The flexural strength of the proposed material was estimated to be 10.3 MPa using computational simulations.
+In structural engineering, the Swiss Federal Institute of Technology (EPFL) has incorporated biomimetic characteristics in an adaptive deployable "tensegrity" bridge. The bridge can carry out self-diagnosis and self-repair. The arrangement of leaves on a plant has been adapted for better solar power collection.
+Analysis of the elastic deformation that occurs when a pollinator lands on the sheath-like perch part of the flower Strelitzia reginae (known as the bird-of-paradise flower) has inspired architects and scientists from the University of Freiburg and University of Stuttgart to create hingeless shading systems that can react to their environment. These bio-inspired products are sold under the name Flectofin.
+Other hingeless bioinspired systems include Flectofold, which was inspired by the trapping system of the carnivorous plant Aldrovanda vesiculosa.
+
+=== Structural materials ===
+There is a great need for new structural materials that are lightweight but offer exceptional combinations of stiffness, strength, and toughness.
--- a/data/en.wikipedia.org/wiki/Biomimetics-3.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-3.md
@ -0,0 +1,31 @@
+---
+title: "Biomimetics"
+chunk: 4/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+Such materials would need to be manufactured into bulk materials with complex shapes at high volume and low cost and would serve a variety of fields such as construction, transportation, energy storage and conversion. In a classic design problem, strength and toughness tend to be mutually exclusive, i.e., strong materials are brittle and tough materials are weak. However, natural materials with complex and hierarchical material gradients that span from nano- to macro-scales are both strong and tough. Generally, most natural materials utilize limited chemical components but complex material architectures that give rise to exceptional mechanical properties. Understanding the highly diverse and multifunctional biological materials and discovering approaches to replicate such structures will lead to advanced and more efficient technologies. Bone, nacre (abalone shell), teeth, the dactyl clubs of stomatopod shrimps and bamboo are great examples of damage-tolerant materials. The exceptional resistance to fracture of bone is due to complex deformation and toughening mechanisms that operate across different size scales, from the nanoscale structure of protein molecules to the macroscopic physiological scale. Nacre exhibits similar mechanical properties, although with a rather simpler structure. Nacre shows a brick-and-mortar-like structure with thick mineral layers (0.2–0.9 μm) of closely packed aragonite structures and a thin organic matrix (~20 nm). While thin films and micrometer-sized samples that mimic these structures are already produced, successful production of bulk biomimetic structural materials is yet to be realized. However, numerous processing techniques have been proposed for producing nacre-like materials. Pavement cells, epidermal cells on the surface of plant leaves and petals, often form wavy interlocking patterns resembling jigsaw puzzle pieces and are shown to enhance the fracture toughness of leaves, key to plant survival. Their pattern, replicated in laser-engraved poly(methyl methacrylate) samples, was also demonstrated to lead to increased fracture toughness. It is suggested that the arrangement and patterning of cells play a role in managing crack propagation in tissues.
+Biomorphic mineralization is a technique that produces materials with morphologies and structures resembling those of natural living organisms by using bio-structures as templates for mineralization. Compared to other methods of material production, biomorphic mineralization is facile, environmentally benign and economical.
+Freeze casting (ice templating), an inexpensive method to mimic natural layered structures, was employed by researchers at Lawrence Berkeley National Laboratory to create alumina-Al-Si and IT HAP-epoxy layered composites that match the mechanical properties of bone with an equivalent mineral/organic content. Various further studies also employed similar methods to produce high-strength, high-toughness composites involving a variety of constituent phases.
+Recent studies demonstrated production of cohesive and self-supporting macroscopic tissue constructs that mimic living tissues by printing tens of thousands of heterologous picoliter droplets in software-defined, 3D millimeter-scale geometries. Efforts have also been made to mimic the design of nacre in artificial composite materials using fused deposition modelling, and to mimic the helicoidal structures of stomatopod clubs in the fabrication of high-performance carbon fiber-epoxy composites.
+Various established and novel additive manufacturing technologies like PolyJet printing, direct ink writing, 3D magnetic printing, multi-material magnetically assisted 3D printing and magnetically assisted slip casting have also been utilized to mimic the complex micro-scale architectures of natural materials and provide huge scope for future research.
+Spider silk is tougher than the Kevlar used in bulletproof vests. Engineers could in principle use such a material, if it could be reengineered to have a long enough life, for parachute lines, suspension bridge cables, artificial ligaments for medicine, and other purposes. The self-sharpening teeth of many animals have been copied to make better cutting tools.
+New ceramics that exhibit giant electret hysteresis have also been realized.
+
+=== Neuronal computers ===
+Neuromorphic computers and sensors are electrical devices that copy the structure and function of biological neurons in order to compute. One example of this is the event camera, in which only the pixels that receive a new signal update to a new state. All other pixels do not update until a signal is received.
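+
+The per-pixel update rule described above can be sketched in a few lines. This is a minimal illustration rather than the design of any real sensor: the function name, the `threshold` parameter and the `(x, y, polarity)` event tuple are assumptions made for the example.
+
```python
def event_update(state, frame, threshold=0):
    """Update only the pixels whose input changed; return (new_state, events).

    state and frame are lists of rows of numeric intensities.
    Each event is a hypothetical (x, y, polarity) tuple, where polarity is
    +1 for a brighter pixel and -1 for a darker one.
    """
    new_state = [row[:] for row in state]  # unchanged pixels keep their old value
    events = []
    for y, (old_row, new_row) in enumerate(zip(state, frame)):
        for x, (old, new) in enumerate(zip(old_row, new_row)):
            if abs(new - old) > threshold:
                new_state[y][x] = new
                events.append((x, y, 1 if new > old else -1))
    return new_state, events


if __name__ == "__main__":
    state = [[0, 0], [0, 0]]
    frame = [[0, 5], [0, 0]]
    print(event_update(state, frame))
```
+
+Only the single changed pixel produces an event here; a conventional frame-based camera would instead report all four pixels on every frame.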
+
+=== Self-healing materials ===
+In some biological systems, self-healing occurs via chemical releases at the site of fracture, which initiate a systemic response to transport repairing agents to the fracture site. This promotes autonomic healing. To demonstrate the use of micro-vascular networks for autonomic healing, researchers developed a microvascular coating–substrate architecture that mimics human skin. Bio-inspired self-healing structural color hydrogels that maintain the stability of an inverse opal structure and its resultant structural colors were developed. A self-repairing membrane inspired by rapid self-sealing processes in plants was developed for inflatable lightweight structures such as rubber boats or Tensairity constructions. The researchers applied a thin soft cellular polyurethane foam coating on the inside of a fabric substrate, which closes the crack if the membrane is punctured with a spike. Self-healing materials, polymers and composite materials capable of mending cracks have been produced based on biological materials.
+The self-healing properties may also be achieved by the breaking and reforming of hydrogen bonds upon cyclical stress of the material.
+
+=== Surfaces ===
+Surfaces that recreate the properties of shark skin are intended to enable more efficient movement through water. Efforts have been made to produce fabric that emulates shark skin.
+Surface tension biomimetics are being researched for technologies such as hydrophobic or hydrophilic coatings and microactuators.
+
+=== Adhesion ===
--- a/data/en.wikipedia.org/wiki/Biomimetics-4.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-4.md
@ -0,0 +1,27 @@
+---
+title: "Biomimetics"
+chunk: 5/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+==== Wet adhesion ====
+Some amphibians, such as tree and torrent frogs and arboreal salamanders, are able to attach to and move over wet or even flooded environments without falling. Organisms of this kind have toe pads which are permanently wetted by mucus secreted from glands that open into the channels between epidermal cells. They attach to mating surfaces by wet adhesion and they are capable of climbing on wet rocks even when water is flowing over the surface. Tire treads have also been inspired by the toe pads of tree frogs. 3D printed hierarchical surface models, inspired by the toe pad design of tree and torrent frogs, have been observed to produce better wet traction than conventional tire designs.
+Marine mussels can stick easily and efficiently to surfaces underwater under the harsh conditions of the ocean. Mussels use strong filaments to adhere to rocks in the inter-tidal zones of wave-swept beaches, preventing them from being swept away in strong sea currents. Mussel foot proteins attach the filaments to rocks, boats and practically any surface in nature including other mussels. These proteins contain a mix of amino acid residues which has been adapted specifically for adhesive purposes. Researchers from the University of California Santa Barbara borrowed and simplified chemistries that the mussel foot uses to overcome this engineering challenge of wet adhesion to create copolyampholytes, and one-component adhesive systems with potential for employment in nanofabrication protocols. Other research has proposed adhesive glue from mussels.
+
+==== Dry adhesion ====
+Leg attachment pads of several animals, including many insects (e.g., beetles and flies), spiders and lizards (e.g., geckos), are capable of attaching to a variety of surfaces and are used for locomotion, even on vertical walls or across ceilings. Attachment systems in these organisms have similar structures at their terminal elements of contact, known as setae. Such biological examples have inspired the production of climbing robots, boots and tape. Synthetic setae have also been developed for the production of dry adhesives.
+
+=== Liquid repellency ===
+Superliquiphobicity refers to a remarkable surface property where a solid surface exhibits an extreme aversion to liquids, causing droplets to bead up and roll off almost instantaneously upon contact. This behavior arises from intricate surface textures and interactions at the nanoscale, effectively preventing liquids from wetting or adhering to the surface. The term "superliquiphobic" is derived from "superhydrophobic," which describes surfaces highly resistant to water. Superliquiphobic surfaces go beyond water repellency and display repellent characteristics towards a wide range of liquids, including those with very low surface tension or containing surfactants.
+Superliquiphobicity emerges when a solid surface possesses minute roughness, forming interfaces with droplets through wetting while altering contact angles. This behavior hinges on the roughness factor (Rf), defining the ratio of solid-liquid area to its projection, influencing contact angles. On rough surfaces, non-wetting liquids give rise to composite solid-liquid-air interfaces, their contact angles determined by the distribution of wet and air-pocket areas. The achievement of superliquiphobicity involves increasing the fractional flat geometrical area (fLA) and Rf, leading to surfaces that actively repel liquids.
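+
+The dependence of the contact angle on Rf and fLA can be made concrete with the Wenzel and Cassie–Baxter relations, which are the standard textbook forms for rough and composite interfaces. The sketch below assumes those relations are the ones intended here; the smooth-surface contact angle `theta0_deg` and the function names are introduced only for the example and do not appear in the text above.
+
```python
import math


def _angle_from_cos(c):
    # Clamp to [-1, 1] so that extreme roughness values saturate instead of failing.
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


def wenzel_angle(theta0_deg, rf):
    """Fully wetted rough surface: cos(theta) = Rf * cos(theta0)."""
    return _angle_from_cos(rf * math.cos(math.radians(theta0_deg)))


def cassie_baxter_angle(theta0_deg, rf, f_la):
    """Composite solid-liquid-air interface:
    cos(theta) = Rf * (1 - fLA) * cos(theta0) - fLA
    """
    c = rf * (1.0 - f_la) * math.cos(math.radians(theta0_deg)) - f_la
    return _angle_from_cos(c)


if __name__ == "__main__":
    # Illustrative inputs only: a 110 degree smooth-surface angle, no extra
    # roughness, and 90% of the projected area covered by air pockets.
    print(round(cassie_baxter_angle(110.0, 1.0, 0.9), 1))
```
+
+With fLA = 0 the composite relation reduces to the Wenzel case, and raising fLA pushes the apparent contact angle toward 180°, which is the trend the paragraph above describes.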
+The inspiration for crafting such surfaces draws from nature's ingenuity, illustrated by the "lotus effect". Leaves of water-repellent plants, like the lotus, exhibit inherent hierarchical structures featuring nanoscale wax-coated formations. Other natural surfaces with these capabilities include beetle carapaces and cactus spines, which may exhibit rough features at multiple size scales. These structures lead to superhydrophobicity, where water droplets perch on trapped air bubbles, resulting in high contact angles and minimal contact angle hysteresis. This natural example guides the development of superliquiphobic surfaces, capitalizing on re-entrant geometries that can repel even low surface tension liquids whose intrinsic contact angles are near zero.
+Creating superliquiphobic surfaces involves pairing re-entrant geometries with low surface energy materials, such as fluorinated substances or liquid-like silicones. These geometries include overhangs that widen beneath the surface, enabling repellency even for minimal contact angles. These surfaces find utility in self-cleaning, anti-icing, anti-fogging, antifouling, enhanced condensation, and more, presenting innovative solutions to challenges in biomedicine, desalination, atmospheric water harvesting, and energy conversion.
+In essence, superliquiphobicity, inspired by natural models like the lotus leaf, capitalizes on re-entrant geometries and surface properties to create interfaces that actively repel liquids. These surfaces hold immense promise across a range of applications, promising enhanced functionality and performance in various technological and industrial contexts.
+
+=== Optics ===
+
+Biomimetic materials are gaining increasing attention in the field of optics and photonics. There are still few known bioinspired or biomimetic products involving the photonic properties of plants or animals. However, understanding how nature designed such optical materials from biological resources is a current field of research.
--- a/data/en.wikipedia.org/wiki/Biomimetics-5.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-5.md
@ -0,0 +1,29 @@
+---
+title: "Biomimetics"
+chunk: 6/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+==== Inspiration from fruits and plants ====
+One source of biomimetic inspiration is plants. Plants have proven to be concept generators for the following functions: re(action)-coupling, self-adaptability, self-repair, and energy autonomy. As plants do not have a centralized decision-making unit (i.e. a brain), most plants have a decentralized autonomous system in various organs and tissues of the plant. Therefore, they react to multiple stimuli such as light, heat, and humidity.
+One example is the carnivorous plant species Dionaea muscipula (Venus flytrap). For the last 25 years, research has focused on the motion principles of the plant in order to develop artificial Venus flytrap (AVFT) robots. The plant's movement during prey capture has inspired soft robotic motion systems. The fast snap buckling (within 100–300 ms) of the trap closure movement is initiated when prey triggers the hairs of the plant within a certain time (twice within 20 s). AVFT systems exist in which the trap closure movements are actuated via magnetism, electricity, pressurized air, and temperature changes.
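+
+The trigger rule quoted above (two hair stimuli within 20 s) is simple enough to sketch as control logic for a hypothetical AVFT controller. The class name, method name and time handling below are assumptions for illustration and are not taken from any published AVFT system.
+
```python
class TrapTrigger:
    """Closes the trap when two hair stimuli arrive within window_s seconds."""

    def __init__(self, window_s=20.0):
        self.window_s = window_s
        self.last_touch = None  # time of the most recent stimulus, in seconds
        self.closed = False

    def touch(self, t):
        """Register a stimulus at time t (seconds); return True once the trap is closed."""
        if self.closed:
            return True
        if self.last_touch is not None and t - self.last_touch <= self.window_s:
            self.closed = True  # second stimulus arrived inside the window
        self.last_touch = t
        return self.closed


if __name__ == "__main__":
    trap = TrapTrigger()
    for t in (0.0, 25.0, 30.0):
        print(t, trap.touch(t))
```
+
+In the example run, the stimuli at 0 s and 25 s are too far apart to close the trap, while the one at 30 s falls within 20 s of the previous stimulus and does. The actual actuation (magnetic, electrical, pneumatic or thermal) would hang off the moment `closed` becomes true.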
+Another example of mimicking plants is Pollia condensata, also known as the marble berry. The chiral self-assembly of cellulose inspired by the Pollia condensata berry has been exploited to make optically active films. Such films are made from cellulose, which is a biodegradable and biobased resource obtained from wood or cotton. The structural colours can potentially be everlasting and have more vibrant colour than the ones obtained from chemical absorption of light. Pollia condensata is not the only fruit showing a structurally coloured skin; iridescence is also found in berries of other species such as Margaritaria nobilis. These fruits show iridescent colors in the blue-green region of the visible spectrum, which gives the fruit a strong metallic and shiny visual appearance. The structural colours come from the organisation of cellulose chains in the fruit's epicarp, a part of the fruit skin. Each cell of the epicarp is made of a multilayered envelope that behaves like a Bragg reflector. However, the light which is reflected from the skin of these fruits is not polarised, unlike that arising from man-made replicates obtained from the self-assembly of cellulose nanocrystals into helicoids, which only reflect left-handed circularly polarised light.
+The fruit of Elaeocarpus angustifolius also shows structural colour that arises from the presence of specialised cells called iridosomes, which have layered structures. Similar iridosomes have also been found in Delarbrea michieana fruits.
+In plants, multilayer structures can be found either at the surface of the leaves (on top of the epidermis), such as in Selaginella willdenowii, or within specialized intra-cellular organelles, the so-called iridoplasts, which are located inside the cells of the upper epidermis. For instance, the rainforest plant Begonia pavonina has iridoplasts located inside the epidermal cells.
+Structural colours have also been found in several algae, such as in the red alga Chondrus crispus (Irish Moss).
+
+==== Inspiration from animals ====
+
+Structural coloration produces the rainbow colours of soap bubbles, butterfly wings and many beetle scales. Phase-separation has been used to fabricate ultra-white scattering membranes from polymethylmethacrylate, mimicking the beetle Cyphochilus. LED lights can be designed to mimic the patterns of scales on fireflies' abdomens, improving their efficiency.
+Morpho butterfly wings are structurally coloured to produce a vibrant blue that does not vary with angle. This effect can be mimicked by a variety of technologies. Lotus Cars claim to have developed a paint that mimics the Morpho butterfly's structural blue colour. In 2007, Qualcomm commercialised an interferometric modulator display technology, "Mirasol", using Morpho-like optical interference. In 2010, the dressmaker Donna Sgro made a dress from Teijin Fibers' Morphotex, an undyed fabric woven from structurally coloured fibres, mimicking the microstructure of Morpho butterfly wing scales.
+Canon Inc.'s SubWavelength structure Coating uses wedge-shaped structures the size of the wavelength of visible light. The wedge-shaped structures cause a continuously changing refractive index as light travels through the coating, significantly reducing lens flare. This imitates the structure of a moth's eye. Notable figures such as the Wright Brothers and Leonardo da Vinci attempted to replicate the flight observed in birds. In an effort to reduce aircraft noise, researchers have looked to the leading edge of owl feathers, which have an array of small finlets or rachis adapted to disperse aerodynamic pressure and provide nearly silent flight to the bird.
+
+=== Agricultural systems ===
+Holistic planned grazing, using fencing and/or herders, seeks to restore grasslands by carefully planning movements of large herds of livestock to mimic the vast herds found in nature. The natural system being mimicked and used as a template is grazing animals concentrated by pack predators that must move on after eating, trampling, and manuring an area, and returning only after it has fully recovered. Its founder Allan Savory and some others have claimed potential in building soil, increasing biodiversity, and reversing desertification. However, many researchers have disputed Savory's claim. Studies have often found that the method increases desertification instead of reducing it.
+
+=== Geolocation ===
+Biomimetics can also mean recreating how insects perceive and navigate. Many insects use skylight polarization patterns to estimate north and geolocate. An open-source project has been shown to simulate the measurement of polarization patterns to aid in developing geolocation systems that use them. The project claims potential as a future alternative to traditional satellite navigation systems such as GPS, especially in remote areas, and may also assist in training neural networks to recognize polarization patterns.
--- a/data/en.wikipedia.org/wiki/Biomimetics-6.md
+++ b/data/en.wikipedia.org/wiki/Biomimetics-6.md
@ -0,0 +1,45 @@
+---
+title: "Biomimetics"
+chunk: 7/7
+source: "https://en.wikipedia.org/wiki/Biomimetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:44.463976+00:00"
+instance: "kb-cron"
+---
+
+=== Other uses ===
+Some air conditioning systems use biomimicry in their fans to increase airflow while reducing power consumption.
+Technologists like Jas Johl have speculated that the functionality of vacuole cells could be used to design highly adaptable security systems. "The functionality of a vacuole, a biological structure that guards and promotes growth, illuminates the value of adaptability as a guiding principle for security." The functions and significance of vacuoles are fractal in nature; the organelle has no basic shape or size, and its structure varies according to the requirements of the cell. Vacuoles not only isolate threats, contain what's necessary, export waste and maintain pressure; they also help the cell scale and grow. Johl argues these functions are necessary for any security system design. The 500 Series Shinkansen used biomimicry to reduce energy consumption and noise levels while increasing passenger comfort. With reference to space travel, NASA and other firms have sought to develop swarm-type space drones inspired by bee behavioural patterns, and octopod terrestrial drones designed with reference to desert spiders.
+
+== Other technologies ==
+Protein folding has been used to control material formation for self-assembled functional nanostructures. Polar bear fur has inspired the design of thermal collectors and clothing. The light refractive properties of the moth's eye have been studied to reduce the reflectivity of solar panels.
+
+The Bombardier beetle's powerful repellent spray inspired a Swedish company to develop a "micro mist" spray technology, which is claimed to have a low carbon impact (compared to aerosol sprays). The beetle mixes chemicals and releases its spray via a steerable nozzle at the end of its abdomen, stinging and confusing the victim.
+Most viruses have an outer capsule 20 to 300 nm in diameter. Virus capsules are remarkably robust and capable of withstanding temperatures as high as 60 °C; they are stable across the pH range 2–10. Viral capsules can be used to create nano device components such as nanowires, nanotubes, and quantum dots. Tubular virus particles such as the tobacco mosaic virus (TMV) can be used as templates to create nanofibers and nanotubes, since both the inner and outer layers of the virus are charged surfaces which can induce nucleation of crystal growth. This was demonstrated through the production of platinum and gold nanotubes using TMV as a template. Virus particles mineralized with different materials such as silicon, PbS, and CdS have been shown to withstand various pH values and could therefore serve as useful carriers of material. A spherical plant virus called cowpea chlorotic mottle virus (CCMV) has interesting expanding properties when exposed to environments of pH higher than 6.5. Above this pH, 60 independent pores with diameters of about 2 nm begin to exchange substance with the environment. The structural transition of the viral capsid can be utilized in biomorphic mineralization for selective uptake and deposition of minerals by controlling the solution pH. Possible applications include using the viral cage to produce uniformly shaped and sized quantum dot semiconductor nanoparticles through a series of pH washes. This is an alternative to the apoferritin cage technique currently used to synthesize uniform CdSe nanoparticles. Such materials could also be used for targeted drug delivery, since particles release their contents upon exposure to specific pH levels.
+
+=== Multimimicry Regenerative Model (MRM) ===
+In 2025, researchers Yassir Turki and Kilzar Arian proposed the *Multimimicry Regenerative Model (MRM)* as an expanded framework inspired by biomimicry. The model integrates six interrelated domains—Biomimicry, Chemomimicry, Physicomimicry, Geomimicry, Cosmomimicry, and Semiomimicry—to describe how regenerative processes operate across biological, chemical, physical, geological, cosmological, and semiotic systems. The MRM seeks to provide a unified scientific and design approach for regenerative innovation.
+
+== See also ==
+
+== References ==
+
+== Further reading ==
+Benyus, J. M. (2001). Along Came a Spider. Sierra, 86(4), 46–47.
+Hargroves, K. D. & Smith, M. H. (2006). Innovation inspired by nature Biomimicry. Ecos, (129), 27–28.
+Marshall, A. (2009). Wild Design: The Ecomimicry Project, North Atlantic Books: Berkeley.
+Passino, Kevin M. (2004). Biomimicry for Optimization, Control, and Automation. Springer.
+Pyper, W. (2006). Emulating nature: The rise of industrial ecology. Ecos, (129), 22–26.
+Smith, J. (2007). It's only natural. The Ecologist, 37(8), 52–55.
+Thompson, D'Arcy W., On Growth and Form. Dover 1992 reprint of 1942 2nd ed. (1st ed., 1917).
+Vogel, S. (2000). Cats' Paws and Catapults: Mechanical Worlds of Nature and People. Norton.
+
+== External links ==
+Biomimetics MIT
+Sex, Velcro and Biomimicry with Janine Benyus
+Janine Benyus: Biomimicry in Action  from TED 2009
+Design by Nature - National Geographic
+Michael Pawlyn: Using nature's genius in architecture from TED 2010
+Robert Full shows how human engineers can learn from animals' tricks from TED 2002
+The Fast Draw: Biomimicry from CBS News
--- a/data/en.wikipedia.org/wiki/Biopunk-0.md
+++ b/data/en.wikipedia.org/wiki/Biopunk-0.md
@ -0,0 +1,30 @@
+---
+title: "Biopunk"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Biopunk"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:48.177590+00:00"
+instance: "kb-cron"
+---
+
+Biopunk (a portmanteau of "biotechnology" or "biology" and "punk") is a subgenre of science fiction that focuses on biotechnology. It is derived from cyberpunk, but focuses on the implications of biotechnology rather than mechanical cyberware and information technology. Biopunk is concerned with synthetic biology, and often involves bio-hackers, biotech megacorporations, and oppressive organizations that engineer DNA. Most often keeping with the dark atmosphere of cyberpunk, biopunk generally examines risks and downsides of genetic engineering and illustrates potential perils of biotechnologies.
+
+
+== Description ==
+Biopunk is a subgenre of science fiction closely related to cyberpunk that focuses on the near-future (most often unintended) consequences of the biotechnology revolution following the invention of recombinant DNA. Biopunk stories explore the struggles of individuals or groups, often the product of human experimentation, against a typically dystopian backdrop of totalitarian governments or megacorporations which misuse biotechnologies as means of social control and profiteering. The benefits of biotechnology, such as human enhancement and extended longevity, are often not evenly distributed and are controlled by corporations. Unlike cyberpunk, which builds upon information technology, biopunk focuses on synthetic biology. Similar to postcyberpunk fiction, individuals are often modified and enhanced. Instead of cyberware, individuals use genetic manipulation. A common feature of biopunk fiction is the "black clinic", which is a laboratory, clinic, or hospital that performs illegal, unregulated, or ethically dubious biological modification and genetic engineering procedures.
+
+
+== History ==
+Many features of biopunk fiction have their roots in William Gibson's Neuromancer, one of the first cyberpunk novels. One of the prominent writers in this field is Paul Di Filippo, though he called his collection of such stories ribofunk, a blend of "ribosome" and "funk". Di Filippo suggests that precursors of biopunk fiction include H. G. Wells' The Island of Doctor Moreau; Julian Huxley's The Tissue-Culture King; some of David H. Keller's stories, Damon Knight's Natural State and Other Stories; Frederik Pohl and Cyril M. Kornbluth's Gravy Planet; novels of T. J. Bass and John Varley; Greg Bear's Blood Music; Bruce Sterling's Schismatrix and Autonomous by Annalee Newitz. The stories of Cordwainer Smith, including his first and most famous Scanners Live in Vain, also foreshadow biopunk themes. Another example is the New Jedi Order series published from 1999 to 2003, which prominently features the Yuuzhan Vong who use biotechnology exclusively.
+
+
+== See also ==
+
+
+== References ==
+
+
+== External links ==
+
+Hackteria.org, a community for bio-artists
--- a/data/en.wikipedia.org/wiki/Biositemap-0.md
+++ b/data/en.wikipedia.org/wiki/Biositemap-0.md
@ -0,0 +1,55 @@
+---
+title: "Biositemap"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Biositemap"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:51.787661+00:00"
+instance: "kb-cron"
+---
+
+A Biositemap is a way for a biomedical research institution or organisation to show how biological information is distributed throughout their Information Technology systems and networks. This information can be shared with other organisations and researchers.
+The Biositemap enables web browsers, crawlers and robots to easily access and process the information to use in other systems, media and computational formats. Biositemaps protocols provide clues for the Biositemap web harvesters, allowing them to find resources and content across the whole interlink of the Biositemap system. This means that human or machine users can access any relevant information on any topic across all organisations throughout the Biositemap system and bring it to their own systems for assimilation or analysis.
+
+
+== File framework ==
+
+The information is normally stored in a biositemap.rdf or biositemap.xml file which contains lists of information about the data, software, tools, material and services provided or held by that organisation. Information is presented in metafields and can be created online through sites such as the biositemaps online editor.
+The information is a blend of sitemaps and RSS feeds and is created using the Information Model (IM) and Biomedical Resource Ontology (BRO). The IM is responsible for defining the data held in the metafields and the BRO controls the terminology of the data held in the resource_type field. The BRO is critical in helping both other organisations and third parties to search and to refine those searches.
+
+
+=== Data formats ===
+The Biositemaps Protocol allows scientists, engineers, centers and institutions engaged in modeling, software tool development and analysis of biomedical and informatics data to broadcast and disseminate to the world the information about their latest computational biology resources (data, software tools and web services). The biositemap concept is based on ideas from Efficient, Automated Web Resource Harvesting and Crawler-friendly Web Servers, and it integrates the features of sitemaps and RSS feeds into a decentralized mechanism for computational biologists and bio-informaticians to openly broadcast and retrieve meta-data about biomedical resources.
+These site-, institution-, or investigator-specific biositemap descriptions are published in RDF format online and are searched, parsed, monitored and interpreted by web search engines, web applications specific to biositemaps and ontologies, and other applications interested in discovering updated or novel resources for bioinformatics and biomedical research investigations. The biositemap mechanism separates the providers of biomedical resources (investigators or institutions) from the consumers of resource content (researchers, clinicians, news media, funding agencies, educational and research initiatives).
+A Biositemap is an RDF file that lists the biomedical and bioinformatics resources for a specific research group or consortium. It allows developers of biomedical resources to describe the functionality and usability of each of their software tools, databases or web-services.
+Biositemaps supplement and do not replace the existing frameworks for dissemination of data, tools and services. Using a biositemap does not guarantee that resources will be included in search indexes nor does it influence the way that tools are ranked or perceived by the community. What the Biositemaps protocol will do is provide clues, information and directives to all Biositemap web harvesters that point to the existence and content of biomedical resources at different sites.
+
+
+=== Biositemap Information Model ===
+The Biositemap protocol relies on an extensible information model that includes specific properties that are commonly used and necessary for characterizing biomedical resources:
+
+Name
+Description
+URL
+Stage of development
+Organization
+Resource Ontology Label
+Keywords
+License
+Up-to-date documentation on the information model is available at the Biositemaps website.
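+
+The properties listed above can be pictured as one record per resource. The sketch below is purely illustrative: the class, its attribute names and the sample values are assumptions derived from the property list, not the official Biositemaps RDF vocabulary, which should be taken from the documentation mentioned above.
+
```python
from dataclasses import dataclass, field, asdict


@dataclass
class ResourceDescription:
    """Hypothetical in-memory record mirroring the listed properties."""
    name: str
    description: str
    url: str
    stage_of_development: str
    organization: str
    resource_ontology_label: str
    keywords: list = field(default_factory=list)
    license: str = ""


def missing_fields(record):
    """Return the names of properties that were left empty."""
    return [key for key, value in asdict(record).items() if value in ("", [], None)]


if __name__ == "__main__":
    # All values below are made up for the example.
    tool = ResourceDescription(
        name="Example Tool",
        description="Aligns example sequences.",
        url="https://example.org/tool",
        stage_of_development="Beta",
        organization="Example Lab",
        resource_ontology_label="Software",
        keywords=["alignment"],
    )
    print(missing_fields(tool))
```
+
+A check like `missing_fields` is one way a publisher might catch incomplete entries before serializing them to biositemap.rdf; the serialization step itself is deliberately left out because it depends on the official schema.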
+
+
+== See also ==
+Information visualization
+ITools Resourceome
+Sitemaps
+
+
+== References ==
+
+
+== External links ==
+Official website
+Biomedical Resource Ontology
+Biositemaps online editor
--- a/data/en.wikipedia.org/wiki/Biostatistics-0.md
+++ b/data/en.wikipedia.org/wiki/Biostatistics-0.md
@ -0,0 +1,32 @@
+---
+title: "Biostatistics"
+chunk: 1/6
+source: "https://en.wikipedia.org/wiki/Biostatistics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:53.013815+00:00"
+instance: "kb-cron"
+---
+
+Biostatistics (sometimes referred to as biometry) is a branch of statistics that applies statistical methods to a wide range of topics in the biological sciences, with a focus on clinical medicine and public health applications. 
+The field encompasses the design of experiments, the collection and analysis of experimental and observational data, and the interpretation of the results.
+It is closely related to medical statistics.
+
+== History ==
+
+=== Biostatistics and genetics ===
+Biostatistical modeling forms an important part of numerous modern biological theories. Genetics studies have, since their beginning, used statistical concepts to understand observed experimental results. Some geneticists even contributed statistical advances through the development of methods and tools. Gregor Mendel started the study of genetics by investigating genetic segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model, with fractions of the heredity coming from each ancestor composing an infinite series. He called this theory the "Law of Ancestral Heredity". His ideas were strongly opposed by William Bateson, who followed Mendel's conclusion that genetic inheritance was exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, such as Raphael Weldon, Arthur Dukinfield Darbishire and Karl Pearson, and the Mendelians, who supported Bateson's (and Mendel's) ideas, such as Charles Davenport and Wilhelm Johannsen. Later, biometricians could not reproduce Galton's conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.
+Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.
+
+Ronald Fisher worked alongside statistician Betty Allan, developing several basic statistical methods in support of his work on the crop experiments at Rothamsted Research. These were published in Fisher's books Statistical Methods for Research Workers (1925) and The Genetical Theory of Natural Selection (1930), as well as in Allan's scientific papers. Fisher went on to make many contributions to genetics and statistics, including ANOVA, p-value concepts, Fisher's exact test and Fisher's equation for population dynamics. He is credited with the sentence "Natural selection is a mechanism for generating an exceedingly high degree of improbability".
+Sewall G. Wright developed F-statistics and methods of computing them, and defined the inbreeding coefficient.
+J. B. S. Haldane's book, The Causes of Evolution, reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of primordial soup.
+These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled.
+In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study.
+Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."
+
+== Research planning ==
+Any research in the life sciences is proposed to answer a scientific question. To answer this question with high certainty, accurate results are needed. A correct definition of the main hypothesis and the research plan reduces errors when making decisions about a phenomenon. The research plan might include the research question, the hypothesis to be tested, the experimental design, data collection methods, data analysis perspectives and the costs involved. It is essential to carry out the study based on the three basic principles of experimental statistics: randomization, replication, and local control.
+
+=== Research question ===
+The research question defines the objective of a study. The research is guided by the question, so it needs to be concise while focusing on interesting and novel topics that may advance science and knowledge in the field. Deciding how to ask the scientific question may require an exhaustive literature review, so that the research adds value for the scientific community.
--- a/data/en.wikipedia.org/wiki/Biostatistics-1.md
+++ b/data/en.wikipedia.org/wiki/Biostatistics-1.md
@ -0,0 +1,115 @@
+---
+title: "Biostatistics"
+chunk: 2/6
+source: "https://en.wikipedia.org/wiki/Biostatistics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:53.013815+00:00"
+instance: "kb-cron"
+---
+
+=== Hypothesis definition ===
+Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis. The main proposition is called the null hypothesis (H0) and is usually based on established knowledge about the topic or an obvious occurrence of the phenomenon, supported by a thorough literature review. It can be regarded as the standard expected answer for the data under the situation being tested. In general, H0 assumes no association between treatments. The alternative hypothesis, on the other hand, is the negation of H0: it assumes some degree of association between the treatment and the outcome. In either case, the hypothesis is grounded in the research question and its expected and unexpected answers.
+As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H0 would be that there is no difference between the two diets in mouse metabolism (H0: μ1 = μ2) and the alternative hypothesis would be that the diets have different effects on the animals' metabolism (H1: μ1 ≠ μ2).
+The hypothesis is defined by the researcher, according to their interest in answering the main question. Besides that, there can be more than one alternative hypothesis. It can assume not only differences across observed parameters, but also the direction of those differences (i.e. higher or lower).
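+The mouse diet example can be sketched numerically. The block below is a minimal illustration, not a prescribed method: the body-weight values are invented, and Welch's t statistic is one possible choice for comparing two means under H0: μ1 = μ2.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical body weights (g) of mice under two diets
diet_a = [20.1, 21.3, 19.8, 22.0, 20.7]
diet_b = [23.4, 22.9, 24.1, 23.0, 24.6]

t_stat, df = welch_t(diet_a, diet_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

+A large absolute value of t is evidence against H0; turning it into a decision requires the significance level and critical value discussed later in the article.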
+
+=== Sampling ===
+Usually, a study aims to understand the effect of a phenomenon on a population. In biology, a population is defined as all the individuals of a given species in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections that can be studied: a population may be not only a set of individuals but also the total of one specific component of their organisms, such as the whole genome or all the sperm cells of an animal, or the total leaf area of a plant.
+It is usually not possible to take measurements from all the elements of a population. Because of that, the sampling process is very important for statistical inference. Sampling is defined as randomly obtaining a representative part of the entire population in order to make subsequent inferences about the population. The sample should therefore capture as much of the variability across the population as possible. The sample size is determined by several factors, from the scope of the research to the resources available. In clinical research, the trial type (non-inferiority, equivalence, or superiority) is key in determining sample size.
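+As a rough illustration of random sampling, the sketch below draws a simple random sample without replacement. The population of 100 numbered plants and the helper name `simple_random_sample` are assumptions made for the example.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements so that every element is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical population: ID numbers of 100 plants in a field
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sorted(sample))
```

+Simple random sampling is only one scheme; stratified or cluster sampling may be preferable when the population has known structure.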
+
+=== Experimental design ===
+Experimental designs uphold the basic principles of experimental statistics. There are three basic experimental designs for randomly allocating treatments to all plots of the experiment: the completely randomized design, the randomized block design, and factorial designs. Treatments can be arranged in many ways inside the experiment. In agriculture, the correct experimental design is the root of a good study, and the arrangement of treatments within the study is essential because the environment largely affects the plots (plants, livestock, microorganisms). The main arrangements can be found in the literature under names such as "lattices", "incomplete blocks", "split plot", and "augmented blocks", among many others. All of the designs might include control plots, determined by the researcher, to provide an error estimate during inference.
+In clinical studies, the samples are usually smaller than in other biological studies, and in most cases, the environment effect can be controlled or measured. It is common to use randomized controlled clinical trials, where results are usually compared with observational study designs such as case–control or cohort.
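+To make the idea of random allocation concrete, here is a minimal sketch of a randomized complete block design, in which every treatment appears once per block in an independently shuffled order. The treatment labels and the function name are hypothetical.

```python
import random


def randomized_block_design(treatments, n_blocks, seed=None):
    """Return one independently shuffled treatment order per block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        order = list(treatments)  # copy so the input is not modified
        rng.shuffle(order)        # randomization within the block
        layout.append(order)
    return layout


# Hypothetical experiment: four treatments replicated in three blocks
layout = randomized_block_design(["A", "B", "C", "D"], n_blocks=3, seed=1)
for block_number, order in enumerate(layout, start=1):
    print(f"Block {block_number}: {order}")
```

+Blocking provides the local control mentioned above, the shuffle provides randomization, and repeating the treatments across blocks provides replication.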
+
+=== Data collection ===
+Data collection methods must be considered in research planning, because they strongly influence the sample size and experimental design.
+Data collection varies according to the type of data. For qualitative data, collection can be done with structured questionnaires or by observation, considering the presence or intensity of disease and using scoring criteria to categorize levels of occurrence. For quantitative data, collection is done by measuring numerical information using instruments.
+In agriculture and biology studies, yield data and its components can be obtained by metric measurements. Pest and disease injuries in plants, however, are assessed by observation, using score scales for levels of damage. In genetic studies especially, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow larger experiments and make it possible to evaluate many plots in less time than data collection by humans alone.
+Finally, all collected data of interest must be stored in an organized data frame for further analysis.
+
+== Analysis and data interpretation ==
+
+=== Descriptive tools ===
+
+Data can be represented through tables or graphical representations, such as line charts, bar charts, histograms, and scatter plots. Measures of central tendency and variability can also be very useful for giving an overview of the data. Some examples follow:
+
+==== Frequency tables ====
+One type of table is the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be:
+Absolute: the number of times that a given value appears;
+
+$$N = f_{1} + f_{2} + f_{3} + \cdots + f_{n}$$
+
+Relative: obtained by dividing the absolute frequency by the total number of observations;
+
+$$n_{i} = \frac{f_{i}}{N}$$
+
+In the next example, we have the number of genes in ten operons of the same organism.
+
+Genes = {2,3,3,4,5,3,3,3,3,4}
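+
+The frequency table for this dataset can be built with a few lines of code. The sketch below uses Python's standard library; the variable names are illustrative only.

```python
from collections import Counter

# Number of genes in ten operons (dataset from the text)
genes = [2, 3, 3, 4, 5, 3, 3, 3, 3, 4]

absolute = Counter(genes)       # absolute frequency f_i of each value
total = sum(absolute.values())  # N = f_1 + f_2 + ... + f_n
relative = {value: count / total for value, count in absolute.items()}

print("genes  absolute  relative")
for value in sorted(absolute):
    print(f"{value:>5}  {absolute[value]:>8}  {relative[value]:>8.1f}")
```

+Here the value 3 occurs in six of the ten operons, so its relative frequency is 0.6, and the relative frequencies sum to 1.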
+
+==== Line graph ====
+
+Line graphs represent the variation of a value over another metric, such as time. In general, values are represented in the vertical axis, while the time variation is represented in the horizontal axis.
--- a/data/en.wikipedia.org/wiki/Biostatistics-2.md
+++ b/data/en.wikipedia.org/wiki/Biostatistics-2.md
@ -0,0 +1,194 @@
+---
+title: "Biostatistics"
+chunk: 3/6
+source: "https://en.wikipedia.org/wiki/Biostatistics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:53.013815+00:00"
+instance: "kb-cron"
+---
+
+==== Bar chart ====
+A bar chart is a graph that shows categorical data as bars with heights (vertical bars) or widths (horizontal bars) proportional to the values they represent. Bar charts provide an image of data that could also be presented in a tabular format.
+In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016. The sharp fall in December 2016 reflects the effect of the Zika virus outbreak on the birth rate in Brazil.
+
+==== Histograms ====
+The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson.
+
+==== Scatter plot ====
+A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each one presenting the value of one variable determining the position on the horizontal axis and another variable on the vertical axis. They are also called scatter graph, scatter chart, scattergram, or scatter diagram.
+
+==== Mean ====
+
+The arithmetic mean is the sum of a collection of values ($x_{1}+x_{2}+x_{3}+\cdots+x_{n}$) divided by the number of items in this collection ($n$).
+
+$$\bar{x} = \frac{1}{n}\left(\sum_{i=1}^{n} x_{i}\right) = \frac{x_{1}+x_{2}+\cdots+x_{n}}{n}$$
+
+==== Median ====
+
+The median is the value in the middle of an ordered dataset. When the number of values is even, it is the mean of the two central values.
+
+==== Mode ====
+
+The mode is the value of a set of data that appears most often.
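+
+The three measures of central tendency can be computed directly. As a brief sketch, the block below applies Python's `statistics` module to the operon gene counts used in the frequency table example.

```python
import statistics

# Number of genes in ten operons (dataset from the frequency table example)
genes = [2, 3, 3, 4, 5, 3, 3, 3, 3, 4]

mean = statistics.mean(genes)      # sum of the values divided by n
median = statistics.median(genes)  # middle of the sorted values
mode = statistics.mode(genes)      # most frequent value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

+For these data the mean is 33/10 = 3.3, while the median and mode are both 3, which reflects the slight pull of the larger values on the mean.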
+
+==== Box plot ====
+A box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines (whiskers), and the box spans the interquartile range (IQR), from the 25th to the 75th percentile of the data. Outliers may be plotted as circles.
+
+==== Correlation coefficients ====
+Although correlations between two different kinds of data can be inferred from graphs, such as scatter plots, it is necessary to validate them through numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association.
+
+==== Pearson correlation coefficient ====
+The Pearson correlation coefficient is a measure of association between two variables, X and Y. This coefficient, usually represented by ρ (rho) for the population and r for the sample, takes values between −1 and 1, where ρ = 1 represents a perfect positive correlation, ρ = −1 represents a perfect negative correlation, and ρ = 0 means no linear correlation.
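+
+The sample coefficient r can be computed from its definition, the covariance of X and Y divided by the product of their standard deviations. The function below is a minimal sketch and the paired measurements are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # The 1/(n-1) factors of covariance and variances cancel, so sums suffice
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)


# Hypothetical paired measurements: fertilizer dose and plant height
dose = [0, 10, 20, 30, 40]
height = [12.1, 14.8, 17.2, 18.9, 22.4]
print(f"r = {pearson_r(dose, height):.3f}")
```

+A value of r close to 1 or −1 indicates a strong linear association, but r alone does not establish causation and can miss non-linear relationships.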
+
+=== Inferential statistics ===
+
+Inferential statistics is used to make inferences about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters that describe the population of interest, but since the data are limited, a representative sample must be used to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial for making inferences.
+
+Hypothesis testing
+Hypothesis testing is essential for making inferences about populations in order to answer research questions, as set out in the "Research planning" section. Four steps are commonly defined:
+
+The hypothesis to be tested: as stated earlier, a null hypothesis (H0), which is going to be tested, and an alternative hypothesis have to be defined. They must be defined before the experiment is carried out.
+Significance level and decision rule: A decision rule depends on the level of significance, that is, the acceptable error rate (α). In practice, α defines a critical value against which the test statistic is compared to determine statistical significance. So α also has to be set before the experiment.
+Experiment and statistical analysis: This is when the experiment is actually carried out following the appropriate experimental design, data are collected and the most suitable statistical tests are applied.
+Inference: This is made when the null hypothesis is rejected or not rejected, based on the evidence from comparing the p-value with α. Failure to reject H0 only means that there is not enough evidence to support its rejection, not that the hypothesis is true.
+Confidence intervals
+A confidence interval is a range of values that is expected to contain the true parameter value at a given level of confidence. The first step is to obtain the best unbiased estimate of the population parameter. The upper limit of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value corresponding to the chosen confidence level. The lower limit is calculated similarly, but by subtraction instead of addition.
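+The calculation of the interval limits can be sketched as follows. This is an illustration under stated assumptions: the measurements are invented, and the critical value is taken from the standard normal distribution, whereas for small samples a Student's t critical value is generally more appropriate.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(data)
    estimate = statistics.mean(data)
    sem = statistics.stdev(data) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * sem, estimate + z * sem


# Hypothetical measurements of leaf length (cm)
data = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
lower, upper = mean_confidence_interval(data)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

+Raising the confidence level widens the interval, because a larger critical value multiplies the same standard error.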
+
+== Statistical considerations ==
+
+=== Power and statistical error ===
+When testing a hypothesis, there are two types of statistical error possible: Type I error and Type II error.
+
+The type I error or false positive is the incorrect rejection of a true null hypothesis.
+The type II error or false negative is the failure to reject a false null hypothesis.
+The significance level denoted by α is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β and statistical power of the test is 1 − β.
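+The relationship between α, β and power can be illustrated for one simple case. The sketch below assumes a one-sided z-test of a mean with known standard deviation; the function name and the numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test for a mean shift of `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma     # standardized true effect
    return 1 - NormalDist().cdf(z_crit - shift)


# Hypothetical design: true difference 0.5, standard deviation 1.0
for n in (10, 20, 40):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n = {n:>2}: power = {power:.2f}, beta = {1 - power:.2f}")
```

+When the true effect is zero, the rejection probability equals α, and power grows as the sample size increases.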
+
+=== p-value ===
+The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the null hypothesis (H0) is true. It is also called the calculated probability. It is common to confuse the p-value with the significance level (α), but α is a predefined threshold for calling results significant. If p is less than α, the null hypothesis (H0) is rejected.
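+As a small sketch of the decision rule, the block below computes a two-sided p-value from a z statistic and compares it with α. The test statistic is a made-up value, and a standard normal null distribution is assumed.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a result at least as extreme as z under a standard normal H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05       # significance level, fixed before the experiment
z_observed = 2.5   # hypothetical test statistic
p = two_sided_p_value(z_observed)
reject_h0 = p < alpha
print(f"p = {p:.4f}, reject H0: {reject_h0}")
```

+The p-value is computed from the data, whereas α is chosen in advance; only their comparison produces the decision.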
--- a/data/en.wikipedia.org/wiki/Biostatistics-3.md
+++ b/data/en.wikipedia.org/wiki/Biostatistics-3.md
@ -0,0 +1,27 @@
+---
+title: "Biostatistics"
+chunk: 4/6
+source: "https://en.wikipedia.org/wiki/Biostatistics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:53.013815+00:00"
+instance: "kb-cron"
+---
+
+=== Multiple testing ===
+When multiple tests are performed, the probability of at least one false positive (the familywise error rate) increases, and a strategy is needed to account for this. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α*, and each test is individually compared with a value of α = α*/m. This ensures that the familywise error rate across all m tests is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative to the Bonferroni correction is to control the false discovery rate (FDR), the expected proportion of the rejected null hypotheses (the so-called discoveries) that are false (incorrect rejections). A procedure such as that of Benjamini and Hochberg ensures that, for independent tests, the false discovery rate is at most a chosen level q*. Thus, controlling the FDR is less conservative than the Bonferroni correction and has more power, at the cost of more false positives.
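+The difference between the two corrections can be seen on a small example. The sketch below implements the Bonferroni rule and the Benjamini–Hochberg step-up procedure (one common way of controlling the FDR); the p-values are invented.

```python
def bonferroni_reject(p_values, alpha_star=0.05):
    """Reject H0 for each test whose p-value is at most alpha_star / m."""
    m = len(p_values)
    return [p <= alpha_star / m for p in p_values]


def benjamini_hochberg_reject(p_values, q_star=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at q_star."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q_star
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q_star:
            max_rank = rank
    reject = [False] * m
    for index in order[:max_rank]:  # reject the max_rank smallest p-values
        reject[index] = True
    return reject


# Hypothetical p-values from eight tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(p_values)))
```

+With these values Bonferroni rejects one hypothesis (threshold 0.05/8 = 0.00625) and Benjamini–Hochberg rejects two, illustrating the gain in power from controlling the FDR instead of the familywise error rate.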
+
+=== Mis-specification and robustness checks ===
+The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification. Verifying that the outcome of a statistical test does not change when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.
+
+=== Model selection criteria ===
+Model selection criteria choose the candidate model that best approximates the true model. Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) are widely used examples; AIC is asymptotically efficient, whereas BIC is consistent.
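+Both criteria trade goodness of fit against model complexity. The sketch below uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the number of observations and ln L the maximized log-likelihood; the two candidate models and their log-likelihoods are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike's information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fitted models: (maximized log-likelihood, number of parameters)
n_observations = 50
models = {"simple": (-120.0, 3), "complex": (-118.5, 6)}

for name, (log_lik, k) in models.items():
    print(f"{name}: AIC = {aic(log_lik, k):.1f}, "
          f"BIC = {bic(log_lik, k, n_observations):.1f}")
```

+Lower values are preferred. In this made-up comparison the small gain in likelihood of the complex model does not compensate for its three extra parameters under either criterion.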
+
+== Developments and big data ==
+
+Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale and the ability to perform much more complex analyses using computational techniques. These come from developments in areas such as sequencing technologies, bioinformatics and machine learning (see machine learning in bioinformatics).
+
+=== Use in high-throughput data ===
+New biomedical technologies like microarrays, next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously. Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.
+Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high-dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). In fact, one can obtain quite high R² values despite very low predictive power of the statistical model. These classical statistical techniques (especially least squares linear regression) were developed for low-dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n ≫ p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R² of the validation test set, not those of the training set.
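+The gap between training and validation R² can be shown with a deliberately tiny toy case. The sketch below is not a high-dimensional analysis: it fits a straight line (two parameters) to only two invented training points, so the model is saturated and fits the training data perfectly, and then evaluates it on separate invented validation points.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """R^2 = 1 - RSS / TSS for the given data and fitted line."""
    mean_y = sum(y) / len(y)
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    tss = sum((b - mean_y) ** 2 for b in y)
    return 1 - rss / tss


# Hypothetical data: as many training observations as model parameters
x_train, y_train = [0, 1], [1.0, 3.0]
x_valid, y_valid = [2, 3, 4], [2.5, 1.5, 2.0]

slope, intercept = fit_line(x_train, y_train)
train_r2 = r_squared(x_train, y_train, slope, intercept)
valid_r2 = r_squared(x_valid, y_valid, slope, intercept)
print(f"training R^2 = {train_r2:.2f}, validation R^2 = {valid_r2:.2f}")
```

+The training R² equals 1 because the line passes through both training points, while the validation R² is strongly negative, which is why the validation set rather than the training set should be used to judge predictive power.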
+Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes. These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: It is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.
--- a/data/en.wikipedia.org/wiki/Biostatistics-4.md
+++ b/data/en.wikipedia.org/wiki/Biostatistics-4.md
@ -0,0 +1,33 @@
+---
+title: "Biostatistics"
+chunk: 5/6
+source: "https://en.wikipedia.org/wiki/Biostatistics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:53.013815+00:00"
+instance: "kb-cron"
+---
+
+=== Bioinformatics advances in databases, data mining, and biological interpretation ===
+The development of biological databases enables the storage and management of biological data, with the possibility of ensuring access for users around the world. They are useful for researchers who deposit data, retrieve information and files (raw or processed) originating from other experiments, or look up indexed scientific articles, as in PubMed. Another possibility is to search for a desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to SNPs (dbSNP), to knowledge on the characterization of genes and their pathways (KEGG), and to the description of gene function, classified by cellular component, molecular function and biological process (Gene Ontology). In addition to databases that contain specific molecular information, there are broader ones that store information about an organism or group of organisms. An example of a database directed towards just one organism, but containing much data about it, is TAIR, the Arabidopsis thaliana genetic and molecular database. Phytozome, in turn, stores the assemblies and annotation files of dozens of plant genomes and also contains visualization and analysis tools. Moreover, some databases are interconnected for information exchange and sharing; a major initiative is the International Nucleotide Sequence Database Collaboration (INSDC), which links data from DDBJ, EMBL-EBI, and NCBI.
+The increasing size and complexity of molecular datasets has led to the use of powerful statistical methods provided by computer science algorithms developed in the field of machine learning. Data mining and machine learning thus allow the detection of patterns in data with a complex structure, such as biological data, using methods of supervised and unsupervised learning, regression, cluster detection and association rule mining, among others. For instance, self-organizing maps and k-means are examples of clustering algorithms, while neural networks and support vector machines are examples of common machine learning models.
+Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results.
+
+=== Use of computationally intensive methods ===
+The advent of modern computer technology and relatively cheap computing resources has enabled computer-intensive biostatistical methods such as bootstrapping and other re-sampling methods.
+In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that they can be drawn and interpreted, even with a basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.
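+Bootstrapping, mentioned at the start of this section, is simple to sketch: the observed sample is resampled with replacement many times and the statistic of interest is recomputed on each resample. The block below is a minimal percentile-bootstrap illustration; the data, the function name and the number of resamples are assumptions.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000,
                 confidence=0.95, seed=None):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Hypothetical enzyme activity measurements
data = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 4.9, 5.2]
lower, upper = bootstrap_ci(data, seed=0)
print(f"bootstrap 95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

+The same function works for statistics without a simple analytical standard error, such as the median, by passing `stat=statistics.median`.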
+
+== Applications ==
+
+=== Public health ===
+Public health applications include epidemiology, health services research, nutrition, environmental health, and health care policy and management. In these medical contexts, it is important to consider the design and analysis of clinical trials. One example is the assessment of the severity of a patient's condition together with a prognosis of the outcome of a disease.
+With new technologies and genetic knowledge, biostatistics is now also used for systems medicine, which amounts to a more personalized medicine. For this, data from different sources are integrated, including conventional patient data, clinico-pathological parameters, molecular and genetic data, as well as data generated by additional new omics technologies.
+
+=== Quantitative genetics ===
+Quantitative genetics draws on population genetics and statistical genetics in order to link variation in genotype with variation in phenotype. In other words, it seeks to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region that is responsible for a continuous trait is called a quantitative trait locus (QTL). The study of QTLs became feasible through the use of molecular markers and the measurement of traits in populations, but mapping them requires a population obtained from an experimental cross, such as an F2 or recombinant inbred strains/lines (RILs). To scan for QTL regions in a genome, a genetic map based on linkage has to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.
+However, QTL mapping resolution is limited by the amount of recombination assayed, a problem for species in which it is difficult to obtain large numbers of offspring. Furthermore, allele diversity is restricted to individuals derived from contrasting parents, which limits studies of allele diversity when a panel of individuals representing a natural population is available. For this reason, the genome-wide association study (GWAS) was proposed in order to identify QTLs based on linkage disequilibrium, that is, the non-random association between traits and molecular markers. It was enabled by the development of high-throughput SNP genotyping.
+In animal and plant breeding, the use of markers in selection, mainly molecular ones, contributed to the development of marker-assisted selection. While QTL mapping is limited by resolution, GWAS does not have enough power to detect rare variants of small effect that are also influenced by the environment. The concept of genomic selection (GS) therefore arose in order to use all molecular markers in selection and to allow the performance of selection candidates to be predicted. The proposal is to genotype and phenotype a training population and develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped but not phenotyped population, called the testing population. This kind of study can also include a validation population, following the concept of cross-validation, in which the real phenotypes measured in this population are compared with the phenotypes predicted by the model; this comparison is used to check the accuracy of the model.
+As a summary, some points about the application of quantitative genetics are:
+
+This has been used in agriculture to improve crops (Plant breeding) and livestock (Animal breeding).
+In biomedical research, this work can assist in finding candidate gene alleles that can cause or influence predisposition to diseases in human genetics.
--- a/data/en.wikipedia.org/wiki/Biostatistics-5.md
+++ b/data/en.wikipedia.org/wiki/Biostatistics-5.md
@ -0,0 +1,81 @@
+---
+title: "Biostatistics"
+chunk: 6/6
+source: "https://en.wikipedia.org/wiki/Biostatistics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:53.013815+00:00"
+instance: "kb-cron"
+---
+
+=== Expression data ===
+Studies of differential gene expression from RNA-Seq data, as with RT-qPCR and microarrays, require a comparison of conditions. The goal is to identify genes that show a significant change in abundance between different conditions. Experiments are therefore designed appropriately, with replicates for each condition or treatment, and with randomization and blocking when necessary. In RNA-Seq, the quantification of expression uses mapped reads that are summarized over some genetic unit, such as the exons that are part of a gene sequence. Whereas microarray results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the Poisson, but it underestimates the sampling error, leading to false positives. Currently, biological variation is accounted for by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests for statistical significance and, as the number of genes is high, multiple testing correction has to be considered. Other examples of analyses of genomic data come from microarray or proteomics experiments, often concerning diseases or disease stages.
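+The reason the Poisson model falls short can be illustrated with a quick check of overdispersion. Under a Poisson distribution the variance equals the mean, whereas the negative binomial allows Var = μ + αμ², with dispersion α. The sketch below uses invented replicate counts for a single gene and a simple method-of-moments estimate of α; dedicated RNA-Seq tools use more elaborate estimators.

```python
import statistics


def nb_dispersion_moments(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2."""
    mu = statistics.mean(counts)
    variance = statistics.variance(counts)
    # Clamp at zero: no extra-Poisson variation if variance <= mean
    return max(0.0, (variance - mu) / mu ** 2)


# Hypothetical read counts for one gene across six biological replicates
counts = [12, 30, 7, 45, 18, 25]

mu = statistics.mean(counts)
variance = statistics.variance(counts)
alpha = nb_dispersion_moments(counts)
print(f"mean = {mu:.1f}, variance = {variance:.1f}, dispersion = {alpha:.2f}")
```

+Here the variance is several times the mean, so a Poisson model would understate the variability between replicates, while a positive dispersion estimate absorbs the extra biological variation.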
+
+=== Other studies ===
+Ecology, ecological forecasting
+Biological sequence analysis
+Systems biology for gene network inference or pathways analysis.
+Clinical research and pharmaceutical development
+Population dynamics, especially with regard to fisheries science.
+Phylogenetics and evolution
+Pharmacodynamics
+Pharmacokinetics
+Neuroimaging
+
+== Tools ==
+Many tools can be used for the statistical analysis of biological data. Most of them are also useful in other areas of knowledge and cover a large number of applications. Brief descriptions of some of them follow (in alphabetical order):
+
+ASReml: Software developed by VSNi that can also be used in the R environment as a package. It estimates variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed and random effects, nested or crossed, are allowed. It makes it possible to investigate different variance-covariance matrix structures.
+CycDesigN: A computer package developed by VSNi that helps researchers create experimental designs and analyze data coming from a design in one of the classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and crossover designs. It also includes less commonly used designs, the Latinized ones, such as the t-Latinized design.
+Orange: A programming interface for high-level data processing, data mining and data visualization. It includes tools for gene expression and genomics.
+R: An open source environment and programming language dedicated to statistical computing and graphics. It is an implementation of the S language, with packages distributed through CRAN. In addition to its functions for reading data tables, computing descriptive statistics, and developing and evaluating models, its repository contains packages developed by researchers around the world. This allows functions to be written for the statistical analysis of data from specific applications. In the case of bioinformatics, for example, there are packages in the main repository (CRAN) and in others, such as Bioconductor. It is also possible to use packages under development that are shared on hosting services such as GitHub.
+SAS: A widely used data analysis software, found in universities, services and industry. Developed by a company of the same name (SAS Institute), it uses the SAS language for programming.
+PLA 3.0: A biostatistical analysis software for regulated environments (e.g. drug testing) which supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data.
+Weka: A Java software for machine learning and data mining, including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for comparing algorithms. Weka can also be run from other programming languages such as Perl or R.
+Python (programming language): image analysis, deep learning, machine learning
+SQL databases
+NoSQL
+NumPy: numerical Python
+SciPy
+SageMath
+LAPACK: linear algebra
+MATLAB
+Apache Hadoop
+Apache Spark
+Amazon Web Services
+
+== Scope and training programs ==
+Almost all educational programmes in biostatistics are at postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.
+In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, will have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research, which is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry (quality control), business and economics, and biological areas other than medicine.
+
+== Specialized journals ==
+
+Biostatistics
+International Journal of Biostatistics
+Journal of Epidemiology and Biostatistics
+Biostatistics and Public Health
+Biometrics
+Biometrika
+Biometrical Journal
+Communications in Biometry and Crop Science
+Statistical Applications in Genetics and Molecular Biology
+Statistical Methods in Medical Research
+Pharmaceutical Statistics
+Statistics in Medicine
+
+== See also ==
+Bioinformatics
+Epidemiological method
+Epidemiology
+Group size measures
+Health indicator
+Mathematical and theoretical biology
+
+== References ==
+
+== External links ==
+ Media related to Biostatistics at Wikimedia Commons
+The International Biometric Society
+The Collection of Biostatistics Research Archive
+Guide to Biostatistics (MedPageToday.com) Archived 2012-05-22 at the Wayback Machine
+Biomedical Statistics
--- a/data/en.wikipedia.org/wiki/Bloom_filters_in_bioinformatics-0.md
+++ b/data/en.wikipedia.org/wiki/Bloom_filters_in_bioinformatics-0.md
@ -0,0 +1,38 @@
+---
+title: "Bloom filters in bioinformatics"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Bloom_filters_in_bioinformatics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:54.231025+00:00"
+instance: "kb-cron"
+---
+
+Bloom filters are space-efficient probabilistic data structures used to test whether an element is a member of a set. Bloom filters require much less space than other data structures for representing sets; the downside is that queries have a false positive rate. Because different elements may share hash values for some of the hash functions, querying for an element that was never added may return a positive if all of its positions have already been set by other elements added to the Bloom filter. Assuming that each hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of a query is a function of the number of bits, the number of hash functions and the number of elements in the Bloom filter. This allows the user to manage the risk of getting a false positive by compromising on the space benefits of the Bloom filter.
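+The dependence described above is often written in closed form. As a sketch under the usual idealized assumptions (independent, uniformly distributed hash functions), and using symbols that are introduced here rather than taken from the text ($m$ bits, $h$ hash functions, $n$ inserted elements; $h$ is used instead of the conventional $k$ to avoid confusion with k-mers), the approximate false positive rate $\varepsilon$ is:
+
+```latex
+\varepsilon \approx \left(1 - e^{-hn/m}\right)^{h} ,
+\qquad
+h_{\text{opt}} = \frac{m}{n}\,\ln 2 .
+% Worked example: m/n = 10 bits per element, h = 7 hash functions
+\varepsilon \approx \left(1 - e^{-0.7}\right)^{7} \approx (0.5034)^{7} \approx 0.0082 .
+```
+
+Under these assumptions, about ten bits per stored element gives a false positive rate a little under 1%, and spending more bits per element lowers it further, which is the space-versus-risk compromise mentioned above.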
+Bloom filters are primarily used in bioinformatics to test the existence of a k-mer in a sequence or set of sequences. The k-mers of the sequence are indexed in a Bloom filter, and any k-mer of the same size can be queried against the Bloom filter. This is a preferable alternative to hashing the k-mers of a sequence with a hash table, particularly when the sequence is very long, since it is very demanding to store large numbers of k-mers in memory.
+
+
+== Applications ==
+
+
+=== Sequence characterization ===
+
+The preprocessing step in many bioinformatics applications involves classifying sequences, primarily classifying reads from a DNA sequencing experiment. For example, in metagenomic studies it is important to be able to tell if a sequencing read belongs to a new species, and in clinical sequencing projects it is vital to filter out reads from the genomes of contaminating organisms. There are many bioinformatics tools that use Bloom filters to classify reads by querying k-mers of a read against a set of Bloom filters generated from known reference genomes. Some tools that use this method are FACS and BioBloom tools. While these methods may not outclass other bioinformatics classification tools like Kraken, they offer a memory-efficient alternative.
+A recent area of research with Bloom filters in sequence characterization is in developing ways to query raw reads from sequencing experiments. For example, how can one determine which reads contain a specific 30-mer in the entire NCBI Sequence Read Archive? This task is similar to that which is accomplished by BLAST; however, it involves querying a much larger dataset. While BLAST queries against a database of reference genomes, this task demands that the specific reads that contain the k-mer are returned. BLAST and similar tools cannot handle this problem efficiently, so Bloom filter based data structures have been implemented to this end. Binary bloom trees are binary trees of Bloom filters that facilitate querying transcripts in large RNA-seq experiments. BIGSI borrows bitsliced signatures from the field of document retrieval to index and query the entirety of microbial and viral sequencing data in the European Nucleotide Archive. The signature of a given dataset is encoded as a set of Bloom filters from that dataset.
+
+
+=== Genome assembly ===
+
+The memory efficiency of Bloom filters has been used in genome assembly as a way to reduce the space footprint of k-mers from sequencing data. The contribution of Bloom filter based assembly methods is combining Bloom filters and de Bruijn graphs into a structure called a probabilistic de Bruijn graph, which optimizes memory usage at the cost of the false positive rate inherent to Bloom filters. Instead of storing the de Bruijn graph in a hash table, it is stored in a Bloom filter.
+Using a Bloom filter to store the de Bruijn graph complicates the graph traversal step to build the assembly, since edge information is not encoded in the Bloom filter. Graph traversal is accomplished by querying the Bloom filter for any of the four possible subsequent k-mers from the current node. For example, if the current node is for the k-mer ACT, then the next node must be for one of the k-mers CTA, CTG, CTC or CTT. If a query k-mer exists in the Bloom filter, then the k-mer is added to the path. Therefore, there are two sources of false positives in querying the Bloom filter when traversing the de Bruijn graph. There is the probability that one or more of the three false k-mers exist elsewhere in the sequencing set and so return a false positive, and there is the aforementioned inherent false positive rate of the Bloom filter itself. The assembly tools that use Bloom filters must account for these sources of false positives in their methods. ABySS 2 and Minia are examples of assemblers that use this approach for de novo assembly.
+
+
+=== Sequencing error correction ===
+Next-generation sequencing (NGS) methods have allowed the generation of new genome sequences much faster and cheaper than the previous Sanger sequencing methods. However, these methods have a higher error rate, which complicates downstream analysis of the sequence and can even give rise to erroneous conclusions. Many methods have been developed to correct the errors in NGS reads, but they use large amounts of memory, which makes them impractical for large genomes, such as the human genome. Therefore, tools using Bloom filters have been developed to address these limitations, taking advantage of their efficient memory usage. Musket and BLESS are examples of such tools. Both methods use the k-mer spectrum approach for error correction. The first step of this approach is to count the multiplicity of k-mers. However, while BLESS only uses Bloom filters to store the counts, Musket uses Bloom filters only to count unique k-mers, and stores non-unique k-mers in a hash table, as described in a previous work.
+
+
+=== RNA-Seq ===
+Bloom filters are also employed in some RNA-Seq pipelines. RNA-Skim clusters RNA transcripts and then uses Bloom filters to find sig-mers: k-mers that are only found in one of the clusters. These sig-mers are then used to estimate the transcript abundance levels. Because it does not analyze every possible k-mer, it improves performance and memory usage, and it has been shown to work as well as previous methods.
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Boolean_network-0.md
+++ b/data/en.wikipedia.org/wiki/Boolean_network-0.md
@ -0,0 +1,337 @@
+---
+title: "Boolean network"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Boolean_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:56.694067+00:00"
+instance: "kb-cron"
+---
+
+A Boolean network consists of a discrete set of Boolean variables, each of which has a Boolean function (possibly different for each variable) assigned to it. The function takes inputs from a subset of those variables, and its output determines the state of the variable it is assigned to. This set of functions in effect determines a topology (connectivity) on the set of variables, which then become nodes in a network. Usually, the dynamics of the system is taken as a discrete time series where the state of the entire network at time t+1 is determined by evaluating each variable's function on the state of the network at time t. This may be done synchronously or asynchronously.
+Boolean networks have been used in biology to model regulatory networks. Although Boolean networks are a crude simplification of genetic reality, where genes are not simple binary switches, there are several cases where they correctly convey the pattern of expressed and suppressed genes.
+The seemingly mathematically easy (synchronous) model was only fully understood in the mid-2000s.
+
+== Classical model ==
+A Boolean network is a particular kind of sequential dynamical system, where time and states are discrete, i.e. both the set of variables and the set of states in the time series each have a bijection onto an integer series.
+A random Boolean network (RBN) is one that is randomly selected from the set of all possible Boolean networks of a particular size, N. One can then study statistically how the expected properties of such networks depend on various statistical properties of the ensemble of all possible networks. For example, one may study how the RBN behavior changes as the average connectivity is changed.
+The first Boolean networks were proposed by Stuart A. Kauffman in 1969 as random models of genetic regulatory networks, but their mathematical understanding only started in the 2000s.
+
+=== Attractors ===
+Since a Boolean network has only $2^N$ possible states, a trajectory will sooner or later reach a previously visited state, and thus, since the dynamics are deterministic, the trajectory will fall into a steady state or cycle called an attractor (though in the broader field of dynamical systems a cycle is only an attractor if perturbations from it lead back to it). If the attractor has only a single state it is called a point attractor, and if the attractor consists of more than one state it is called a cycle attractor. The set of states that lead to an attractor is called the basin of the attractor. States which occur only at the beginning of trajectories (no trajectories lead to them) are called garden-of-Eden states, and the dynamics of the network flow from these states towards attractors. The time it takes to reach an attractor is called the transient time.
+With growing computer power and increasing understanding of the seemingly simple model, different authors have given different estimates for the mean number and length of the attractors.
+
+== Stability ==
+In dynamical systems theory, the structure and length of the attractors of a network corresponds to the dynamic phase of the network. The stability of Boolean networks depends on the connections of their nodes. A Boolean network can exhibit stable, critical or chaotic behavior. This phenomenon is governed by a critical value of the average number of connections of nodes ($K_c$), and can be characterized by the Hamming distance as distance measure. In the unstable regime, the distance between two initially close states on average grows exponentially in time, while in the stable regime it decreases exponentially. In this, with "initially close states" one means that the Hamming distance is small compared with the number of nodes ($N$) in the network.
+For N-K-model the network is stable if $K < K_c$, critical if $K = K_c$, and unstable if $K > K_c$.
+The state of a given node $n_i$ is updated according to its truth table, whose outputs are randomly populated. $p_i$ denotes the probability of assigning an off output to a given series of input signals.
+If $p_i = p = \mathrm{const.}$ for every node, the transition between the stable and chaotic range depends on $p$. According to Bernard Derrida and Yves Pomeau, the critical value of the average number of connections is $K_c = 1/[2p(1-p)]$.
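+As a worked illustration of the Derrida and Pomeau criterion (the two values of $p$ below are chosen here purely as examples):
+
+```latex
+K_c = \frac{1}{2p(1-p)}
+% Unbiased truth tables, p = 1/2
+p = \tfrac{1}{2}: \quad K_c = \frac{1}{2 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}} = 2
+% Biased truth tables, p = 0.1
+p = 0.1: \quad K_c = \frac{1}{2 \cdot 0.1 \cdot 0.9} = \frac{1}{0.18} \approx 5.56
+```
+
+So an unbiased network is critical at an average of two inputs per node, while strongly biased truth tables tolerate a higher connectivity before the dynamics become chaotic.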
+If $K$ is not constant, and there is no correlation between the in-degrees and out-degrees, the conditions of stability is determined by $\langle K^{in}\rangle$. The network is stable if $\langle K^{in}\rangle < K_c$, critical if $\langle K^{in}\rangle = K_c$, and unstable if $\langle K^{in}\rangle > K_c$.
+The conditions of stability are the same in the case of networks with scale-free topology where the in- and out-degree distribution is a power-law distribution: $P(K) \propto K^{-\gamma}$, and $\langle K^{in}\rangle = \langle K^{out}\rangle$, since every out-link from a node is an in-link to another.
+Sensitivity shows the probability that the output of the Boolean function of a given node changes if its input changes. For random Boolean networks,
--- a/data/en.wikipedia.org/wiki/Boolean_network-1.md
+++ b/data/en.wikipedia.org/wiki/Boolean_network-1.md
@ -0,0 +1,191 @@
+---
+title: "Boolean network"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Boolean_network"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:56.694067+00:00"
+instance: "kb-cron"
+---
+
+$q_i = 2p_i(1-p_i)$. In the general case, stability of the network is governed by the largest eigenvalue $\lambda_Q$ of matrix $Q$, where $Q_{ij} = q_i A_{ij}$, and $A$ is the adjacency matrix of the network. The network is stable if $\lambda_Q < 1$, critical if $\lambda_Q = 1$, unstable if $\lambda_Q > 1$.
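+As a consistency check, the eigenvalue criterion can be evaluated for the homogeneous case. This is a sketch under two assumptions that are not stated in the text: every node has the same sensitivity $q_i = q = 2p(1-p)$, and $A_{ij} = 1$ exactly when node $j$ is an input of node $i$, with every node having exactly $K$ inputs (so every row of $A$ sums to $K$).
+
+```latex
+Q = qA , \qquad \sum_{j} Q_{ij} = qK \ \text{ for every } i
+% A non-negative matrix whose rows all have the same sum has that sum as its largest eigenvalue
+\lambda_Q = qK = 2p(1-p)\,K
+% Stability condition
+\lambda_Q < 1 \iff K < \frac{1}{2p(1-p)} = K_c
+```
+
+Under these assumptions the general condition reduces to the critical connectivity $K_c = 1/[2p(1-p)]$ of the classical model.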
+
+== Variations of the model ==
+
+=== Other topologies ===
+One theme is to study different underlying graph topologies.
+
+The homogeneous case simply refers to a grid, which reduces to the famous Ising model.
+Scale-free topologies may be chosen for Boolean networks. One can distinguish the case where only the in-degree distribution is power-law distributed, or only the out-degree distribution, or both.
+
+=== Other updating schemes ===
+Classical Boolean networks (sometimes called CRBN, i.e. Classic Random Boolean Network) are synchronously updated. Motivated by the fact that genes don't usually change their state simultaneously, different alternatives have been introduced. A common classification is the following:
+
+Deterministic asynchronous updated Boolean networks (DRBNs) are not synchronously updated but a deterministic solution still exists. A node $i$ will be updated when $t \equiv Q_i \pmod{P_i}$, where $t$ is the time step.
+The most general case is full stochastic updating (GARBN, general asynchronous random Boolean networks). Here, one (or more) node(s) are selected at each computational step to be updated.
+The Partially-Observed Boolean Dynamical System (POBDS) signal model differs from all previous deterministic and stochastic Boolean network models by removing the assumption of direct observability of the Boolean state vector and allowing uncertainty in the observation process, addressing the scenario encountered in practice.
+Autonomous Boolean networks (ABNs) are updated in continuous time (t is a real number, not an integer), which leads to race conditions and complex dynamical behavior such as deterministic chaos.
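+For the deterministic asynchronous scheme in the list above, a small worked example may help. The parameter values are hypothetical, and the range $0 \le Q_i < P_i$ is an assumption made here for the illustration.
+
+```latex
+% Node i with period P_i = 3 and offset Q_i = 1
+t \equiv 1 \pmod{3} \;\Longrightarrow\; t \in \{1, 4, 7, 10, \dots\}
+% Node j with period P_j = 2 and offset Q_j = 0
+t \equiv 0 \pmod{2} \;\Longrightarrow\; t \in \{0, 2, 4, 6, \dots\}
+```
+
+In this illustration both nodes are updated at $t = 4$, and at every other time step at most one of them changes, so the update order is fixed in advance rather than drawn at random.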
+
+== Application of Boolean Networks ==
+
+=== Classification ===
+The Scalable Optimal Bayesian Classification approach provides an optimal classification of trajectories that accounts for potential model uncertainty, and also proposes a particle-based trajectory classification that is highly scalable for large networks, with much lower complexity than the optimal solution.
+
+== See also ==
+NK model
+
+== References ==
+
+Dubrova, E., Teslenko, M., Martinelli, A. (2005). "Kauffman Networks: Analysis and Applications", in Proceedings of International Conference on Computer-Aided Design, pages 479-484.
+
+== External links ==
+Analysis of Dynamic Algebraic Models (ADAM) v1.1
+bioasp/bonesis: Synthesis of Most Permissive Boolean Networks from network architecture and dynamical properties
+CoLoMoTo (Consortium for Logical Models and Tools)
+DDLab
+NetBuilder Boolean Networks Simulator
+Open Source Boolean Network Simulator
+JavaScript Kauffman Network
+Probabilistic Boolean Networks (PBN)
+RBNLab
+A SAT-based tool for computing attractors in Boolean Networks
--- a/data/en.wikipedia.org/wiki/Brain_mapping-0.md
+++ b/data/en.wikipedia.org/wiki/Brain_mapping-0.md
@ -0,0 +1,31 @@
+---
+title: "Brain mapping"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Brain_mapping"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:57.957041+00:00"
+instance: "kb-cron"
+---
+
+Brain mapping is a set of neuroscience techniques predicated on the mapping of (biological) quantities or properties onto spatial representations of the (human or non-human) brain resulting in maps.
+According to the definition established in 2013 by Society for Brain Mapping and Therapeutics (SBMT), brain mapping is specifically defined, in summary, as the study of the anatomy and function of the brain and spinal cord through the use of imaging, immunohistochemistry, molecular & optogenetics, stem cell and cellular biology, engineering, neurophysiology and nanotechnology.
+In 2024, a team of 287 researchers completed a full brain mapping of an adult animal (a Drosophila melanogaster, or fruit fly) and published their results in Nature.
+
+== Overview ==
+All neuroimaging is considered part of brain mapping. Brain mapping can be conceived as a higher form of neuroimaging, producing brain images supplemented by the result of additional (imaging or non-imaging) data processing or analysis, such as maps projecting (measures of) behavior onto brain regions (see fMRI). One such map, called a connectogram, depicts cortical regions around a circle, organized by lobes. Concentric circles within the ring represent various common neurological measurements, such as cortical thickness or curvature. In the center of the circles, lines representing white matter fibers illustrate the connections between cortical regions, weighted by fractional anisotropy and strength of connection. At higher resolutions brain maps are called connectomes. These maps incorporate individual neural connections in the brain and are often presented as wiring diagrams.
+Brain mapping techniques are constantly evolving, and rely on the development and refinement of image acquisition, representation, analysis, visualization and interpretation techniques.  Functional and structural neuroimaging are at the core of the mapping aspect of brain mapping.
+Some scientists have criticized the brain image-based claims made in scientific journals and the popular press, like the discovery of "the part of the brain responsible" for things like love, musical abilities or a specific memory. Many mapping techniques have a relatively low resolution, including hundreds of thousands of neurons in a single voxel. Many functions also involve multiple parts of the brain, meaning that this type of claim is probably both unverifiable with the equipment used and generally based on an incorrect assumption about how brain functions are divided. It may be that most brain functions will only be described correctly after being measured with much more fine-grained measurements that look not at large regions but instead at a very large number of tiny individual brain circuits. Many of these studies also have technical problems, like small sample size or poor equipment calibration, which mean they cannot be reproduced; such considerations are sometimes ignored to produce a sensational journal article or news headline. In some cases the brain mapping techniques are used for commercial purposes, lie detection, or medical diagnosis in ways which have not been scientifically validated.
+
+== History ==
+In the late 1980s in the United States, the Institute of Medicine of the National Academy of Science was commissioned to establish a panel to investigate the value of integrating neuroscientific information across a variety of techniques.
+Of specific interest is using structural and functional magnetic resonance imaging (fMRI), diffusion MRI (dMRI), magnetoencephalography (MEG), electroencephalography (EEG), positron emission tomography (PET), Near-infrared spectroscopy (NIRS)  and other non-invasive scanning techniques to map anatomy, physiology, perfusion, function and phenotypes of the human brain. Both healthy and diseased brains may be mapped to study memory, learning, aging, and drug effects in various populations such as people with schizophrenia, autism, and clinical depression. This led to the establishment of the Human Brain Project.  It may also be crucial to understanding traumatic brain injuries (as in the case of Phineas Gage) and improving brain injury treatment.
+Following a series of meetings, the International Consortium for Brain Mapping (ICBM) evolved.  The ultimate goal is to develop flexible computational brain atlases.
+
+=== Achievements ===
+
+The interactive citizen science website Eyewire, launched in 2012, maps the retinal cells of mice. In 2021, the most comprehensive 3D map of the human brain was published by researchers at Google. It shows neurons and their connections along with blood vessels and other components of a millionth of a brain. For the map, the 1 mm³ fragment was sliced into about 5,300 pieces of about 30 nanometer thickness, which were then each scanned with an electron microscope. The interactive map required 1.4 petabytes of storage space. About two months later, scientists reported that they had created the first complete neuron-level-resolution 3D map of a monkey brain, which they scanned via a new method within 100 hours. They made only a fraction of the 3D map publicly available, as the entire map takes more than 1 petabyte of storage space even when compressed.
+In October 2021, the BRAIN Initiative Cell Census Network concluded the first phase of a long-term project to generate an atlas of the entire mouse (mammalian) brain with 17 studies, including an atlas and census of cell types in the primary motor cortex.
+In 2024, FlyWire, a team of 287 researchers spanning 76 institutions, completed a brain mapping, or connectome, of an adult animal (a Drosophila melanogaster, or fruit fly) and published their results in Nature. Prior to this, the only adult animal to have its brain entirely reconstructed was the nematode Caenorhabditis elegans, but the fruit fly brain map is the first "complete map of any complex brain", according to Murthy, one of the researchers involved. Primary mapping data was collected through electron microscopy, assisted by artificial intelligence and by citizen scientists, who corrected errors that the artificial intelligence made. The resulting model had more than 140,000 neurons with over 50 million synapses. From the model, researchers expect to identify how the brain creates new connections for functions such as vision, creating digital twin equivalents to track how segments of the neuron connection map respond to external signals, including the nervous system.
+
+==== Brain development ====
--- a/data/en.wikipedia.org/wiki/Brain_mapping-1.md
+++ b/data/en.wikipedia.org/wiki/Brain_mapping-1.md
@ -0,0 +1,40 @@
+---
+title: "Brain mapping"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Brain_mapping"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:57.957041+00:00"
+instance: "kb-cron"
+---
+
+In 2021, the first connectome that shows how an animal's brain changes throughout its lifetime was reported. Scientists mapped and compared the whole brains of eight isogenic C. elegans worms, each at a different stage of development. Later that year, scientists combined electron microscopy and brainbow imaging to show for the first time the development of a mammalian neural circuit. They reported the complete wiring diagrams between the CNS and muscles of ten individual mice.
+
+==== Vision ====
+In August 2021, scientists of the MICrONS program, launched in 2016, published a functional connectomics dataset that "contains calcium imaging of an estimated 75,000 neurons from primary visual cortex (VISp) and three higher visual areas (VISrl, VISal and VISlm), that were recorded while a mouse viewed natural movies and parametric stimuli". Based on this data they also published "interactive visualizations of anatomical and functional data that span all 6 layers of mouse primary visual cortex and 3 higher visual areas (LM, AL, RL) within a cubic millimeter volume" – the MICrONS Explorer.
+
+==== Brain regeneration ====
+In 2022, a first spatiotemporal cellular atlas of axolotl brain development and regeneration, the interactive Axolotl Regenerative Telencephalon Interpretation via Spatiotemporal Transcriptomic Atlas, revealed key insights about axolotl brain regeneration.
+
+== Current atlas tools ==
+Talairach Atlas, 1988
+Harvard Whole Brain Atlas, 1995
+MNI Template, 1998 (the standard template of SPM and International Consortium for Brain Mapping)
+Atlas of the Developing Human Brain, 2012
+Infant Brain Atlas, 2023
+
+== See also ==
+
+== References ==
+
+== Further reading ==
+Rita Carter (1998). Mapping the Mind.
+F.J. Chen (2006). Brain Mapping And Language
+F.J. Chen (2006). Focus on Brain Mapping Research.
+F.J. Chen (2006). Trends in Brain Mapping Research.
+F.J. Chen (2006). Progress in Brain Mapping Research.
+Koichi Hirata (2002). Recent Advances in Human Brain Mapping: Proceedings of the 12th World Congress of the International Society for Brain Electromagnetic Topography (ISBET 2001).
+Konrad Maurer and Thomas Dierks (1991). Atlas of Brain Mapping: Topographic Mapping of Eeg and Evoked Potentials.
+Konrad Maurer (1989). Topographic Brain Mapping of Eeg and Evoked Potentials.
+Arthur W. Toga and John C. Mazziotta (2002). Brain Mapping: The Methods.
+Tatsuhiko Yuasa, James Prichard and S. Ogawa (1998). Current Progress in Functional Brain Mapping: Science and Applications.
--- a/data/en.wikipedia.org/wiki/CAFASP-0.md
+++ b/data/en.wikipedia.org/wiki/CAFASP-0.md
@ -0,0 +1,20 @@
+---
+title: "CAFASP"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/CAFASP"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:00.384510+00:00"
+instance: "kb-cron"
+---
+
+CAFASP, or the Critical Assessment of Fully Automated Structure Prediction, is a large-scale blind experiment in protein structure prediction that studies the performance of automated structure prediction webservers in homology modeling, fold recognition, and ab initio prediction of protein tertiary structures based only on amino acid sequence. The experiment runs once every two years in parallel with CASP, which focuses on predictions that incorporate human intervention and expertise. Compared to the related benchmarking techniques LiveBench and EVA, which run weekly against newly solved protein structures deposited in the Protein Data Bank, CAFASP generates much less data, but has the advantage of producing predictions that are directly comparable to those produced by human prediction experts. More recently, CAFASP has essentially been integrated into the CASP results rather than run as a separate experiment.
+
+
+== References ==
+
+
+== External links ==
+Protein Structure Prediction Center
+CAFASP4 (2004)
+CAFASP5 (2006)
--- a/data/en.wikipedia.org/wiki/CAMEO3D-0.md
+++ b/data/en.wikipedia.org/wiki/CAMEO3D-0.md
@ -0,0 +1,33 @@
+---
+title: "CAMEO3D"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/CAMEO3D"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:01.587230+00:00"
+instance: "kb-cron"
+---
+
+Continuous Automated Model EvaluatiOn (CAMEO) is a community-wide project to continuously evaluate the accuracy and reliability of protein structure prediction servers in a fully automated manner. CAMEO is a continuous and fully automated complement to the biennial CASP experiment.
+Currently, CAMEO evaluates predictions of three-dimensional protein structures (3D), ligand binding site predictions in proteins (LB), and model quality estimation tools (QE).
+
+
+== Workflow ==
+CAMEO performs blind assessment of protein structure prediction techniques based on the weekly releases of newly determined experimental structures by the Protein Data Bank (PDB). The amino acid sequences of soon-to-be-released protein structures are submitted to the participating web servers. The web servers return their predictions to CAMEO, and predictions received before the experimental structures have been released are included in the assessment of prediction accuracy. In contrast to the CASP experiment, the comparison between prediction and reference data is fully automated, and therefore requires numerical distance measures that are robust against relative domain movements.
+
+
+== History ==
+CAMEO was developed as part of the Protein Model Portal module of the Structural Biology Knowledge Base as part of the Protein Structure Initiative. CAMEO is being developed by the computational structural biology group at the SIB Swiss Institute of Bioinformatics and the Biozentrum, University of Basel.
+Earlier projects with similar aims were EVA and LiveBench.
+
+
+== See also ==
+Protein structure prediction software
+
+
+== References ==
+
+
+== External links ==
+CAMEO home page
+Protein Model Portal
--- a/data/en.wikipedia.org/wiki/CASP-0.md
+++ b/data/en.wikipedia.org/wiki/CASP-0.md
@ -0,0 +1,50 @@
+---
+title: "CASP"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/CASP"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:04.034765+00:00"
+instance: "kb-cron"
+---
+
+Critical Assessment of Structure Prediction (CASP), sometimes called Critical Assessment of Protein Structure Prediction, is a community-wide, worldwide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users. Even though the primary goal of CASP is to help advance the methods of identifying a protein's three-dimensional structure from its amino acid sequence, many view the experiment more as a "world championship" in this field of science. More than 100 research groups from all over the world participate in CASP on a regular basis, and it is not uncommon for entire groups to suspend their other research for months while they focus on getting their servers ready for the experiment and on performing the detailed predictions.
+
+== Selection of target proteins ==
+In order to ensure that no predictor can have prior information about a protein's structure that would put them at an advantage, it is important that the experiment be conducted in a double-blind fashion: Neither predictors nor the organizers and assessors know the structures of the target proteins at the time when predictions are made. Targets for structure prediction are either structures soon-to-be solved by X-ray crystallography or NMR spectroscopy, or structures that have just been solved (mainly by one of the structural genomics centers) and are kept on hold by the Protein Data Bank. If the given sequence is found to be related by common descent to a protein sequence of known structure (called a template), comparative protein modeling may be used to predict the tertiary structure. Templates can be found using sequence alignment methods (e.g. BLAST or HHsearch) or protein threading methods, which are better in finding distantly related templates. Otherwise, de novo protein structure prediction must be applied (e.g. Rosetta), which is much less reliable but can sometimes yield models with the correct fold (usually, for proteins less than 100-150 amino acids). Truly new folds are becoming quite rare among the targets, making that category smaller than desirable.
+
+== Evaluation ==
+The primary method of evaluation is a comparison of the predicted model α-carbon positions with those in the target structure. The comparison is shown visually by cumulative plots of distances between pairs of equivalent α-carbons in the alignment of the model and the structure, such as shown in the figure (a perfect model would stay at zero all the way across), and is assigned a numerical score GDT-TS (Global Distance Test—Total Score) describing the percentage of well-modeled residues in the model with respect to the target. Free modeling (template-free, or de novo) is also evaluated visually by the assessors, since the numerical scores do not work as well for finding loose resemblances in the most difficult cases. High-accuracy template-based predictions were evaluated in CASP7 by whether they worked for molecular-replacement phasing of the target crystal structure, with successes followed up later, and in CASP8 by full-model (not just α-carbon) model quality and full-model match to the target.
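As a rough illustration of the numerical score described above, the sketch below computes a GDT-TS-style value from a list of per-residue α-carbon distances. It assumes the commonly cited definition (the average, over cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues within each cutoff), which is not spelled out in this article. It also assumes the distances come from a single fixed superposition, whereas the real GDT procedure searches over many superpositions. The function name `gdt_ts` is hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT-TS from per-residue C-alpha distances in angstroms.

    Assumption: `distances` come from one fixed model-to-target
    superposition; the real GDT algorithm searches over superpositions
    to maximise the residue count at each cutoff.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    n = len(distances)
    # Fraction of residues within each distance cutoff.
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    # Average the fractions and express the result as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)
```

Under these assumptions a perfect model (all distances zero) scores 100, and a model in which no residue falls within 8 Å scores 0.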
+Evaluation of the results is carried out in the following prediction categories:
+
+tertiary structure prediction (all CASPs)
+secondary structure prediction (dropped after CASP5)
+prediction of structure complexes (CASP2 only; a separate experiment—CAPRI—carries on this subject)
+residue-residue contact prediction (starting CASP4)
+disordered regions prediction (starting CASP5)
+domain boundary prediction (CASP6–CASP8)
+function prediction (starting CASP6)
+model quality assessment (starting CASP7)
+model refinement (starting CASP7)
+high-accuracy template-based prediction (starting CASP7)
+Tertiary structure prediction category was further subdivided into:
+
+homology modeling
+fold recognition (also called protein threading; note that this naming is incorrect as threading is a method)
+de novo structure prediction, now referred to as 'New Fold' as many methods apply evaluation, or scoring, functions that are biased by knowledge of native protein structures, such as an artificial neural network.
+Starting with CASP7, categories have been redefined to reflect developments in methods. The 'Template based modeling' category includes all former comparative modeling, homologous fold based models and some analogous fold based models. The 'template free modeling (FM)' category includes models of proteins with previously unseen folds and hard analogous fold based models. Due to limited numbers of template free targets (they are quite rare), in 2011 so called CASP ROLL was introduced. This continuous (rolling) CASP experiment aims at more rigorous evaluation of template free prediction methods through assessment of a larger number of targets outside of the regular CASP prediction season. Unlike LiveBench and EVA, this experiment is in the blind-prediction spirit of CASP, i.e. all the predictions are made on yet unknown structures.
+The CASP results are published in special supplement issues of the scientific journal Proteins, all of which are accessible through the CASP website. A lead article in each of these supplements describes specifics of the experiment while a closing article evaluates progress in the field.
+
+== AlphaFold ==
+In December 2018, CASP13 made headlines when it was won by AlphaFold, an artificial intelligence program created by DeepMind. In November 2020, an improved version 2 of AlphaFold won CASP14. According to one of CASP co-founders John Moult, AlphaFold scored around 90 on a 100-point scale of prediction accuracy for moderately difficult protein targets. AlphaFold was made open source in 2021, and in CASP15 in 2022, while DeepMind did not enter, virtually all of the high-ranking teams used AlphaFold or modifications of AlphaFold.
+
+== NIH funding cancellation ==
+Until 2025, funding for the competition was provided by a grant from the National Institutes of Health. However, the NIH did not renew funding for the program in 2025, due to budget cuts made by the Trump administration. The competition was on the verge of closing down before Google DeepMind stepped in to provide interim funding.
+
+== See also ==
+Critical Assessment of Prediction of Interactions (CAPRI)
+Critical Assessment of Function Annotation (CAFA)
+Critical Assessment of Genome Interpretation (CAGI)
+
+== References ==
--- a/data/en.wikipedia.org/wiki/CASP-1.md
+++ b/data/en.wikipedia.org/wiki/CASP-1.md
@ -0,0 +1,65 @@
+---
+title: "CASP"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/CASP"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:04.034765+00:00"
+instance: "kb-cron"
+---
+
+== External links ==
+Official website
+CASP ROLL
+FORCASP Forum Archived 2020-12-01 at the Wayback Machine
+
+=== Result ranking ===
+Automated assessments for CASP15 (2022)
+
+Official ranking for servers only
+Official ranking for humans and servers
+Automated assessments for CASP14 (2020)
+
+Official ranking for servers only
+Official ranking for humans and servers
+Ranking by Zhang Lab
+Automated assessments for CASP13 (2018)
+
+Official ranking for servers only
+Official ranking for humans and servers
+Ranking by Zhang Lab
+Automated assessments for CASP12 (2016)
+
+Official ranking for servers only
+Official ranking for humans and servers
+Ranking by Zhang Lab
+Automated assessments for CASP11 (2014)
+
+Official ranking for servers only (126 targets)
+Official ranking for humans and servers (78 targets)
+Ranking by Zhang Lab
+Automated assessments for CASP10 (2012)
+
+Official ranking for servers only (127 targets)
+Official ranking for humans and servers (71 targets)
+Ranking by Zhang Lab
+Automated assessments for CASP9 (2010)
+
+Official ranking for servers only (147 targets)
+Official ranking for humans and servers (78 targets)
+Ranking by Grishin Lab (for server only)
+Ranking by Grishin Lab (for human and servers)
+Ranking by Zhang Lab
+Ranking by Cheng Lab
+Automated assessments for CASP8 (2008)
+
+Official ranking for servers only
+Official ranking for humans and servers
+Ranking by Zhang Lab
+Ranking by Grishin Lab
+Ranking by McGuffin Lab
+Ranking by Cheng Lab
+Automated assessments for CASP7 (2006)
+
+Ranking by Livebench
+Ranking by Zhang Lab
--- a/data/en.wikipedia.org/wiki/CERNO_test-0.md
+++ b/data/en.wikipedia.org/wiki/CERNO_test-0.md
@ -0,0 +1,89 @@
+---
+title: "CERNO test"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/CERNO_test"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:06.531884+00:00"
+instance: "kb-cron"
+---
+
+CERNO (Coincident Extreme Ranks in Numerical Observations) is a non-parametric, rank-based statistical test that evaluates the distribution of ranks for a subset of samples that have been labeled (the labels defining the subset). The method has been used in gene set and pathway analysis. In this applied context, the method assesses whether a predefined set of genes, proteins, or other features shows coincident enrichment for high or low ranks within a globally ranked list.
+
+
+== Publication of the Method ==
+The CERNO statistic was published in a 2008 study on interferon-beta-regulated gene expression in relapsing–remitting multiple sclerosis. It was subsequently used in transcriptomic and proteomic studies. The test was further described in the supplementary materials of a 2013 pharmacogenomics study.
+An independent, comprehensive evaluation of the algorithm was published by Zyla et al. in 2019.
+
+
+== Methodology ==
+
+The CERNO test evaluates whether the ranks of a set of genes or features within a genome-wide ranking (from most to least significant by any metric) are collectively more extreme than would be expected by chance. This makes it sensitive to sets with even a few strongly ranked members, rather than requiring uniform or over-a-threshold significance of all genes in the set.
+The test statistic for a gene set of size k in a ranked list of N genes is:
+
+$$S = -2 \sum_{i=1}^{k} \ln\left(\frac{r_i}{N}\right)$$
+
+where r_i is the rank of the ith gene in the set. Under the null hypothesis of random rank distribution, S follows a chi-square distribution with 2k degrees of freedom.
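The statistic and its null distribution can be sketched in a few lines of Python. This is a minimal illustration, not the tmod implementation: the function names are hypothetical, ranks are assumed to be 1-based with no ties, and the p-value uses the closed form of the chi-square upper tail that holds when the degrees of freedom are even (2k), so no statistics library is needed.

```python
import math


def cerno_statistic(ranks, n_total):
    """S = -2 * sum(ln(r_i / N)) for 1-based ranks r_i in a list of N features."""
    if not ranks:
        raise ValueError("the gene set must contain at least one rank")
    if any(r < 1 or r > n_total for r in ranks):
        raise ValueError("ranks must lie between 1 and n_total")
    return -2.0 * sum(math.log(r / n_total) for r in ranks)


def chi2_upper_tail_even_df(x, k):
    """P(X >= x) for a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    half = x / 2.0
    term = 1.0   # the j = 0 term
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def cerno_p_value(ranks, n_total):
    """P-value of the CERNO statistic under the null of random ranks."""
    s = cerno_statistic(ranks, n_total)
    return chi2_upper_tail_even_df(s, len(ranks))
```

For a single-gene set the p-value reduces to r/N, so a gene ranked first among 100 gives 0.01; larger sets are rewarded when several members sit near the top of the list.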
+
+
+== Comparison with Other Methods ==
+Zyla et al. noted some advantages of CERNO, including that it showed the highest reproducibility of the methods they investigated, as well as good sensitivity, prioritization and low computational time. That study notes the non-parametric method is robust to ranking metrics, as well as sample and gene set size.
+
+
+== CERNO is Related to Fisher's Method of Combining Tests ==
+The CERNO test is mathematically related to Fisher's method of combining p-values for independent statistical tests. Fisher's method is known for its favorable asymptotic properties, especially as measured by Bahadur efficiency, which describes the rate at which the observed significance of a test statistic converges to zero as the sample size increases. Tests with higher Bahadur efficiency exhibit rapid convergence.
+Littell and Folks (1971) demonstrated the asymptotic optimality of Fisher's method of combining tests, showing that for independent tests, the negative logarithm of the significance level (−2log(significance)) diverges to infinity at the fastest possible rate among combination tests.
+In contrast, the Kolmogorov–Smirnov test, which is the basis for several gene set analysis methods, was shown by Hwang (1982) to have much lower Bahadur efficiency compared to the chi-squared test. The Kolmogorov–Smirnov test is "always well worse" than the chi-squared test in this measure. This is relevant as the CERNO statistic S follows a chi-square distribution with 2k degrees of freedom.
+As the Kolmogorov–Smirnov test is the basis of many commonly used gene set enrichment analysis methods, CERNO—which reflects Fisher's combined test properties—may offer statistical power or efficiency advantages in this context.
+
+
+== Software ==
+The CERNO method is easily implemented due to its simple mathematical form. CERNO has been implemented in the tmod R package.
+
+
+== See also ==
+Fisher's method
+Mann–Whitney U test
+Order statistics
+Pathway analysis
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/CIT_Program_Tumor_Identity_Cards-0.md
+++ b/data/en.wikipedia.org/wiki/CIT_Program_Tumor_Identity_Cards-0.md
@ -0,0 +1,31 @@
+---
+title: "CIT Program Tumor Identity Cards"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/CIT_Program_Tumor_Identity_Cards"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:13.832859+00:00"
+instance: "kb-cron"
+---
+
+The "Cartes d'Identité des Tumeurs (CIT)" program, launched and funded by the French charity "Ligue Nationale contre le Cancer," aims to improve or develop better targeted therapeutic approaches by refining molecular knowledge of multiple types of tumors. The CIT program mainly relies on the large-scale and systematic profiling of large cohorts of tumors at various molecular levels including at least the genome, the epigenome, and the transcriptome.
+
+
+== See also ==
+Precision medicine
+Oncology
+Cancer Research
+Bioinformatics
+Computational genomics
+Oncogenomics
+Genomics
+Transcriptome
+Gene expression profiling
+
+
+== References ==
+
+
+== External links ==
+Official web site
+List of main scientific publications
--- a/data/en.wikipedia.org/wiki/CRAM_(file_format)-0.md
+++ b/data/en.wikipedia.org/wiki/CRAM_(file_format)-0.md
@ -0,0 +1,41 @@
+---
+title: "CRAM (file format)"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/CRAM_(file_format)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:24.770985+00:00"
+instance: "kb-cron"
+---
+
+Compressed Reference-oriented Alignment Map (CRAM) is a compressed columnar file format for storing biological sequences aligned to a reference sequence, initially devised by Markus Hsi-Yang Fritz et al.
+CRAM was designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than BAM files, depending on the data held within them.
+Implementations of CRAM exist in htsjdk, htslib, JBrowse, and Scramble.
+The file format specification is maintained by the Global Alliance for Genomics and Health (GA4GH) with the specification document available from the EBI cram toolkit page.
+
+
+== File format ==
+The basic structure of a CRAM file is a series of containers, the first of which holds a compressed copy of the SAM header.  Subsequent containers consist of a container Compression Header followed by a series of slices which in turn hold the alignment records themselves, formatted as a series of blocks.
+CRAM file:
+
+Container:
+
+Slice:
+
+CRAM constructs records from a set of data series, describing the components of an alignment.  The container Compression Header specifies which data series is encoded in which block, what codec will be used, and any codec specific meta-data (for example a table of Huffman symbol code lengths).  While data series can be mixed together within the same block, keeping them separate usually improves compression and provides the opportunity for efficient selective decoding where only some data types are required.
+Selective access to a CRAM file is granted via the index (with file-name suffix ".crai").  On chromosome and position sorted data this indicates which region is covered by each slice.  On unsorted data the index may be used to simply fetch the Nth container.  Selective decoding may also be achieved using the Compression Header to skip specified data series if partial records are required.
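The nesting described above (containers holding slices, slices holding per-data-series blocks) and the two kinds of selective access can be pictured with a small in-memory model. This is only a conceptual sketch: the class names, field names, and helper functions are hypothetical and do not reflect the on-disk CRAM encoding or any real library's API.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    data_series: str  # hypothetical series label, e.g. "quality"
    payload: bytes    # compressed bytes for that series


@dataclass
class Slice:
    ref_name: str     # reference sequence the slice covers
    start: int        # 1-based inclusive start position
    end: int          # 1-based inclusive end position
    blocks: list[Block] = field(default_factory=list)


@dataclass
class Container:
    # Stand-in for the Compression Header: data series -> codec name.
    compression_header: dict[str, str]
    slices: list[Slice] = field(default_factory=list)


def slices_overlapping(containers, ref_name, start, end):
    """Region query: the role played by the .crai index on sorted data."""
    return [
        s
        for c in containers
        for s in c.slices
        if s.ref_name == ref_name and s.start <= end and s.end >= start
    ]


def blocks_for_series(slice_, wanted):
    """Selective decoding: keep only blocks of the requested data series."""
    return [b for b in slice_.blocks if b.data_series in wanted]
```

A reader that needs only, say, base calls in one region would first narrow down to the overlapping slices and then decode only the matching blocks, which is why keeping data series in separate blocks helps.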
+
+
+== History ==
+
+CRAM version 4.0 exists as a prototype in Scramble, initially demonstrated in 2015, but has yet to be adopted as a standard.
+
+
+== See also ==
+SAM (file format)
+Binary Alignment Map
+Compression of Genomic Re-Sequencing Data
+List of file formats for molecular biology
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/CaBIG-0.md
+++ b/data/en.wikipedia.org/wiki/CaBIG-0.md
@ -0,0 +1,49 @@
+---
+title: "CaBIG"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/CaBIG"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:59.198753+00:00"
+instance: "kb-cron"
+---
+
+The cancer Biomedical Informatics Grid (caBIG) was a US government program to develop an open-source, open access information network called caGrid for secure data exchange on cancer research. The initiative was developed by the National Cancer Institute (part of the National Institutes of Health) and was maintained by the Center for Biomedical Informatics and Information Technology (CBIIT) and program managed by Booz Allen Hamilton. In 2011 a report on caBIG raised significant questions about effectiveness and oversight, and its budget and scope were significantly trimmed. In May 2012, the National Cancer Informatics Program (NCIP) was created as caBIG's successor program.
+
+== History ==
+The National Cancer Institute (NCI) of the United States funded the cancer Biomedical Informatics Grid (caBIG) initiative in spring 2004, headed by Kenneth Buetow.
+Its goal was to connect US biomedical cancer researchers using technology known as grid computing. The program, led by the Center for Bioinformatics and Information Technology (CBIIT), began with a 3-year pilot phase. The pilot phase concluded in March 2007, and a trial was announced.
+Buetow promoted the program in 2008.
+In addition to caGrid, the underlying infrastructure for data sharing among organizations, caBIG developed software tools, data sharing policies, and common standards and vocabularies to facilitate data sharing.
+Software tools targeted:
+
+Collection, analysis, and management of basic research data
+Clinical trials management, from patient enrollment to adverse event reporting and analysis
+Collection, annotation, sharing, and storage of medical imaging data
+Biospecimen management
+caBIG sought to provide foundational technology for an approach to biomedicine it called a “learning healthcare system.” This relies on the rapid exchange of information among all sectors of research and care, so that researchers and clinicians are able to collaboratively review and accurately incorporate the latest findings into their work. The ultimate goal was to speed the biomedical research process. It was also promoted for what is often called Personalized Medicine.
+caBIG technology was used in adaptive clinical trials such as the Investigation of Serial studies to Predict Your Therapeutic Response with Imaging and molecular AnaLysis 2 (I-SPY2), which was designed to use  biomarkers to determine the appropriate therapy for women with advanced breast cancer.
+
+== Health information technology ==
+Health information technology (HIT) was promoted for management and secure exchange of medical information among researchers, health care providers, and consumers. HIT initiatives mentioning caBIG were:
+NCI and the American Society of Clinical Oncology initiated a collaboration to create an oncology-specific electronic health record system using caBIG standards for interoperability and that will enable oncologists to manage patient information in an electronic format that accurately captures the specific interventional issues unique to oncology.
+The Nationwide Health Information Network was an initiative to share patient clinical data across geographically disparate sources and create electronically linked national health information exchange. It might be somehow related.
+
+== Collaborations ==
+A BIG Health Consortium was formed in 2008 to promote personalized medicine, but disbanded in 2012. 
+In July 2009, caBIG announced a collaboration with the Dr. Susan Love Research Foundation to build an online cohort of women willing to participate in clinical trials. Called the Army of Women, it had a goal of one million in its database; by December 2009 the site was "launched", and about 30,000 women and men signed up by 2010.
+The Cancer Genome Atlas aimed to characterize more than 10,000 tumors across at least 20 cancers by 2015. caBIG provided connectivity, data standards, and tools to collect, organize, share, and analyze the diverse research data in its database.
+Since 2007, NCI worked with UK National Cancer Research Institute (NCRI). The two organizations shared technologies for collaborative research and the secure exchange of research data using caGrid and the NCRI Oncology Information Exchange (ONIX) web portal announced in August 2009.
+ONIX shut down in March 2012.
+The Duke Cancer Institute used caBIG clinical trials tools in their collaboration with the Beijing Cancer Hospital of Peking University.
+
+== Implementation ==
+The project intended to connect  65 NCI-designated cancer centers to enable collaborative research.
+Participating institutions could either “adopt” caBIG tools to share data directly through caGrid, or “adapt” commercial or in-house developed software to be caBIG-compatible. The caBIG program developed software development kits  (SDKs) for interoperable software tools, and instructions on the process of adapting existing tools or developing applications to be caBIG-compatible.
+The Enterprise Support Network program included domain-specific expertise, and support service providers, third party organizations that provide assistance on a contract-for-services basis.
+A web portal using the Liferay software was available from 2008 to 2013.
+
+=== Open source ===
+Since 2004, the caBIG program used open-source communities, adapted from other public-private partnerships.  The caBIG program produced software under contract to software development teams largely within the commercial research community.
+In general, software developed under US government contracts is the property of the US government and the US taxpayers. Depending on the terms in specific contracts, they might be accessible only by request under the Freedom of Information Act (FOIA).  The timeliness of response to such requests might preclude a requester from ever gaining any secondary value from software released under a FOIA request. 
+The caBIG program placed all caBIG software in a software repository freely accessible for download. Open source means anyone can modify the downloaded software; however, the licensing applied to the downloaded software allows greater flexibility than is typical. An individual or enterprise is allowed to contribute the modified code back to the caBIG program but is not required to do so. Likewise, the modifications can be made available as open source but are not required to be. The caBIG licensing even allows the caBIG applications and components, combined with additions and modifications, to be released as commercial products. These aspects of the caBIG program actually encourage commercialization of caBIG technology.
--- a/data/en.wikipedia.org/wiki/CaBIG-1.md
+++ b/data/en.wikipedia.org/wiki/CaBIG-1.md
@ -0,0 +1,99 @@
+---
+title: "CaBIG"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/CaBIG"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:01:59.198753+00:00"
+instance: "kb-cron"
+---
+
+=== Results ===
+In 2008, GlaxoSmithKline announced it would share cancer cell genomic data with caBIG.
+Some private companies claimed benefits from caBIG technology in 2010.
+A caGrid community web site was created in 2007.
+The 1.x version of the core software was added to a GitHub project in mid-2013, under the BSD 3-Clause license.
+It used version 4.03 of the Globus Toolkit, and the Taverna workbench system to manage workflow and the Business Process Execution Language.
+Software called Introduce was developed around 2006.
+Contributors included the Ohio State University Center for Clinical and Translational Science, Duke University, University of Chicago - Argonne National Laboratory, and private companies Booz Allen Hamilton, Ekagra Software Technologies and Semantic Bits.
+
+=== Criticism ===
+By 2008, some questioned if the program was benefiting large pharmaceutical companies.
+By 2011, the project had spent an estimated $350 million. Although the goal was considered laudable, much of the software was unevenly adopted after being developed at great expense to compete with commercial offerings.
+In March 2011, an NCI working group assessment concluded that caBIG "...expanded far beyond those goals to implement an overly complex and ambitious software enterprise of NCI-branded tools, especially in the Clinical Trial Management System (CTMS) space. These have produced limited traction in the cancer community, compete against established commercial vendors, and create financially untenable long-term maintenance and support commitments for the NCI". In 2012, the NCI announced a new program, the National Cancer Informatics Program (NCIP), as a successor to caBIG.
+
+== caGrid ==
+ 
+The caGrid computer network and software supported the cancer Biomedical Informatics Grid (caBIG) initiative of the National Cancer Institute of the US National Institutes of Health.
+caBIG was a voluntary virtual informatics infrastructure that connected data, research tools, scientists, and organizations.
+In 2013, the National Cancer Informatics Program (NCIP) re-released caGrid under the BSD 3-Clause license, and migrated the source repository to GitHub.
+caGrid used version 4.03 of the Globus Toolkit, produced by the Globus Alliance.
+
+=== Program Management ===
+The caGrid project and much of its funding were managed by Booz Allen Hamilton.
+
+=== Portal ===
+The caGrid Portal was a Web-based application built on Liferay that enabled users to discover and interact with the services available on the caGrid infrastructure. The Portal served as the primary visualization tool for the caGrid middleware and as a caBIG information source. Through the caGrid Portal, users had access to information about caBIG participants, caGrid points of contact (POCs), and caGrid-related news and events.
+
+=== Workflow ===
+caGrid workflow uses:
+
+Active BPEL
+Taverna
+
+=== Contributors ===
+NCI CBIIT Program
+Booz Allen Hamilton
+Ohio State University
+University of Chicago, Argonne National Laboratory
+Duke University
+SemanticBits, LLC
+Ekagra Software Technologies
+
+=== Criticism ===
+In March 2011, the NCI published an extensive review of caBIG, the NCI CBIIT program that funded the caGrid software development (see [1], [2]). The review included a long list of problems with the program and recommended that most of the software development projects be discontinued.
+
+== References ==
+
+== Further reading ==
+Abernethy AP, Coeytauz R, Rowe K, Wheeler JL, Lyerly HK. Electronic patient-reported data capture as the foundation of a learning health care system. JCO. 2009;27:6522.
+Buetow KH. caBIG: proof of concept for personalized cancer care. JCO. 2009:27 Suppl 15S:e20712.
+Holford ME, Rajeevan H, Zhao H, Kidd KK, Cheung KH (2009). "Semantic web-based integration of cancer pathways and allele frequency data". Cancer Informatics. 8: 19–30. doi:10.4137/CIN.S1006. PMC 2664696. PMID 19458791.
+Huang T, Shenoy PJ, Sinha R, Graiser M, Bumpers KW, Flowers CR (2009). "Development of the Lymphoma Enterprise Architecture Database: A caBIG Silver level compliant System". Cancer Informatics. 8: 45–64. doi:10.4137/CIN.S940. PMC 2675136. PMID 19492074.
+Kunz I, Lin MC, Frey L (2009). "Metadata mapping and reuse in caBIG". BMC Bioinformatics. 10 (Suppl 2): S4. doi:10.1186/1471-2105-10-S2-S4. PMC 2646244. PMID 19208192.
+Ohmann C, Kuchinke W (2009). "Future developments of medical informatics from the viewpoint of networked clinical research. Interoperability and integration". Methods of Information in Medicine. 48 (1): 45–54. doi:10.3414/me9137. PMID 19151883. S2CID 23089030.
+Phan JH, Moffitt RA, Stokes TH, et al. (June 2009). "Convergence of biomarkers, bioinformatics and nanotechnology for individualized cancer treatment". Trends in Biotechnology. 27 (6): 350–8. doi:10.1016/j.tibtech.2009.02.010. PMC 3779321. PMID 19409634.
+Staes CJ, Xu W, LeFevre SD, et al. (2009). "A case for using grid architecture for state public health informatics: the Utah perspective". BMC Medical Informatics and Decision Making. 9: 32. doi:10.1186/1472-6947-9-32. PMC 2707374. PMID 19545428.
+Peter A. Covitz; Frank Hartel; Carl Schaefer; Sherri De Coronado; Gilberto Fragoso; Himanso Sahni; Scott Gustafson & Kenneth H. Buetow (April 23, 2003). "caCORE: A common infrastructure for cancer informatics". Bioinformatics. 19 (18): 2404–2412. doi:10.1093/bioinformatics/btg335. PMID 14668224.
+“Health IT gets personal,” InformationWeek (11/13/09)
+“Health data in the raw,” Archived 2010-12-26 at the Wayback Machine Government Health IT (11/6/09)
+“NCI to open research grid to cancer patient 'army',” Archived 2010-11-26 at the Wayback Machine Government Health IT (10/9/09)
+“GridBriefing: The future of Healthcare - eHealth and Grid Computing,” GridTalk (9/09)
+“Collaboration and Sustainability are Front and Center as caBIG Celebrates Fifth Anniversary,” GenomeWeb/BioInform (7/09)
+“Sharing the Wealth of Data,” Scientific American (5/09)
+“Translational Research Drives Demand for 'Virtual' Biobanks Built on caBIG Tools,” GenomeWeb/BioInform (4/3/09)
+"caGrid". Archived from the original on 2012-02-05. Retrieved 2016-11-22.
+"caGrid 1.0: An Enterprise Grid Infrastructure for Biomedical Research".
+"Enabling the Provisioning and Management of a Federated Grid Trust Fabric".
+"Introduce: An Open Source Toolkit for Rapid Development of Strongly Typed Grid Services".
+"caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid".
+Tan, Wei; Foster, Ian; Madduri, Ravi (2008). "Combining the Power of Taverna and caGrid: Scientific Workflows that Enable Web-Scale Collaboration". IEEE Internet Computing. 12 (6): 61–68. doi:10.1109/MIC.2008.120. S2CID 2690862.
+
+== External links ==
+caBIG Consumer/User Website (non-technical)
+caBIG Community Website (technical)
+caGrid wiki
+caGrid gforge project
+caGrid Portal
+
+=== Components ===
+Introduce Toolkit, also a Globus Incubator Project
+Data Services
+Metadata
+Security
+Credential Delegation Service (CDS)
+Dorian
+GAARDS
+Grid Grouper
+Grid Trust Service (GTS)
+WebSSO - Web Single Sign-on component, based on JASIG CAS
--- a/data/en.wikipedia.org/wiki/Canadian_Bioinformatics_Workshops-0.md
+++ b/data/en.wikipedia.org/wiki/Canadian_Bioinformatics_Workshops-0.md
@ -4,7 +4,7 @@ chunk: 1/1
 source: "https://en.wikipedia.org/wiki/Canadian_Bioinformatics_Workshops"
 category: "reference"
 tags: "science, encyclopedia"
-date_saved: "2026-05-05T06:55:46.973813+00:00"
+date_saved: "2026-05-05T14:02:02.852583+00:00"
 instance: "kb-cron"
 ---

--- a/data/en.wikipedia.org/wiki/Cellular_model-0.md
+++ b/data/en.wikipedia.org/wiki/Cellular_model-0.md
@ -0,0 +1,22 @@
+---
+title: "Cellular model"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Cellular_model"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:05.293321+00:00"
+instance: "kb-cron"
+---
+
+A cellular model or virtual cell is a computational model of aspects of a biological cell, for the purposes of in silico research.
+Developing such models has been a task of systems biology and mathematical biology. It involves developing efficient algorithms, data structures, visualization and communication tools to orchestrate the integration of large quantities of biological data with the goal of computer modeling. It involves the use of computer simulations of cellular subsystems, such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks.
+Recent efforts to create virtual cell models have applied deep learning and generative AI to single-cell omics data.
+
+== Overview ==
+The eukaryotic cell cycle is very complex and is one of the most studied topics, since its misregulation leads to cancers. It is possibly a good example of a mathematical model as it deals with simple calculus but gives valid results. Two research groups have produced several models of the cell cycle simulating several organisms. They have recently produced a generic eukaryotic cell cycle model which can represent a particular eukaryote depending on the values of the parameters, demonstrating that the idiosyncrasies of the individual cell cycles are due to different protein concentrations and affinities, while the underlying mechanisms are conserved (Csikasz-Nagy et al., 2006). 
+By means of a system of ordinary differential equations, these models show the change over time (a dynamical system) of protein concentrations inside a single typical cell; this type of model is called a deterministic process (whereas a model describing a statistical distribution of protein concentrations in a population of cells is called a stochastic process).
+Obtaining these equations takes an iterative series of steps. First, the several models and observations are combined to form a consensus diagram, and appropriate kinetic laws are chosen to write the differential equations: rate kinetics for stoichiometric reactions, Michaelis–Menten kinetics for enzyme-substrate reactions, and Goldbeter–Koshland kinetics for ultrasensitive transcription factors. Next, the parameters of the equations (rate constants, enzyme efficiency coefficients and Michaelis constants) are fitted to match observations; when they cannot be fitted, the kinetic equation is revised, and when that is not possible, the wiring diagram is modified. The parameters are fitted and validated using observations of both wild type and mutants, such as protein half-life and cell size.
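+
+As a reference sketch, the three rate laws named above are commonly written as follows. The symbols (k, V_max, K_m, v_1, v_2, J_1, J_2) are generic textbook notation assumed here for illustration; they are not parameters taken from any specific cell cycle model.
+
+```latex
+% Mass-action rate kinetics for a stoichiometric reaction A + B -> C
+\frac{d[C]}{dt} = k\,[A]\,[B]
+
+% Michaelis-Menten kinetics for an enzyme-substrate reaction S -> P
+\frac{d[P]}{dt} = \frac{V_{\max}\,[S]}{K_m + [S]}
+
+% Goldbeter-Koshland function: steady-state active fraction of a protein
+% switched on at rate v_1 and off at rate v_2, with Michaelis constants J_1, J_2
+G(v_1, v_2, J_1, J_2) = \frac{2\,v_1 J_2}{B + \sqrt{B^2 - 4\,(v_2 - v_1)\,v_1 J_2}},
+\qquad B = v_2 - v_1 + J_1 v_2 + J_2 v_1
+```
+
+The Goldbeter–Koshland function turns sharply from near 0 to near 1 as v_1 passes v_2 when J_1 and J_2 are small, which is what makes it a convenient description of ultrasensitive responses.
+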
+In order to fit the parameters the differential equations need to be studied. This can be done either by simulation or by analysis.
+In a simulation, given a starting vector (list of the values of the variables), the progression of the system is calculated by solving the equations at each time-frame in small increments.
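+
+A minimal sketch of one such stepping scheme is the forward Euler method. It assumes the model is written as dx/dt = f(x, p), with state vector x, parameter vector p and a small fixed step Δt; the source does not say which integrator the cell cycle models actually use.
+
+```latex
+% Forward Euler update, repeated from the starting vector x_0
+x(t + \Delta t) \approx x(t) + \Delta t \, f\bigl(x(t),\, p\bigr), \qquad x(0) = x_0
+```
+
+Repeating this update traces out one trajectory of the system. In practice, adaptive or implicit solvers are often preferred when the rate constants span very different time scales.
+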
+In analysis, the properties of the equations are used to investigate the behavior of the system depending on the values of the parameters and variables. A system of differential equations can be represented as a vector field, where each vector describes the change (in concentration of two or more proteins), determining where and how fast the trajectory (simulation) is heading. Vector fields can have several special points: a stable point, called a sink, that attracts in all directions (forcing the concentrations to be at a certain value); an unstable point, either a source or a saddle point, which repels (forcing the concentrations to change away from a certain value); and a limit cycle, a closed trajectory towards which several trajectories spiral (making the concentrations oscillate).
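+
+One standard way to classify these special points, sketched here in generic notation that is assumed rather than taken from the source, is to linearize the system dx/dt = f(x, p) around a steady state x* and inspect the eigenvalues of the Jacobian:
+
+```latex
+% Steady state and Jacobian of the vector field
+f(x^*, p) = 0, \qquad
+J = \left.\frac{\partial f}{\partial x}\right|_{x = x^*}, \qquad
+\det(J - \lambda I) = 0
+
+% Classification by the real parts of the eigenvalues \lambda_i
+% sink   (stable):   \operatorname{Re}\lambda_i < 0 \text{ for all } i
+% source (unstable): \operatorname{Re}\lambda_i > 0 \text{ for all } i
+% saddle (unstable): real parts of both signs
+```
+
+A limit cycle is not a steady state, so it is not captured by this test directly; it typically appears when a pair of complex eigenvalues crosses the imaginary axis as a parameter changes.
+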
+A better representation, which can handle the large number of variables and parameters, is the bifurcation diagram (bifurcation theory). The presence of these special steady-state points at certain values of a parameter (e.g. mass) is represented by a point; once the parameter passes a certain value, a qualitative change called a bifurcation occurs, in which the nature of the space changes, with profound consequences for the protein concentrations. The cell cycle has phases (partially corresponding to G1 and G2) in which mass, via a stable point, controls cyclin levels, and phases (S and M phases) in which the concentrations change independently. Once the phase has changed at a bifurcation event (cell cycle checkpoint), the system cannot go back to the previous levels, since at the current mass the vector field is profoundly different and the mass cannot be reversed back through the bifurcation event; this makes a checkpoint irreversible. In particular, the S and M checkpoints are regulated by means of special bifurcations called a Hopf bifurcation and an infinite period bifurcation.
--- a/data/en.wikipedia.org/wiki/Cellular_model-1.md
+++ b/data/en.wikipedia.org/wiki/Cellular_model-1.md
@ -0,0 +1,38 @@
+---
+title: "Cellular model"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Cellular_model"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:05.293321+00:00"
+instance: "kb-cron"
+---
+
+== Molecular level simulations ==
+Cell Collective is modeling software that enables one to house dynamical biological data and to build, simulate, break and recreate computational models. The development is led by Tomas Helikar, a researcher within the field of computational biology. It is designed for biologists, students learning about computational biology, teachers focused on teaching life sciences, and researchers within the field of life science. The complexities of math and computer science are built into the backend, and one can learn about the methods used for modeling biological species, but complex math equations, algorithms, and programming are not required and hence won't impede model building.
+The mathematical framework behind Cell Collective is based on a common qualitative (discrete) modeling technique where the regulatory mechanism of each node is described with a logical function.
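+
+For illustration, a logical (Boolean) model of this kind gives each node a state of 0 or 1 and updates it from the states of its regulators. The synchronous update and the three-node example below (a hypothetical node C activated by A and inhibited by B) are assumptions made for the sketch; the source does not describe Cell Collective's exact update scheme.
+
+```latex
+% General form: node i is updated from its k regulators j_1 ... j_k
+x_i(t+1) = f_i\bigl(x_{j_1}(t), \dots, x_{j_k}(t)\bigr), \qquad x_i \in \{0, 1\}
+
+% Hypothetical example: C is on only when A is present and B is absent
+x_C(t+1) = x_A(t) \land \lnot\, x_B(t)
+```
+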
+In the July 2012 issue of Cell, a team led by Markus Covert at Stanford published the most complete computational model of a cell to date. The model of the roughly 500-gene Mycoplasma genitalium contains 28 algorithmically-independent components incorporating work from over 900 sources. It accounts for interactions of the complete genome, transcriptome, proteome, and metabolome of the organism, marking a significant advancement for the field.
+Most attempts at modeling cell cycle processes have focused on the broad, complicated molecular interactions of many different chemicals, including several cyclin and cyclin-dependent kinase molecules as they correspond to the S, M, G1 and G2 phases of the cell cycle. In an article published in 2014 in PLOS Computational Biology, collaborators at University of Oxford, Virginia Tech and Institut de Génétique et Développement de Rennes produced a simplified model of the cell cycle using only one cyclin/CDK interaction. This model showed the ability to control fully functional cell division through regulation and manipulation of only this one interaction, and even allowed researchers to skip phases by varying the concentration of CDK. This model could help explain how the relatively simple interactions of one chemical translate to a cellular-level model of cell division.
+
+== Projects ==
+Multiple projects are in progress.
+
+CytoSolve - Commercial platform, possibly using MATLAB
+Synthecell - Experimental group
+Karyote - Indiana University - No longer active
+E-Cell Project - Last updated 2020
+VCell - University of Connecticut Health Center - Simulation platform rather than a build a cell project
+Silicon Cell - No longer active
+WholeCell - Stanford University - No longer active
+MCell - National Center for Multiscale Modeling of Biological Systems (MMBioS) - Active as of 2023
+Virtual Cell Challenge - competition to develop predictive models of single-cell transcriptional responses to CRISPR perturbations, hosted by Arc Institute and sponsored by Nvidia, 10x Genomics, and Ultima Genomics
+
+== See also ==
+Biological data visualization
+Biological Applications of Bifurcation Theory
+Molecular modeling software
+Membrane computing is the task of modeling specifically a cell membrane.
+Biochemical Switches in the Cell Cycle
+Masaru Tomita
+
+== References ==
--- a/data/en.wikipedia.org/wiki/ChIP-exo-0.md
+++ b/data/en.wikipedia.org/wiki/ChIP-exo-0.md
@ -0,0 +1,41 @@
+---
+title: "ChIP-exo"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/ChIP-exo"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:10.121307+00:00"
+instance: "kb-cron"
+---
+
+ChIP-exo is a chromatin immunoprecipitation based method for mapping the locations at which a protein of interest (transcription factor) binds to the genome. It is a modification of the ChIP-seq protocol, improving the resolution of binding sites from hundreds of base pairs to almost one base pair. It uses exonucleases to degrade strands of the protein-bound DNA in the 5'-3' direction to within a small number of nucleotides of the protein binding site. The nucleotides of the exonuclease-treated ends are determined using some combination of DNA sequencing, microarrays, and PCR. These sequences are then mapped to the genome to identify the locations at which the protein binds.
+
+== Theory ==
+Chromatin immunoprecipitation (ChIP) techniques have been in use since 1984 to detect protein-DNA interactions. There have been many variations on ChIP to improve the quality of results. One such improvement, ChIP-on-chip (ChIP-chip), combines ChIP with microarray technology. This technique has limited sensitivity and specificity, especially in vivo where microarrays are constrained by thousands of proteins present in the nuclear compartment, resulting in a high rate of false positives. Next came ChIP-sequencing (ChIP-seq), which combines ChIP with high-throughput sequencing. However, the heterogeneous nature of sheared DNA fragments maps binding sites to within ±300 base pairs, limiting specificity. Secondly, contaminating DNA presents a grave problem since so few genetic loci are cross-linked to the protein of interest, making any non-specific genomic DNA a significant source of background noise.
+To address these problems, Rhee and Pugh revised the classic nuclease protection assay to develop ChIP-exo. This new ChIP technique relies on a lambda exonuclease that degrades only, and all, unbound double-stranded DNA in the 5′ to 3′ direction.
+
+== Workflow ==
+
+=== ChIP ===
+Cells are crosslinked in vivo with formaldehyde to covalently bind proteins to DNA at their natural binding locations across a genome. Cells are then collected, broken open, and the chromatin sheared and solubilized by sonication. An antibody is then used to immunoprecipitate the protein of interest (engineering cells with an epitope tag can be useful for immunoprecipitation), along with the crosslinked DNA. DNA PCR adaptors are then ligated to the ends, which serve as a priming point for second strand DNA synthesis after the exonuclease digestion. Lambda exonuclease then digests double-stranded DNA from the 5′ end until digestion is blocked at the border of the protein-DNA covalent interaction. Most contaminating DNA is degraded by the addition of a second single-strand specific exonuclease. After the cross-linking is reversed, the primers to the PCR adaptors are extended to form double-stranded DNA, and a second adaptor is ligated to 5′ ends to demarcate the precise location where exonuclease digestion stopped. The library is then amplified by PCR, and the products are identified by high-throughput sequencing. This method allows for resolution of up to a single base pair for any protein binding site within any genome, which is a much higher resolution than either ChIP-chip or ChIP-seq.
+
+=== Sequencing ===
+ChIP-exo uses short-read (e.g. Illumina NGS) sequencing. Sequencing requirements are lower for ChIP-exo than for other assays like ChIP-seq because the dramatically reduced "shouldering" of a higher-resolution assay like ChIP-exo means that the sampling of DNA fragments for constructing the DNA library is better dominated by target-bound sites (this effect can vary across different targets).
+For paired-end data from a standard ChIP-exo prep, the 5' end of Read 1 sequenced from the DNA fragments marks the position of the cross-linking site (lambda exonuclease digestion stop site). Paired-end sequencing improves the mappability and specificity of read alignments, especially for large genomes.
+
+== Protocols ==
+
+=== ChIP-exo 1.x ===
+ChIP-exo 1.x improves on ChIP-seq by generating data with higher positional resolution by adding a lambda exonuclease digestion step. This higher resolution enables the capture of the organization of factors within a complex. Whereas version 1.0 was originally designed for the ABI SOLiD platform, ChIP-exo 1.1 makes the assay compatible with the Illumina NGS platform.
+
+=== ChIP-exo 2.x (ChIP-nexus) ===
+ChIP-nexus uses a circular rather than linear DNA library, and increases the efficiency of adapter ligation through CircLigase. However, the assay requires additional endonuclease digestion, and published ChIP-nexus data show losses due to poor barcode quality.
+
+=== ChIP-exo 3.x ===
+ChIP-exo 3.x employs one-step adapter attachment using Tn5 tagmentation. This version of ChIP uses fewer steps than previous protocols while simultaneously retaining high resolution. However, the libraries produced may be enriched for longer fragments, since tagmentation by Tn5 may occur at a higher frequency for such fragments.
+
+=== ChIP-exo 4.x ===
+ChIP-exo 4.x aims to streamline library construction and avoid the library biases of Tn5. ssDNA splint ligation is incorporated into the workflow. ChIP-exo 4.x is the simplest ChIP-exo version, but 4.0 may produce some steric exclusion of the adapter, and 4.1 may have lower precision.
+
+=== ChIP-exo 5.0 ===
+ChIP-exo 5.0 was developed to improve precision by reducing the "shouldering" found in versions 3.x and 4.x. Enzymatic steps are largely reduced, and as a result, library yield is greatly increased and signal concentration is maximized. 5.0 offers what the authors considered the best compromise in achieving high precision with a streamlined protocol at the time of publication.
--- a/data/en.wikipedia.org/wiki/ChIP-exo-1.md
+++ b/data/en.wikipedia.org/wiki/ChIP-exo-1.md
@ -0,0 +1,57 @@
+---
+title: "ChIP-exo"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/ChIP-exo"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:10.121307+00:00"
+instance: "kb-cron"
+---
+
+== Advantages ==
+High resolution: ChIP-exo has been shown to give up to single base pair resolution in identifying protein binding locations. This is in contrast to ChIP-seq, which can locate a protein's binding site only to within ±300 base pairs.
+Lower rate of false positives: Contamination of non-protein-bound DNA fragments can result in a high rate of false positives and negatives in ChIP experiments. The addition of exonucleases to the process not only improves resolution of binding-site calling, but removes contaminating DNA from the solution before sequencing.
+Proteins that are inefficiently bound to a nucleotide fragment are more likely to be detected by ChIP-exo. This has allowed, for example, the recognition of more CTCF transcription factor binding sites than previously discovered.
+Lower sequencing requirements: Due to the higher resolution and reduced background, less depth of sequencing coverage is needed when using ChIP-exo.
+Protein complex/Co-factor information: The direct crosslinking profiles from ChIP-exo data (primary peaks) can sometimes provide information about where on the DNA proteins interact. ChIP-exo data sometimes also captures positions of indirect crosslinking sites for secondary proteins ("piggybacking") in complex with the ChIP target (secondary peaks). These profiles can provide clues to the interaction between protein partners.
+
+== Limitations ==
+Antibodies: As with any ChIP-based method, a suitable antibody for the protein of interest needs to be available in order to use this technique. Thus, the specificity, availability, and reproducibility of antibodies must be taken into consideration, or strains with epitope tags must be engineered.
+Crosslinking: ChIP-exo uses formaldehyde crosslinking which has raised a variety of concerns within the genomics field, including the fact that cross-linking efficiency varies widely between different proteins. Certain targets may be "invisible" to ChIP-based approaches.
+Inaccessible heterochromatin: In part due to the crosslinking, more densely packed regions of the genome like heterochromatin are less extractable. As a result, it can be difficult to observe evidence of target binding in such regions using ChIP-based approaches.
+If a protein-DNA complex has multiple locations of cross-linking within a single binding event, then it can appear as though there are multiple distinct binding events. This likely results from these proteins being denatured and cross-linking at one of the available binding sites within the same event. The exonuclease would then stop at one of the bound sites, depending on which site the protein is cross-linked to. To get around this, there are certain peak calling methods (e.g. ChExMix) that take into account local crosslinking profiles to identify these multi-crosslink profiles as a single binding site.
+
+== Applications ==
+Rhee and Pugh introduced ChIP-exo by performing analyses on a small collection of transcription factors: Reb1, Gal4, Phd1, Rap1 in yeast and CTCF in human. Reb1 sites were often found in clusters and these clusters had ~10-fold higher occupancy than expected. Secondary sites in clusters were found ~40 bp from a primary binding site. Binding motifs of Gal4 showed a strong preference for three of the four nucleotides, suggesting a negative interaction between Gal4 and the excluded nucleotide. Phd1 recognizes three different motifs, which explains previous reports of the ambiguity of Phd1's binding motif. Rap1 was found to recognize four motifs.
+Ribosomal protein genes bound by this protein had a tendency to use a particular motif with a stronger consensus sequence. Other genes often used clusters of weaker consensus motifs, possibly to achieve a similar occupancy. Binding motifs of CTCF employed four "modules". Half of the bound CTCF sites used modules 1 and 2, while the rest used some combination of the four. It is believed that CTCF uses its zinc fingers to recognize different combinations of these modules.
+Rhee and Pugh analyzed pre-initiation complex (PIC) structure and organization in Saccharomyces genomes. Using ChIP-exo, they were able to, among other discoveries, precisely identify TATA-like features in promoters reported to be TATA-less.
+
+== Similar Methods ==
+
+=== PB-exo ===
+PB-exo was developed as an in vitro version of ChIP-exo (or "-exo" version of PB-seq). Purified and sonicated naked genomic DNA is incubated with purified factors and then formaldehyde cross-linked. After this, the standard ChIP-exo protocol is followed. Like PB-seq, PB-exo provides information about genomic factor binding in the absence of chromatin structure or other secondary factor binding partners.
+
+=== WhIP-exo ===
+WhIP-exo is related to PB-exo, except that instead of, or in addition to, a purified target protein, the naked genomic DNA is incubated with crude whole-cell extract. This assays the genomic binding of a protein target in the presence of any potential cofactors. The set of available cofactors in the whole-cell extract can be further curated through the modification of the source's genetic background.
+
+=== PIP-seq ===
+PIP-seq is a single-nucleotide resolution assay that determines the position of single-stranded DNA bound protein targets by combining ChIP-seq, ChIP-exo, and permanganate (KMnO4) footprinting techniques. Permanganate treatment oxidizes single-stranded thymines, and after the target protein is immunoprecipitated with the crosslinked DNA, piperidine treatment cleaves the DNA fragments at the oxidized thymines. The library is then prepared and sequenced. Bioinformatic filtering for reads whose immediate upstream reference nucleotide is thymine ("T") enriches the signal for single-strand bound fragments, after which downstream analysis can be performed.
+
+== See also ==
+Chromatin immunoprecipitation
+Protein-DNA interaction
+Epigenomics
+ChIP-seq
+ChIP-on-chip
+CUT&RUN
+CUT&Tag
+
+== References ==
+
+== External links ==
+DNA-protein interactions in high definition
+Resolving transcription factor binding
+High-resolution chromatin immunoprecipitation
+Important Gene-Regulation Proteins Pinpointed by New Method
+CexoR: An R/Bioconductor Package to Uncover High-resolution Protein-DNA Interactions in ChIP-exo Replicates
+Peconic Genomics
--- a/data/en.wikipedia.org/wiki/ChIP-on-chip-0.md
+++ b/data/en.wikipedia.org/wiki/ChIP-on-chip-0.md
@ -0,0 +1,27 @@
+---
+title: "ChIP-on-chip"
+chunk: 1/3
+source: "https://en.wikipedia.org/wiki/ChIP-on-chip"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:11.302460+00:00"
+instance: "kb-cron"
+---
+
+ChIP-on-chip (also known as ChIP-chip) is a technology that combines chromatin immunoprecipitation ('ChIP') with DNA microarray ("chip"). Like regular ChIP, ChIP-on-chip is used to investigate interactions between proteins and DNA in vivo. Specifically, it allows the identification of the cistrome, the sum of binding sites, for DNA-binding proteins on a genome-wide basis. Whole-genome analysis can be performed to determine the locations of binding sites for almost any protein of interest. As the name of the technique suggests, such proteins are generally those operating in the context of chromatin. The most prominent representatives of this class are transcription factors, replication-related proteins, like origin recognition complex protein (ORC), histones, their variants, and histone modifications.
+The goal of ChIP-on-chip is to locate protein binding sites that may help identify functional elements in the genome. For example, in the case of a transcription factor as a protein of interest, one can determine its transcription factor binding sites throughout the genome. Other proteins allow the identification of promoter regions, enhancers, repressors and silencing elements, insulators, boundary elements, and sequences that control DNA replication. If histones are subject of interest, it is believed that the distribution of modifications and their localizations may offer new insights into the mechanisms of regulation.
+One of the long-term goals ChIP-on-chip was designed for is to establish a catalogue of (selected) organisms that lists all protein-DNA interactions under various physiological conditions. This knowledge would ultimately help in the understanding of the machinery behind gene regulation, cell proliferation, and disease progression. Hence, ChIP-on-chip offers both potential to complement our knowledge about the orchestration of the genome on the nucleotide level and information on higher levels of information and regulation as it is propagated by research on epigenetics.
+
+== Technological platforms ==
+
+The technical platforms to conduct ChIP-on-chip experiments are DNA microarrays, or "chips". They can be classified and distinguished according to various characteristics:
+Probe type: DNA arrays can comprise either mechanically spotted cDNAs or PCR-products, mechanically spotted oligonucleotides, or oligonucleotides that are synthesized in situ. The early versions of microarrays were designed to detect RNAs from expressed genomic regions (open reading frames aka ORFs). Although such arrays are perfectly suited to study gene expression profiles, they have limited importance in ChIP experiments since most "interesting" proteins with respect to this technique bind in intergenic regions. Nowadays, even custom-made arrays can be designed and fine-tuned to match the requirements of an experiment. Also, any sequence of nucleotides can be synthesized to cover genic as well as intergenic regions.
+Probe size: Early versions of cDNA arrays had a probe length of about 200bp. The latest array versions use oligos as short as 70- (Microarrays, Inc.) to 25-mers (Affymetrix). (Feb 2007)
+Probe composition: There are tiled and non-tiled DNA arrays. Non-tiled arrays use probes selected according to non-spatial criteria, i.e., the DNA sequences used as probes have no fixed distances in the genome. Tiled arrays, however, select a genomic region (or even a whole genome) and divide it into equal chunks. Such a region is called a tiled path. The average distance between each pair of neighboring chunks (measured from the center of each chunk) gives the resolution of the tiled path. A path can be overlapping, end-to-end or spaced.
+Array size: The first microarrays used for ChIP-on-chip contained about 13,000 spotted DNA segments representing all ORFs and intergenic regions from the yeast genome. Nowadays, Affymetrix offers whole-genome tiled yeast arrays with a resolution of 5bp (3.2 million probes in all). Tiled arrays for the human genome are becoming more and more powerful, too. To name just one example, Affymetrix offers a set of seven arrays with about 90 million probes, spanning the complete non-repetitive part of the human genome with about 35bp spacing. (Feb 2007)
+Besides the actual microarray, other hard- and software equipment is necessary to run ChIP-on-chip experiments. It is generally the case that one company's microarrays can not be analyzed by another company's processing hardware. Hence, buying an array requires also buying the associated workflow equipment. The most important elements are, among others, hybridization ovens, chip scanners, and software packages for subsequent numerical analysis of the raw data.
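
The notion of a tiled path and its resolution can be made concrete with a small sketch. The function names (`tile_region`, `resolution`, `layout`) and the coordinates are illustrative assumptions, not part of any vendor's array design software; the code only mirrors the definitions given above (equal-sized chunks, resolution as the mean center-to-center distance, and the overlapping / end-to-end / spaced distinction).

```python
def tile_region(start, end, probe_len, step):
    """Return half-open (probe_start, probe_end) intervals tiling [start, end)."""
    probes = []
    pos = start
    while pos + probe_len <= end:
        probes.append((pos, pos + probe_len))
        pos += step
    return probes


def resolution(probes):
    """Mean distance between the centers of neighboring probes."""
    if len(probes) < 2:
        raise ValueError("need at least two probes to define a resolution")
    centers = [(s + e) / 2 for s, e in probes]
    gaps = [b - a for a, b in zip(centers, centers[1:])]
    return sum(gaps) / len(gaps)


def layout(probe_len, step):
    """Classify a tiled path by comparing probe length and step size."""
    if step < probe_len:
        return "overlapping"
    if step == probe_len:
        return "end-to-end"
    return "spaced"


# Hypothetical example: 25-mers placed every 35 bp over a 100 bp region.
probes = tile_region(0, 100, probe_len=25, step=35)
print(probes, resolution(probes), layout(25, 35))
```

With equally spaced probes the resolution is simply the step size; the averaging only matters when probe positions are irregular.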
+
+== Workflow of a ChIP-on-chip experiment ==
+Starting with a biological question, a ChIP-on-chip experiment can be divided into three major steps: The first is to set up and design the experiment by selecting the appropriate array and probe type. Second, the actual experiment is performed in the wet-lab. Last, during the dry-lab portion of the cycle, gathered data are analyzed to either answer the initial question or lead to new questions so that the cycle can start again.
+
+=== Wet-lab portion of the workflow ===
--- a/data/en.wikipedia.org/wiki/ChIP-on-chip-1.md
+++ b/data/en.wikipedia.org/wiki/ChIP-on-chip-1.md
@ -0,0 +1,24 @@
+---
+title: "ChIP-on-chip"
+chunk: 2/3
+source: "https://en.wikipedia.org/wiki/ChIP-on-chip"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:11.302460+00:00"
+instance: "kb-cron"
+---
+
+In the first step, the protein of interest (POI) is cross-linked with the DNA site it binds to in an in vitro environment. Usually this is done by a gentle formaldehyde fixation that is reversible with heat.
+Then, the cells are lysed and the DNA is sheared by sonication or using micrococcal nuclease. This results in double-stranded DNA fragments, normally 1 kb or less in length. Those that were cross-linked to the POI form a POI-DNA complex.
+In the next step, only these complexes are filtered out of the set of DNA fragments, using an antibody specific to the POI. The antibodies may be attached to a solid surface or a magnetic bead, or may have some other physical property that allows cross-linked complexes to be separated from unbound fragments. This procedure is essentially an immunoprecipitation (IP) of the protein. It can be done either by using a tagged protein with an antibody against the tag (e.g., FLAG, HA, c-myc) or with an antibody to the native protein.
+The cross-linking of POI-DNA complexes is reversed (usually by heating) and the DNA strands are purified. For the rest of the workflow, the POI is no longer necessary.
+After an amplification and denaturation step, the single-stranded DNA fragments are labeled with a fluorescent tag such as Cy5 or Alexa 647.
+Finally, the fragments are poured over the surface of the DNA microarray, which is spotted with short, single-stranded sequences that cover the genomic portion of interest. Whenever a labeled fragment "finds" a complementary fragment on the array, the two hybridize and again form a double-stranded DNA fragment.
+
+=== Dry-lab portion of the workflow ===
+
+After a sufficiently large time frame to allow hybridization, the array is illuminated with fluorescent light. Those probes on the array that are hybridized to one of the labeled fragments emit a light signal that is captured by a camera. This image contains all raw data for the remaining part of the workflow.
+This raw data, encoded as a false-color image, needs to be converted to numerical values before the actual analysis can be done. Analyzing the raw data and extracting information from it often remains the most challenging part of ChIP-on-chip experiments. Problems arise throughout this portion of the workflow: in the initial chip read-out, in choosing suitable methods to subtract background noise, and in finding appropriate algorithms that normalize the data and make it available for subsequent statistical analysis, which ideally leads to a better understanding of the biological question that the experiment seeks to address. Furthermore, because the array platforms differ and are not standardized, data storage and exchange is a major problem. Generally speaking, the data analysis can be divided into three major steps:
+During the first step, the captured fluorescence signals from the array are normalized, using control signals derived from the same or a second chip. Such control signals tell which probes on the array were hybridized correctly and which bound nonspecifically.
+In the second step, numerical and statistical tests are applied to control data and IP fraction data to identify POI-enriched regions along the genome. The following three methods are used widely: median percentile rank, single-array error, and sliding-window. These methods generally differ in how low-intensity signals are handled, how much background noise is accepted, and which trait for the data is emphasized during the computation. In the recent past, the sliding-window approach seems to be favored and is often described as most powerful.
+In the third step, these regions are analyzed further. If, for example, the POI was a transcription factor, such regions would represent its binding sites. Subsequent analysis then may want to infer nucleotide motifs and other patterns to allow functional annotation of the genome.
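
The three analysis steps above can be sketched in a deliberately simplified form. This is not any of the published methods named in the text; it is an assumed toy variant of the sliding-window idea in which each probe's IP signal is normalized against a control signal as a log2 ratio, the ratios are smoothed with a window median, and windows above a threshold are reported. The function names, window size, threshold, and intensities are all hypothetical.

```python
import math
from statistics import median


def log_ratios(ip, control):
    """Per-probe log2(IP / control); a simple stand-in for normalization."""
    return [math.log2(i / c) for i, c in zip(ip, control)]


def sliding_window_scores(values, window):
    """Median of each run of `window` consecutive probe values."""
    return [median(values[k:k + window])
            for k in range(len(values) - window + 1)]


def enriched_windows(ip, control, window=3, threshold=1.0):
    """Start indices of windows whose median log2 ratio reaches the threshold."""
    scores = sliding_window_scores(log_ratios(ip, control), window)
    return [k for k, score in enumerate(scores) if score >= threshold]


# Hypothetical intensities for seven neighboring probes along the genome.
ip_signal = [100, 110, 400, 420, 390, 105, 95]
control_signal = [100] * 7
print(enriched_windows(ip_signal, control_signal))
```

Using the median rather than the mean makes a window less sensitive to a single outlying probe, which is one reason window-based approaches tolerate noisy low-intensity signals.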
--- a/data/en.wikipedia.org/wiki/ChIP-on-chip-2.md
+++ b/data/en.wikipedia.org/wiki/ChIP-on-chip-2.md
@ -0,0 +1,39 @@
+---
+title: "ChIP-on-chip"
+chunk: 3/3
+source: "https://en.wikipedia.org/wiki/ChIP-on-chip"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:11.302460+00:00"
+instance: "kb-cron"
+---
+
+== Strengths and weaknesses ==
+Using tiled arrays, ChIP-on-chip allows for high resolution of genome-wide maps. These maps can determine the binding sites of many DNA-binding proteins like transcription factors and also chromatin modifications.
+Although ChIP-on-chip can be a powerful technique in the area of genomics, it is very expensive. Most published studies using ChIP-on-chip repeat their experiments at least three times to ensure biologically meaningful maps. The cost of the DNA microarrays is often a limiting factor to whether a laboratory should proceed with a ChIP-on-chip experiment. Another limitation is the size of DNA fragments that can be achieved. Most ChIP-on-chip protocols utilize sonication as a method of breaking up DNA into small pieces. However, sonication is limited to a minimal fragment size of 200 bp. For higher resolution maps, this limitation should be overcome to achieve smaller fragments, preferably to single nucleosome resolution. As mentioned previously, the statistical analysis of the huge amount of data generated from arrays is a challenge and normalization procedures should aim to minimize artifacts and determine what is really biologically significant. So far, application to mammalian genomes has been a major limitation, for example, due to the significant percentage of the genome that is occupied by repeats. However, as ChIP-on-chip technology advances, high resolution whole mammalian genome maps should become achievable.
+Antibodies used for ChIP-on-chip can be an important limiting factor. ChIP-on-chip requires highly specific antibodies that recognize their epitope both in free solution and under fixed conditions. An antibody that has been demonstrated to successfully immunoprecipitate cross-linked chromatin is termed "ChIP-grade". Companies that provide ChIP-grade antibodies include Abcam, Cell Signaling Technology, Santa Cruz, and Upstate. To overcome the problem of specificity, the protein of interest can be fused to a tag, such as FLAG or HA, that is recognized by antibodies. An alternative to ChIP-on-chip that does not require antibodies is DamID.
+Antibodies against specific histone modifications, such as H3 trimethyl K4, are also available. As mentioned before, the combination of these antibodies and ChIP-on-chip has become extremely powerful for whole-genome analysis of histone modification patterns and will contribute tremendously to our understanding of the histone code and epigenetics.
+A study demonstrating the non-specific nature of DNA binding proteins has been published in PLoS Biology. This indicates that alternate confirmation of functional relevancy is a necessary step in any ChIP-chip experiment.
+
+== History ==
+A first ChIP-on-chip experiment was performed in 1999 to analyze the distribution of cohesin along budding yeast chromosome III. Although the genome was not completely represented, the protocol in this study is equivalent to those used in later studies. The ChIP-on-chip technique using all of the ORFs of the genome (which nevertheless remained incomplete, missing intergenic regions) was then applied successfully in three papers published in 2000 and 2001. The authors identified binding sites for individual transcription factors in the budding yeast Saccharomyces cerevisiae. In 2002, Richard Young's group determined the genome-wide positions of 106 transcription factors using a c-Myc tagging system in yeast. The first demonstration of the mammalian ChIP-on-chip technique, which reported the isolation of nine chromatin fragments containing weak and strong E2F binding sites, was carried out by Peggy Farnham's lab in collaboration with Michael Zhang's lab and published in 2001. This study was followed several months later by a collaboration between the Young lab and the laboratory of Brian Dynlacht, which used the ChIP-on-chip technique to show for the first time that E2F targets encode components of the DNA damage checkpoint and repair pathways, as well as factors involved in chromatin assembly/condensation, chromosome segregation, and the mitotic spindle checkpoint. Other applications for ChIP-on-chip include DNA replication, recombination, and chromatin structure. Since then, ChIP-on-chip has become a powerful tool for determining genome-wide maps of histone modifications and many more transcription factors. ChIP-on-chip in mammalian systems has been difficult due to the large and repetitive genomes. Thus, many studies in mammalian cells have focused on select promoter regions that are predicted to bind transcription factors and have not analyzed the entire genome. However, whole mammalian genome arrays have become commercially available from companies like Nimblegen. In the future, as ChIP-on-chip arrays become more advanced, high-resolution whole-genome maps of DNA-binding proteins and chromatin components for mammals will be analyzed in more detail.
+
+== Alternatives ==
+Introduced in 2007, ChIP sequencing (ChIP-seq) also uses chromatin immunoprecipitation of proteins of interest cross-linked to DNA, but instead of a microarray it uses sequencing, a more accurate and higher-throughput method, to localize interaction points.
+DamID is an alternative method that does not require antibodies.
+ChIP-exo uses exonuclease treatment to achieve up to single base pair resolution.
+CUT&RUN sequencing uses antibody recognition with targeted enzymatic cleavage to address some technical limitations of ChIP.
+
+== References ==
+
+== Further reading ==
+Johnson, W. E.; Li, W.; Meyer, C. A.; Gottardo, R.; Carroll, J. S.; Brown, M.; Liu, X. S. (2006). "Model-based analysis of tiling-arrays for ChIP-chip". Proceedings of the National Academy of Sciences. 103 (33): 12457–12462. Bibcode:2006PNAS..10312457J. doi:10.1073/pnas.0601180103. ISSN 0027-8424. PMC 1567901. PMID 16895995.
+Benoukraf, Touati; Cauchy, Pierre; Fenouil, Romain; Jeanniard, Adrien; Koch, Frederic; Jaeger, Sébastien; Thieffry, Denis; Imbert, Jean; Andrau, Jean-Christophe; Spicuglia, Salvatore; Ferrier, Pierre (2009). "CoCAS: a ChIP-on-chip analysis suite". Bioinformatics. 25 (7): 954–955. doi:10.1093/bioinformatics/btp075. ISSN 1460-2059. PMC 2660873. PMID 19193731.
+
+== External links ==
+http://www.genome.gov/10005107 ENCODE project
+Chip-on-Chip (CoC) Package Information from Amkor Technology
+
+=== Analysis and software ===
+[1] CoCAS: a free Analysis software for Agilent ChIP-on-Chip experiments
+[2] rMAT: R implementation from MAT program to normalize and analyze tiling arrays and ChIP-chip data.
--- a/data/en.wikipedia.org/wiki/Chemical_library-0.md
+++ b/data/en.wikipedia.org/wiki/Chemical_library-0.md
@ -0,0 +1,51 @@
+---
+title: "Chemical library"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Chemical_library"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:07.707315+00:00"
+instance: "kb-cron"
+---
+
+A chemical library or compound library is a collection of stored chemicals usually used ultimately in high-throughput screening or industrial manufacture. In simple terms, a chemical library consists of a series of stored chemicals. Each chemical has associated information stored in some kind of database, such as the chemical structure, purity, quantity, and physicochemical characteristics of the compound.
+
+
+== Purpose ==
+In drug discovery high-throughput screening, it is desirable to screen a drug target against a selection of chemicals that try to take advantage of as much of the appropriate chemical space as possible. The chemical space of all possible chemical structures is extraordinarily large. Most stored chemical libraries do not typically have a fully represented or sampled chemical space mostly because of storage and cost concerns. However, since many molecular interactions cannot be predicted, the wider the chemical space that is sampled by the chemical library, the better the chance that high-throughput screening will find a "hit"—a chemical with an appropriate interaction in a biological model that might be developed into a drug.
+An example of a chemical library in drug discovery would be a series of chemicals known to inhibit kinases, or in industrial processes, a series of catalysts known to polymerize resins.
+
+
+== Generation of chemical libraries ==
+Chemical libraries are usually generated for a specific goal, and larger chemical libraries may be made up of several groups of smaller libraries stored in the same location. In the drug discovery process, for instance, a wide range of organic chemicals is needed to test against models of disease in high-throughput screening. Therefore, most of the chemical synthesis needed to generate chemical libraries in drug discovery is based on organic chemistry. A company that is interested in screening for kinase inhibitors in cancer may limit its chemical libraries and synthesis to just those types of chemicals known to have affinity for ATP binding sites or allosteric sites.
+Generally, however, most chemical libraries focus on large groups of varied organic chemical series where an organic chemist can make many variations on the same molecular scaffold or molecular backbone. Sometimes chemicals can be purchased from outside vendors as well and included in an internal chemical library.
+Depending upon their scope and design, chemical libraries can also be classified as diversity-oriented, drug-like, lead-like, peptide-mimetic, natural product-like, or targeted against a specific family of biological targets such as kinases, GPCRs, proteases, or PPIs. These chemical libraries are often used in target-based drug discovery (reverse pharmacology). Fragment compound libraries, which are mainly used for fragment-based lead discovery, should also be noted among the compound libraries.
+
+
+== Design and optimization of chemical libraries ==
+Chemical libraries are usually designed by chemists and chemoinformatics scientists and synthesized by organic and medicinal chemists. The method of chemical library generation usually depends on the project, and there are many factors to consider when using rational methods to select screening compounds. Typically, a range of chemicals is screened against a particular drug target or disease model, and the preliminary "hits", or chemicals that show the desired activity, are re-screened to verify their activity. Once they are qualified as a "hit" by their repeatability and activity, these particular chemicals are registered and analysed. Chemoproteomics is a field of study that incorporates the use of chemical libraries to identify protein targets. Commonalities among the different chemical groups are studied as they are often reflective of a particular chemical subspace. Additional chemistry work may be needed to further optimize the chemical library in the active portion of the subspace. When needed, more synthesis is completed to extend the chemical library in that particular subspace by generating more compounds that are very similar to the original hits. This new selection of compounds within this narrow range is further screened and then taken on to more sophisticated models for further validation in the hit-to-lead process of drug discovery.
+
+
+== Storage and management ==
+The "chemical space" of all possible organic chemicals is large and increases exponentially with the size of the molecule. Most chemical libraries do not typically have a fully represented chemical space mostly because of storage and cost concerns.
+Because of the expense and effort involved in chemical synthesis, the chemicals must be correctly stored and banked away for later use to prevent early degradation. Each chemical has a particular shelf life and storage requirement and in a good-sized chemical library, there is a timetable by which library chemicals are disposed of and replaced on a regular basis. Some chemicals are fairly unstable, radioactive, volatile or flammable and must be stored under careful conditions in accordance with safety standards such as OSHA.
+Most chemical libraries are managed with information technologies such as barcoding and relational databases. Additionally, robotics are necessary to fetch compounds in larger chemical libraries.
+Because a chemical library's individual entries can easily reach into the millions of compounds, the management of even modest-sized chemical libraries can be a full-time endeavor. Compound management is a field that aims to manage and maintain these chemical libraries while maximizing safety and effectiveness in their management.
+
+
+== See also ==
+Compound management
+Druglikeness
+Lead compound
+Scientific collection
+Chemoproteomics
+
+
+== Further reading ==
+Ian Yates. Compound Management comes of age. Drug Discovery World Spring 2003 p35-43 Archived 2007-09-29 at the Wayback Machine
+Archer JR. History, evolution, and trends in compound management for high-throughput screening. Assay Drug Dev Technol. 2004 Dec;2(6):675-81
+Casey R. Designing Chemical Compound Libraries for Drug Discovery. Business Intelligence Network December 1, 2005.
+GLARE - A free open source software for combinatorial library design.
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Chip_description_file-0.md
+++ b/data/en.wikipedia.org/wiki/Chip_description_file-0.md
@ -0,0 +1,28 @@
+---
+title: "Chip description file"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Chip_description_file"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:08.937288+00:00"
+instance: "kb-cron"
+---
+
+A chip description file (CDF) is a file that contains a possible annotation of a microarray chip type. The file typically specifies which probes map to the same genomic unit of interest. This mapping addresses the problem of reliably reconstructing expression levels: if more than one probe set per gene exists, expression signals for the same transcript may be enhanced incorrectly.
+Genomic units of interest may include expression, genotyping, CustomSeq, copy number and/or tag probe sets. All probe set names within an array are unique. Multiple copies of a probe set may exist on a single array as long as each copy has a unique name. CDFs exist for gene expression (mapping sets of probes to genes) and for genotyping (mapping sets of probes to single-nucleotide polymorphisms and allele types). Different CDFs may exist for a single chip type: one CDF may be used to interrogate gene transcripts and another CDF to interrogate exons.
+Initially, Affymetrix created the specification to describe the layout for an Affymetrix GeneChip array. Affymetrix still provides "official" CDFs for most of their chip types. In addition to these, various groups provide custom CDF files that are optimized for various genomic features. CDFs are typically updated when the genome annotation is updated.
+
+
+== Format ==
+A chip description file is divided into sections. The start of each section is marked by a line containing a section name enclosed in square brackets. The section names are:
+
+CDF
+Chip
+QCI (where I ranges from 1 to the number of QC probe sets)
+UnitJ (where J is an internal index to uniquely distinguish probe sets)
+UnitJ_BlockK (where J and K are internal indices used to distinguish subsets of a probe set)
+The data in each section is of the format TAG=VALUE.
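
As a minimal sketch of how such a sectioned TAG=VALUE layout could be read, the parser below collects each bracketed section into a dictionary. The function name `parse_sections` and the sample text are illustrative assumptions; the tags and values shown are placeholders rather than an excerpt from a real chip description file, and the sketch ignores any binary variants of the format.

```python
def parse_sections(text):
    """Parse '[Section]' headers followed by TAG=VALUE lines into nested dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between sections
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
        elif "=" in line and current is not None:
            # Split on the first '=' only, so values may themselves contain '='.
            tag, value = line.split("=", 1)
            sections[current][tag] = value
    return sections


# Hypothetical sample; the tag names and values are placeholders.
SAMPLE = """\
[CDF]
Version=example

[Chip]
Name=ExampleChip
Rows=2
Cols=2
"""

parsed = parse_sections(SAMPLE)
print(list(parsed), parsed["Chip"]["Rows"])
```

Values are kept as strings here; a caller that knows a tag is numeric would convert it explicitly.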
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Chou–Fasman_method-0.md
+++ b/data/en.wikipedia.org/wiki/Chou–Fasman_method-0.md
@ -0,0 +1,92 @@
+---
+title: "Chou–Fasman method"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Chou–Fasman_method"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:12.580124+00:00"
+instance: "kb-cron"
+---
+
+The Chou–Fasman method is an empirical technique for the prediction of secondary structures in proteins, originally developed in the 1970s by Peter Y. Chou and Gerald D. Fasman. The method is based on analyses of the relative frequencies of each amino acid in alpha helices, beta sheets, and turns in known protein structures solved with X-ray crystallography. From these frequencies a set of probability parameters was derived for the appearance of each amino acid in each secondary structure type, and these parameters are used to predict the probability that a given sequence of amino acids would form a helix, a beta strand, or a turn in a protein. The method is at most about 50–60% accurate in identifying correct secondary structures, which is significantly less accurate than modern machine learning–based techniques.
+
+
+== Amino acid propensities ==
+The original Chou–Fasman parameters found some strong tendencies among individual amino acids to prefer one type of secondary structure over others. Alanine, glutamate, leucine, and methionine were identified as helix formers, while proline and glycine, due to the unique conformational properties of their peptide bonds, commonly end a helix. The original Chou–Fasman parameters were derived from a very small and non-representative sample of protein structures due to the small number of such structures that were known at the time of their original work. These original parameters have since been shown to be unreliable and have been updated from a current dataset, along with modifications to the initial algorithm.
+The Chou–Fasman method takes into account only the probability that each individual amino acid will appear in a helix, strand, or turn. Unlike the more complex GOR method, it does not reflect the conditional probabilities of an amino acid to form a particular secondary structure given that its neighbors already possess that structure. This lack of cooperativity increases its computational efficiency but decreases its accuracy, since the propensities of individual amino acids are often not strong enough to render a definitive prediction.
+
+
+== Algorithm ==
+The Chou–Fasman method predicts helices and strands in a similar fashion, first searching linearly through the sequence for a "nucleation" region of high helix or strand probability and then extending the region until a subsequent four-residue window carries a probability of less than 1. As originally described, four out of any six contiguous amino acids were sufficient to nucleate a helix, and three out of any five contiguous amino acids were sufficient for a sheet. The probability thresholds for helix and strand nucleations are constant but not necessarily equal; originally 1.03 was set as the helix cutoff and 1.00 as the strand cutoff.
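
The nucleation search can be sketched as a window count. This is a simplification under stated assumptions: residues are treated as either "formers" or not, using only the four helix formers named earlier in this article (alanine, glutamate, leucine, methionine), whereas the actual method works with numeric propensity parameters and also performs the extension step, which is omitted here. The function name and the example sequence are hypothetical.

```python
# One-letter codes for the helix formers named in the text: Ala, Glu, Leu, Met.
HELIX_FORMERS = frozenset("AELM")


def nucleation_sites(sequence, formers, window, required):
    """Start indices of windows containing at least `required` former residues."""
    sites = []
    for start in range(len(sequence) - window + 1):
        count = sum(1 for residue in sequence[start:start + window]
                    if residue in formers)
        if count >= required:
            sites.append(start)
    return sites


# Helix rule from the text: four formers out of any six contiguous residues.
print(nucleation_sites("GAELMAGPGS", HELIX_FORMERS, window=6, required=4))
```

The same function would cover the strand rule with `window=5, required=3`, given a set of strand formers, which this article does not list.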
+Turns are also evaluated in four-residue windows, but are calculated using a multi-step procedure because many turn regions contain amino acids that could also appear in helix or sheet regions. Four-residue turns also have their own characteristic amino acids; proline and glycine are both common in turns. A turn is predicted only if the turn probability is greater than the helix or sheet probabilities and a probability value based on the positions of particular amino acids in the turn exceeds a predetermined threshold. The turn probability p(t) is determined as:
+
+$$p(t) = p_{t}(j) \times p_{t}(j+1) \times p_{t}(j+2) \times p_{t}(j+3)$$
+
+where j is the position of the first amino acid in the four-residue window. If p(t) exceeds an arbitrary cutoff value (originally 7.5e–3), the mean of the p(j) values exceeds 1, and p(t) exceeds the alpha helix and beta sheet probabilities for that window, then a turn is predicted. If the first two conditions are met but the probability of a beta sheet p(b) exceeds p(t), then a sheet is predicted instead.
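
The decision rule just described can be written out as a short sketch. It follows the wording of this article literally and makes two assumptions that the text leaves open: the "mean of the p(j) values" is taken to be the mean of the four turn parameters in the window, and the helix and sheet probabilities for the window are supplied by the caller. All numbers in the example are made up for illustration and are not Chou–Fasman parameters; the default cutoff is the value quoted above.

```python
from math import prod


def classify_window(p_t, p_helix, p_sheet, cutoff=7.5e-3):
    """Return 'turn', 'sheet', or None for one four-residue window.

    p_t     -- the four turn parameters p_t(j) ... p_t(j+3)
    p_helix -- helix probability for the window (supplied by the caller)
    p_sheet -- sheet probability p(b) for the window (supplied by the caller)
    """
    if len(p_t) != 4:
        raise ValueError("a turn window has exactly four residues")
    turn_probability = prod(p_t)      # p(t) as the product over the window
    mean_parameter = sum(p_t) / 4     # assumed reading of "mean of the p(j) values"
    if turn_probability > cutoff and mean_parameter > 1:
        if turn_probability > p_helix and turn_probability > p_sheet:
            return "turn"
        if p_sheet > turn_probability:
            return "sheet"
    return None


# Hypothetical parameter values, chosen only to exercise each branch.
print(classify_window([1.5, 1.2, 1.1, 1.3], p_helix=0.9, p_sheet=1.0))
print(classify_window([1.5, 1.2, 1.1, 1.3], p_helix=0.9, p_sheet=3.0))
print(classify_window([0.5, 0.5, 0.5, 0.5], p_helix=0.9, p_sheet=1.0))
```

A window that passes the first two tests but is dominated by the helix probability alone falls through to `None`, since the text does not say what is predicted in that case.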
+
+
+== See also ==
+List of protein structure prediction software
+
+
+== References ==
+
+
+== External links ==
+Gerald D. Fasman on the Internet
--- a/data/en.wikipedia.org/wiki/ClearVolume-0.md
+++ b/data/en.wikipedia.org/wiki/ClearVolume-0.md
@ -0,0 +1,27 @@
+---
+title: "ClearVolume"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/ClearVolume"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:15.064821+00:00"
+instance: "kb-cron"
+---
+
+ClearVolume is an open source real-time live 3D visualization library designed for high-end volumetric light sheet microscopes. ClearVolume enables the live visualization of microscope data, allowing biologists to decide immediately whether a sample is worth imaging. ClearVolume can easily be integrated into existing Java, C/C++, Python, or LabVIEW based microscope software. It has a dedicated interface to MicroManager/OpenSpim/OpenSpin control software. ClearVolume supports multiple channels and live 3D data streaming from remote microscopes, and uses a multi-pass Fibonacci rendering algorithm that can handle large volumes. Moreover, ClearVolume is integrated into the FiJi/ImageJ2/KNIME ecosystem.
+
+
+== See also ==
+FiJi
+KNIME
+Light sheet fluorescence microscopy
+Volume rendering
+
+
+== References ==
+
+
+== External links ==
+[1] Website of the open source ClearVolume project with links to the wiki, code repositories and issue tracking.
+[2] ClearVolume KNIME plugin project page.
+[3] ClearVolume FiJi plugin project page.
--- a/data/en.wikipedia.org/wiki/Computational_biology-0.md
+++ b/data/en.wikipedia.org/wiki/Computational_biology-0.md
@ -0,0 +1,75 @@
+---
+title: "Computational biology"
+chunk: 1/4
+source: "https://en.wikipedia.org/wiki/Computational_biology"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:16.270762+00:00"
+instance: "kb-cron"
+---
+
+Computational biology refers to the use of techniques in computer science, data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and data science, the field also has foundations in applied mathematics, molecular biology, cell biology, chemistry, and genetics.
+
+== History ==
+Bioinformatics, the analysis of informatics processes in biological systems, began in the early 1970s. At this time, research in artificial intelligence was using network models of the human brain in order to generate new algorithms. This use of biological data pushed biological researchers to use computers to evaluate and compare large data sets in their own field.
+By 1982, researchers shared information via punch cards. The amount of data grew exponentially by the end of the 1980s, requiring new computational methods for quickly interpreting relevant information.
+Perhaps the best-known example of computational biology, the Human Genome Project, officially began in 1990. By 2003, the project had mapped around 85% of the human genome, satisfying its initial goals. Work continued, however, and by 2021 the level of "a complete genome" was reached, with only 0.3% of bases still covered by potential issues. The missing Y chromosome was added in January 2022.
+Since the late 1990s, computational biology has become an important part of biology, leading to numerous subfields. Today, the International Society for Computational Biology recognizes 21 different 'Communities of Special Interest', each representing a slice of the larger field. In addition to helping sequence the human genome, computational biology has helped create accurate models of the human brain, map the 3D structure of genomes, and model biological systems. Much of the original progress in computational biology emerged from the United States and Western Europe, due to their large computational infrastructures. Recent decades, however, have seen growing contributions from less-wealthy nations. For example, Colombia has had an international computational biology effort since 1998, focusing on genomics and disease in nationally important crops like coffee and potatoes. Poland, similarly, has recently been a leader in biomolecular simulations and macromolecular sequence analysis.
+
+== Applications ==
+
+=== Anatomy ===
+
+Computational anatomy is the study of anatomical shape and form at the visible or gross anatomical $50-100\mu$ scale of morphology. It involves the development of computational mathematical and data-analytical methods for modeling and simulating biological structures. It focuses on the anatomical structures being imaged, rather than the medical imaging devices. Due to the availability of dense 3D measurements via technologies such as magnetic resonance imaging, computational anatomy has emerged as a subfield of medical imaging and bioengineering for extracting anatomical coordinate systems at the morphome scale in 3D.
+The original formulation of computational anatomy is as a generative model of shape and form from exemplars acted upon via transformations. The diffeomorphism group is used to study different coordinate systems via coordinate transformations as generated via the Lagrangian and Eulerian velocities of flow from one anatomical configuration in $\mathbb{R}^{3}$ to another. It is related to shape statistics and morphometrics, with the distinction that diffeomorphisms are used to map coordinate systems, whose study is known as diffeomorphometry.
+
+=== Data and modeling ===
+
+Mathematical biology is the use of mathematical models of living organisms to examine the systems that govern structure, development, and behavior in biological systems. This entails a more theoretical approach to problems, rather than its more empirically minded counterpart of experimental biology. Mathematical biology draws on discrete mathematics, topology (also useful for computational modeling), Bayesian statistics, linear algebra and Boolean algebra.
+These mathematical approaches have enabled the creation of databases and other methods for storing, retrieving, and analyzing biological data, a field known as bioinformatics. Usually, this process involves genetics and analyzing genes.
+Gathering and analyzing large datasets have made room for growing research fields such as data mining, and computational biomodeling, which refers to building computer models and visual simulations of biological systems. This allows researchers to predict how such systems will react to different environments, which is useful for determining if a system can "maintain their state and functions against external and internal perturbations". While current techniques focus on small biological systems, researchers are working on approaches that will allow for larger networks to be analyzed and modeled. A majority of researchers believe this will be essential in developing modern medical approaches to creating new drugs and gene therapy. A useful modeling approach is to use Petri nets via tools such as esyN.
+Until recent decades theoretical ecology has largely dealt with analytic models that were detached from the statistical models used by empirical ecologists. More recently, computational methods have aided in developing theories via simulation of ecological systems, in addition to increasing application of methods from computational statistics in ecological analyses.
+
+=== Systems biology ===
+
+Systems biology consists of computing the interactions between various biological systems ranging from the cellular level to entire populations with the goal of discovering emergent properties. This process usually involves networking cell signaling and metabolic pathways. Systems biology often uses computational techniques from biological modeling and graph theory to study these complex interactions at cellular levels.
+
+=== Evolutionary biology ===
+
+Computational biology has assisted evolutionary biology by:
+
+Using DNA data to reconstruct the tree of life with computational phylogenetics
+Fitting population genetics models (either forward time or backward time) to DNA data to make inferences about demographic or selective history
+Building population genetics models of evolutionary systems from first principles in order to predict what is likely to evolve
+
+=== Genomics ===
--- a/data/en.wikipedia.org/wiki/Computational_biology-1.md
+++ b/data/en.wikipedia.org/wiki/Computational_biology-1.md
@ -0,0 +1,40 @@
+---
+title: "Computational biology"
+chunk: 2/4
+source: "https://en.wikipedia.org/wiki/Computational_biology"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:16.270762+00:00"
+instance: "kb-cron"
+---
+
+Computational genomics is the study of the genomes of cells and organisms. The Human Genome Project is one example of computational genomics. This project looks to sequence the entire human genome into a set of data. Once fully implemented, this could allow for doctors to analyze the genome of an individual patient.  This opens the possibility of personalized medicine, prescribing treatments based on an individual's pre-existing genetic patterns. Researchers are looking to sequence the genomes of animals, plants, bacteria, and all other types of life.
+One of the main ways that genomes are compared is by sequence homology. Homology is the study of biological structures and nucleotide sequences in different organisms that come from a common ancestor. Research suggests that between 80 and 90% of genes in newly sequenced prokaryotic genomes can be identified this way.
+Sequence alignment is another process for comparing and detecting similarities between biological sequences or genes. Sequence alignment is useful in a number of bioinformatics applications, such as computing the longest common subsequence of two genes or comparing variants of certain diseases.
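+The longest-common-subsequence computation mentioned above can be illustrated with a short dynamic-programming sketch. The function name `lcs_length` and the example strings are illustrative choices, not taken from any particular bioinformatics library, and practical alignment tools use scoring schemes with gap penalties rather than plain LCS.
+
```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] is the LCS length of the already-processed prefix of `a` and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # LCS with an empty prefix of `b` is always 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```
+
+For instance, `lcs_length("AGGTAB", "GXTXAYB")` evaluates to 4, corresponding to the subsequence GTAB.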
+A largely unexplored area of computational genomics is the analysis of intergenic regions, which comprise roughly 97% of the human genome. Researchers are working to understand the functions of non-coding regions of the human genome through the development of computational and statistical methods and via large consortia projects such as ENCODE and the Roadmap Epigenomics Project.
+Understanding how individual genes contribute to the biology of an organism at the molecular, cellular, and organism levels is known as gene ontology. The Gene Ontology Consortium's mission is to develop an up-to-date, comprehensive, computational model of biological systems, from the molecular level to larger pathways, cellular, and organism-level systems. The Gene Ontology resource provides a computational representation of current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria.
+3D genomics is a subsection of computational biology that focuses on the organization and interaction of genes within a eukaryotic cell. One method used to gather 3D genomic data is Genome Architecture Mapping (GAM). GAM measures 3D distances of chromatin and DNA in the genome by combining cryosectioning, the process of cutting a strip from the nucleus to examine the DNA, with laser microdissection. A nuclear profile is simply this strip or slice that is taken from the nucleus. Each nuclear profile contains genomic windows, which are certain sequences of nucleotides, the base units of DNA. GAM captures a genome network of complex, multi-enhancer chromatin contacts throughout a cell.
+
+=== Biomarker discovery ===
+Computational biology also plays a role in identifying biomarkers for diseases such as cardiovascular conditions. By integrating various 'omic' data, such as genomics, proteomics, and metabolomics, researchers can uncover potential biomarkers that aid in disease diagnosis, prognosis, and treatment strategies. For instance, metabolomic analyses have identified specific metabolites capable of distinguishing between coronary artery disease and myocardial infarction.
+
+=== Neuroscience ===
+
+Computational neuroscience is the study of brain function in terms of the information processing properties of the nervous system. A subset of neuroscience, it looks to model the brain to examine specific aspects of the neurological system. Models of the brain include:
+
+Realistic Brain Models: These models look to represent every aspect of the brain, including as much detail at the cellular level as possible. Realistic models provide the most information about the brain, but also have the largest margin for error. More variables in a brain model create the possibility for more error to occur. These models do not account for parts of the cellular structure that scientists do not know about. Realistic brain models are the most computationally heavy and the most expensive to implement.
+Simplifying Brain Models:  These models look to limit the scope of a model in order to assess a specific physical property of the neurological system. This allows for the intensive computational problems to be solved, and reduces the amount of potential error from a realistic brain model.
+It is the work of computational neuroscientists to improve the algorithms and data structures currently used to increase the speed of such calculations.
+Computational neuropsychiatry is an emerging field that uses mathematical and computer-assisted modeling of brain mechanisms involved in mental disorders. Several initiatives have demonstrated that computational modeling is an important contribution to understand neuronal circuits that could generate mental functions and dysfunctions.
+
+=== Oncology ===
+
+Computational biology plays a crucial role in discovering signs of new, previously unknown living creatures and in cancer research. This field involves large-scale measurements of cellular processes, including RNA, DNA, and proteins, which pose significant computational challenges. To overcome these, biologists rely on computational tools to accurately measure and analyze biological data. In cancer research, computational biology aids in the complex analysis of tumor samples, helping researchers develop new ways to characterize tumors and understand various cellular properties. The use of high-throughput measurements, involving millions of data points from DNA, RNA, and other biological structures, helps in diagnosing cancer at early stages and in understanding the key factors that contribute to cancer development. Areas of focus include analyzing molecules that are deterministic in causing cancer and understanding how the human genome relates to tumor causation.
+
+=== Toxicology ===
+
+Computational toxicology is a multidisciplinary area of study, which is employed in the early stages of drug discovery and development to predict the safety and potential toxicity of drug candidates.
+
+=== Pharmacology ===
+
+Computational pharmacology is "the study of the effects of genomic data to find links between specific genotypes and diseases and then screening drug data".
--- a/data/en.wikipedia.org/wiki/Computational_biology-2.md
+++ b/data/en.wikipedia.org/wiki/Computational_biology-2.md
@ -0,0 +1,51 @@
+---
+title: "Computational biology"
+chunk: 3/4
+source: "https://en.wikipedia.org/wiki/Computational_biology"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:16.270762+00:00"
+instance: "kb-cron"
+---
+
+=== Drug discovery ===
+A growing application of computational biology is drug discovery. For example, simulations of intracellular and intercellular signaling events, using data from proteomic or metabolomic experiments, may reduce dependence on experimentation in elucidating pharmacokinetics and pharmacodynamics of drug candidates in living organisms.
+Increasingly, artificial intelligence plays a central role in the drug discovery process.  Using chemical structures of known pharmaceutical agents as inputs, AI models can suggest structures of lead compounds or predict novel modes of drug-protein binding.  AI is also used for virtual screening of candidate molecules, avoiding the need to synthesize large numbers of molecules for screening.
+
+== Techniques ==
+Computational biologists use a wide range of software and algorithms to carry out their research.
+
+=== Unsupervised learning ===
+Unsupervised learning is a type of algorithm that finds patterns in unlabeled data. One example is k-means clustering, which aims to partition n data points into k clusters, in which each data point belongs to the cluster with the nearest mean. A variant is the k-medoids algorithm, which chooses an actual data point from the set as each cluster center (the medoid) rather than the average of the cluster.
+
+The k-medoids algorithm follows these steps:
+
+Randomly select k distinct data points. These are the initial cluster centers (medoids).
+Measure the distance between each point and each of the k medoids.
+Assign each point to the nearest cluster.
+Find the center of each cluster (medoid).
+Repeat the previous three steps until the clusters no longer change.
+Assess the quality of the clustering by adding up the variation within each cluster.
+Repeat the process with different values of k.
+Pick the best value for k by finding the "elbow" in the plot of within-cluster variation against k, the point beyond which increasing k yields only small reductions in variation.
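+The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `k_medoids` and `total_variation`, the `seed` and `max_iter` parameters, and the choice to precompute all pairwise distances are illustrative, and the simple update rule shown here can settle in a local optimum.
+
```python
import random


def k_medoids(points, k, distance, seed=0, max_iter=100):
    """Cluster `points` around k medoids; returns (medoid_indices, labels)."""
    rng = random.Random(seed)
    n = len(points)
    # Precompute all pairwise distances once (fine for small inputs).
    dist = [[distance(points[i], points[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)  # randomly select k distinct data points
    labels = []
    for _ in range(max_iter):
        # Measure distances and assign each point to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # The new medoid of a cluster is the member with the smallest
        # total distance to all members of that cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster (duplicate points): keep old medoid
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][o] for o in members)))
        if new_medoids == medoids:  # clusters no longer change
            break
        medoids = new_medoids
    return medoids, labels


def total_variation(points, medoids, labels, distance):
    """Sum of distances from each point to the medoid of its cluster."""
    return sum(distance(p, points[medoids[c]]) for p, c in zip(points, labels))
```
+
+Running `k_medoids` for several values of k and plotting `total_variation` against k gives the curve in which the "elbow" is sought.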
+One example of this in biology is the 3D mapping of a genome. Information on the HIST1 region of mouse chromosome 13 is gathered from the Gene Expression Omnibus. This information contains data on which nuclear profiles show up in certain genomic regions. With this information, the Jaccard distance can be used to find a normalized distance between all the loci.
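+As a rough illustration of the Jaccard distance between loci, each locus can be represented by the set of nuclear profiles in which it is detected. The function name and the toy profile identifiers below are hypothetical and are not taken from the data set described above.
+
```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical example: nuclear profiles (by ID) in which each locus appears.
locus_1 = {"NP1", "NP2", "NP3"}
locus_2 = {"NP2", "NP3", "NP4"}
distance_1_2 = jaccard_distance(locus_1, locus_2)  # 2 shared out of 4 in total
```
+
+The resulting value lies between 0 (identical profile sets) and 1 (no shared profiles), which makes it usable as the distance function of a clustering algorithm such as k-medoids.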
+
+=== Graph analytics ===
+Graph analytics, or network analysis, is the study of graphs that represent connections between different objects. Graphs can represent many kinds of networks in biology, such as protein-protein interaction networks, regulatory networks, and metabolic and biochemical networks. There are many ways to analyze these networks, one of which is to examine centrality. Centrality measures rank nodes by their importance or connectedness in the graph, which can be useful for finding the most important nodes. For example, given data on the activity of genes over a time period, degree centrality can be used to see which genes are most active throughout the network, or which genes interact with others the most. This contributes to the understanding of the roles certain genes play in the network.
+There are many ways to calculate centrality in graphs, each of which can give a different kind of information. In biology, centrality analysis can be applied to gene regulatory, protein interaction, and metabolic networks, among others.
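+Degree centrality, the simplest of these measures, can be sketched as follows for an undirected interaction network. The function name and the gene labels are hypothetical, and normalizing by the maximum possible degree n - 1 is one common convention assumed here.
+
```python
from collections import defaultdict


def degree_centrality(edges):
    """Map each node to (number of distinct neighbors) / (n - 1)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Hypothetical gene interaction network.
interactions = [("geneA", "geneB"), ("geneA", "geneC"),
                ("geneA", "geneD"), ("geneB", "geneC")]
centrality = degree_centrality(interactions)
most_connected = max(centrality, key=centrality.get)
```
+
+In this toy network `most_connected` is `"geneA"`, which interacts with every other gene.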
+
+=== Supervised learning ===
+Supervised learning is a type of algorithm that learns from labeled data and learns how to assign labels to future data that is unlabeled. In biology supervised learning can be helpful when we have data that we know how to categorize and we would like to categorize more data into those categories.
+
+A common supervised learning algorithm is the random forest, which uses numerous decision trees to train a model to classify a dataset. Forming the basis of the random forest, a decision tree is a structure which aims to classify, or label, some set of data using certain known features of that data. A practical biological example of this would be taking an individual's genetic data and predicting whether or not that individual is predisposed to develop a certain disease or cancer. At each internal node the algorithm checks the dataset for exactly one feature, a specific gene in the previous example, and then branches left or right based on the result. Then at each leaf node, the decision tree assigns a class label to the dataset. So in practice, the algorithm walks a specific root-to-leaf path based on the input dataset through the decision tree, which results in the classification of that dataset. Commonly, decision trees have target variables that take on discrete values, like yes/no, in which case it is referred to as a classification tree, but if the target variable is continuous then it is called a regression tree. To construct a decision tree, it must first be trained using a training set to identify which features are the best predictors of the target variable.
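+The root-to-leaf walk described above can be made concrete with a small hand-built tree. This sketch only shows classification with an already constructed tree; it does not cover training or the random forest ensemble. The class names, the threshold-based test at each node, the feature names `gene_a` and `gene_b`, and the labels are all hypothetical.
+
```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class label assigned at this leaf


@dataclass
class Node:
    feature: str                  # the single feature tested at this node
    threshold: float
    left: Union["Node", Leaf]     # branch taken when value <= threshold
    right: Union["Node", Leaf]    # branch taken when value > threshold


def classify(tree, sample):
    """Walk from the root to a leaf and return that leaf's label."""
    node = tree
    while isinstance(node, Node):
        value = sample[node.feature]
        node = node.left if value <= node.threshold else node.right
    return node.label


# Hypothetical tree over two expression-level features.
tree = Node("gene_a", 0.5,
            Leaf("low risk"),
            Node("gene_b", 2.0, Leaf("low risk"), Leaf("high risk")))
prediction = classify(tree, {"gene_a": 0.9, "gene_b": 3.1})
```
+
+Here the sample exceeds both thresholds, so the walk goes right twice and `prediction` is `"high risk"`.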
+
+=== Using fast matrix multiplication ===
+Some work attempts to utilize fast matrix multiplication algorithms in computational biology.
+
+=== Open source software ===
+Open source software provides a platform for computational biology where everyone can access and benefit from software developed in research. PLOS cites four main reasons for the use of open source software:
+
+Reproducibility: This allows for researchers to use the exact methods used to calculate the relations between biological data.
+Faster development: developers and researchers do not have to reinvent existing code for minor tasks. Instead they can use pre-existing programs to save time on the development and implementation of larger projects.
+Increased quality: Having input from multiple researchers studying the same topic provides a layer of assurance that errors will not be in the code.
+Long-term availability: Open source programs are not tied to any businesses or patents. This allows for them to be posted to multiple web pages and ensure that they are available in the future.
--- a/data/en.wikipedia.org/wiki/Computational_biology-3.md
+++ b/data/en.wikipedia.org/wiki/Computational_biology-3.md
@ -0,0 +1,26 @@
+---
+title: "Computational biology"
+chunk: 4/4
+source: "https://en.wikipedia.org/wiki/Computational_biology"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:16.270762+00:00"
+instance: "kb-cron"
+---
+
+== Research ==
+There are several large conferences that are concerned with computational biology. Some notable examples are Intelligent Systems for Molecular Biology, European Conference on Computational Biology and Research in Computational Molecular Biology.
+There are also numerous journals dedicated to computational biology. Some notable examples include Journal of Computational Biology and PLOS Computational Biology, a peer-reviewed open access journal that has many notable research projects in the field of computational biology. They provide reviews on software, tutorials for open source software, and display information on upcoming computational biology conferences. Other journals relevant to this field include Bioinformatics, Computers in Biology and Medicine, BMC Bioinformatics, Nature Methods, Nature Communications, Scientific Reports, PLOS One, etc.
+
+== Related fields ==
+Computational biology, bioinformatics and mathematical biology are all interdisciplinary approaches to the life sciences that draw from quantitative disciplines such as mathematics and information science. The NIH describes computational/mathematical biology as the use of computational/mathematical approaches to address theoretical and experimental questions in biology and, by contrast, bioinformatics as the application of information science to understand complex life-sciences data.
+Specifically, the NIH defines
+
+Computational biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
+Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
+While each field is distinct, there may be significant overlap at their interface, so much so that to many, bioinformatics and computational biology are terms that are used interchangeably.
+The terms computational biology and evolutionary computation appear similar but are not identical. Evolutionary computation is a field of computer science comprising algorithms inspired by evolution in biology. Algorithms from within the field of evolutionary computation can be applied to computational biology.
+
+== See also ==
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Computational_epigenetics-0.md
+++ b/data/en.wikipedia.org/wiki/Computational_epigenetics-0.md
@ -0,0 +1,58 @@
+---
+title: "Computational epigenetics"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Computational_epigenetics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:17.512094+00:00"
+instance: "kb-cron"
+---
+
+Computational epigenetics uses statistical methods and mathematical modelling in epigenetic research. Due to the recent explosion of epigenome datasets, computational methods play an increasing role in all areas of epigenetic research.
+Research in computational epigenetics comprises the development and application of bioinformatics methods for solving epigenetic questions, as well as computational data analysis and theoretical modeling in the context of epigenetics. This includes modelling of the effects of histone and DNA CpG island methylation.
+
+
+== Current research areas ==
+
+
+=== Importance ===
+Computational methods and next-generation sequencing (NGS) technologies are being employed to study DNA methylation and histone modifications, which are essential in cancer research. High-throughput sequencing offers valuable insights into epigenetic changes, and the growing volume of these datasets drives the continuous development of bioinformatics techniques for their effective management and analysis.
+There is a need for data integration tools that can merge various types of epigenetic modifications and -omics data (including transcriptomics, genomics, epigenomics, and proteomics) to gain a comprehensive understanding of biological processes. This requires the standardization, annotation, and harmonization of epigenetic data, along with the enhancement of computational and machine learning approaches.
+Understanding the functional implications of epigenetics in diseases can be greatly advanced by using epigenetic editing tools, such as CRISPR-dCas9 technology. These tools enable precise modifications of epigenetic marks at specific loci, allowing researchers to assess the effects of these alterations in cellular and animal models, thus complementing insights obtained from computational analyses.
+
+
+=== Data processing and analysis ===
+
+Various experimental techniques have been developed for genome-wide mapping of epigenetic information, the most widely used being ChIP-on-chip, ChIP-seq and bisulfite sequencing. All of these methods generate large amounts of data and require efficient ways of data processing and quality control by bioinformatic methods.
+
+
+=== Predictions ===
+A substantial amount of bioinformatic research has been devoted to the prediction of epigenetic information from characteristics of the genome sequence. Such predictions serve a dual purpose. First, accurate epigenome predictions can substitute for experimental data, to some degree, which is particularly relevant for newly discovered epigenetic mechanisms and for species other than human and mouse. Second, prediction algorithms build statistical models of epigenetic information from training data and can therefore act as a first step toward quantitative modeling of an epigenetic mechanism. Successful computational prediction of DNA and lysine methylation and acetylation has been achieved by combinations of various features.
+
+
+=== Applications in cancer epigenetics ===
+The important role of epigenetic defects for cancer opens up new opportunities for improved diagnosis and therapy. These active areas of research give rise to two questions that are particularly amenable to bioinformatic analysis. First, given a list of genomic regions exhibiting epigenetic differences between tumor cells and controls (or between different disease subtypes), can we detect common patterns or find evidence of a functional relationship of these regions to cancer? Second, can we use bioinformatic methods in order to improve diagnosis and therapy by detecting and classifying important disease subtypes?
+
+
+== Emerging topics ==
+The first wave of research in the field of computational epigenetics was driven by rapid progress of experimental methods for data generation, which required adequate computational methods for data processing and quality control, prompted epigenome prediction studies as a means of understanding the genomic distribution of epigenetic information, and provided the foundation for initial projects on cancer epigenetics. While these topics will continue to be major areas of research and the mere quantity of epigenetic data arising from epigenome projects poses a significant bioinformatic challenge, several additional topics are currently emerging.
+
+Epigenetic regulatory circuitry: Reverse engineering the regulatory networks that read, write and execute epigenetic codes.
+Population epigenetics: Distilling regulatory mechanisms from the integration of epigenome data with gene expression profiles and haplotype maps for a large sample from a heterogeneous population.
+Evolutionary epigenetics: Learning about epigenome regulation in human (and its medical consequences) by cross-species comparisons.
+Theoretical modeling: Testing our mechanistic and quantitative understanding of epigenetic mechanisms by in silico simulation.
+Genome browsers: Developing a new blend of web services that enable biologists to perform sophisticated genome and epigenome analysis within an easy-to-use genome browser environment.
+Medical epigenetics: Searching for epigenetic mechanisms that play a role in diseases other than cancer, as there is strong circumstantial evidence for epigenetic regulation being involved in mental disorders, autoimmune diseases and other complex diseases. 
+
+
+== Data portals and projects ==
+
+
+== Databases ==
+
+
+== Sources and further reading ==
+The original version of this article was based on a review paper on computational epigenetics that appeared in the January 2008 issue of the Bioinformatics journal: Bock C, Lengauer T (January 2008). "Computational epigenetics". Bioinformatics. 24 (1): 1–10. doi:10.1093/bioinformatics/btm546. PMID 18024971. This review paper provides >100 references to scientific papers and extensive background information.
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Computational_genomics-0.md
+++ b/data/en.wikipedia.org/wiki/Computational_genomics-0.md
@ -0,0 +1,53 @@
+---
+title: "Computational genomics"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Computational_genomics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:18.727709+00:00"
+instance: "kb-cron"
+---
+
+Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence, such as genomic DNA microarrays). Because it combines these with computational and statistical approaches to understanding the function of genes and with statistical association analysis, the field is also often referred to as computational and statistical genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes (rather than individual genes) to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.
+
+== History ==
+The roots of computational genomics are shared with those of bioinformatics. During the 1960s, Margaret Dayhoff and others at the National Biomedical Research Foundation assembled databases of homologous protein sequences for evolutionary study.  Their research developed a phylogenetic tree that determined the evolutionary changes that were required for a particular protein to change into another protein based on the underlying amino acid sequences.  This led them to create a scoring matrix that assessed the likelihood of one protein being related to another.
+Beginning in the 1980s, databases of genome sequences began to be recorded, but this presented new challenges in the form of searching and comparing the databases of gene information.  Unlike text-searching algorithms that are used on websites such as Google or Wikipedia, searching for sections of genetic similarity requires one to find strings that are not simply identical, but similar.  This led to the development of the Needleman-Wunsch algorithm, which is a dynamic programming algorithm for comparing sets of amino acid sequences with each other by using scoring matrices derived from the earlier research by Dayhoff.  Later, the BLAST algorithm was developed for performing fast, optimized searches of gene sequence databases.  BLAST and its derivatives are probably the most widely used algorithms for this purpose.
+The emergence of the phrase "computational genomics" coincides with the availability of complete sequenced genomes in the mid-to-late 1990s. The first meeting of the Annual Conference on Computational Genomics was organized by scientists from The Institute for Genomic Research (TIGR) in 1998, providing a forum for this speciality and effectively distinguishing this area of science from the more general fields of Genomics or Computational Biology.  The first use of this term in scientific literature, according to MEDLINE abstracts, was just one year earlier in Nucleic Acids Research.  The final Computational Genomics conference was held in 2006, featuring a keynote talk by Nobel Laureate Barry Marshall, co-discoverer of the link between Helicobacter pylori and stomach ulcers.  As of 2014, the leading conferences in the field include Intelligent Systems for Molecular Biology (ISMB) and Research in Computational Molecular Biology (RECOMB).
+The development of computer-assisted mathematics (using products such as Mathematica or Matlab) has helped engineers, mathematicians and computer scientists to start operating in this domain, and a public collection of case studies and demonstrations is growing, ranging from whole genome comparisons to gene expression analysis. This has increased the introduction of different ideas, including concepts from systems and control, information theory, string analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, while students fluent in both topics are being trained in the many courses created in the past few years.
+
+== Contributions of computational genomics research to biology ==
+Contributions of computational genomics research to biology include:
+
+proposing cellular signalling networks
+proposing mechanisms of genome evolution
+predicting precise locations of all human genes using comparative genomics techniques with several mammalian and vertebrate species
+predicting conserved genomic regions that are related to early embryonic development
+discovering potential links between repeated sequence motifs and tissue-specific gene expression
+measuring regions of genomes that have undergone unusually rapid evolution
+
+== Genome comparison ==
+Computational tools have been developed to assess the similarity of genomic sequences. Some of them are alignment-based distances such as Average Nucleotide Identity. These methods are highly specific, while being computationally slow. 
+Other, alignment-free methods include statistical and probabilistic approaches. One example is Mash, a probabilistic approach using minhash. In this method, given a number k, a genomic sequence is transformed into a shorter sketch through a random hash function on the possible k-mers. For example, if $k=2$, sketches of size 4 are being constructed and given the following hash function 
+
+the sketch of the sequence
+
+CTGACCTTAACGGGAGACTATGATGACGACCGCAT
+is {0,1,1,2} which are the smallest hash values of its k-mers of size 2. These sketches are then compared to estimate the fraction of shared k-mers (Jaccard index) of the corresponding sequences. 
+It is worth noticing that a hash value is a binary number. In a real genomic setting a useful size of k-mers ranges from 14 to 21, and the size of the sketches would be around 1000.
+By reducing the size of the sequences, even by a factor of hundreds, and comparing them in an alignment-free way, this method significantly reduces the time needed to estimate the similarity of sequences.
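+The sketching idea can be illustrated with a minimal bottom-s MinHash in Python. This is a sketch under stated assumptions and not Mash's actual implementation: SHA-1 truncated to 64 bits stands in for the hash function, reverse-complement (canonical) k-mers are ignored, and the function names are illustrative.
+
```python
import hashlib


def kmers(seq, k):
    """Set of all distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    # Assumption: SHA-1 truncated to 64 bits as a stand-in hash function.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, size):
    """The `size` smallest distinct hash values of the k-mers of seq."""
    return sorted({kmer_hash(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:size]  # smallest hashes of the combined sketches
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)
```
+
+When the sketch size is at least the number of distinct k-mers, the estimate coincides with the exact Jaccard index (barring hash collisions); smaller sketches trade accuracy for speed and memory.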
+
+== Clusterization of genomic data ==
+Clustering data is a tool used to simplify statistical analysis of a genomic sample. For example, the tool BiG-SCAPE was developed to analyze sequence similarity networks of biosynthetic gene clusters (BGCs). Successive layers of clustering of biosynthetic gene clusters are used in the automated tool BiG-MAP, both to filter redundant data and to identify gene cluster families. This tool profiles the abundance and expression levels of BGCs in microbiome samples.
--- a/data/en.wikipedia.org/wiki/Computational_genomics-1.md
+++ b/data/en.wikipedia.org/wiki/Computational_genomics-1.md
@ -0,0 +1,35 @@
+---
+title: "Computational genomics"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Computational_genomics"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:18.727709+00:00"
+instance: "kb-cron"
+---
+
+== Biosynthetic gene clusters ==
+Bioinformatic tools have been developed to predict biosynthetic gene clusters, and to determine their abundance and expression, in microbiome samples from metagenomic data. Since the size of metagenomic data is considerable, filtering and clustering are important parts of these tools. These processes can consist of dimensionality-reduction techniques, such as MinHash, and clustering algorithms such as k-medoids and affinity propagation. Several metrics and similarity measures have also been developed to compare gene clusters.
+Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs).
+BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine) is a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion.
+Satria et al. (2021) used BiG-SLiCE to demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. This opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.
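+To illustrate why a Euclidean representation allows non-pairwise grouping, the following is a generic one-pass "leader" clustering sketch in Python. It is a stand-in chosen for illustration, not BiG-SLiCE's actual algorithm: the function name `leader_clustering`, the `radius` parameter and the 2-D feature vectors are all hypothetical.
+
```python
import math


def leader_clustering(vectors, radius):
    """One-pass clustering in Euclidean space.

    Each vector joins the first cluster whose centroid lies within
    `radius`; otherwise it starts a new cluster. Each vector is compared
    with centroids only, never with every other vector.
    """
    centroids = []  # running mean of each cluster
    sizes = []      # number of members per cluster
    labels = []
    for vec in vectors:
        for idx, centre in enumerate(centroids):
            if math.dist(vec, centre) <= radius:
                sizes[idx] += 1
                # Incremental update of the running mean.
                centroids[idx] = [
                    c + (v - c) / sizes[idx] for c, v in zip(centre, vec)
                ]
                labels.append(idx)
                break
        else:
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids


# Hypothetical 2-D feature vectors standing in for BGC representations.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
labels, centroids = leader_clustering(features, radius=1.0)
print(labels)
```
+
+The cost of this sketch grows with the number of vectors times the number of clusters, rather than with the number of all vector pairs, which is the property the paragraph above refers to.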
+
+== Compression algorithms ==
+
+== See also ==
+Bioinformatics
+Computational biology
+Earth BioGenome Project
+Genomics
+Microarray
+BLAST
+Computational epigenetics
+Nvidia Parabricks - suite of free software for genome analysis developed by Nvidia
+List of Metagenomics software
+List of genomic re-sequencing data compression tools
+
+== References ==
+
+== External links ==
+Harvard Extension School Biophysics 101, Genomics and Computational Biology, http://www.courses.fas.harvard.edu/~bphys101/info/syllabus.html
+University of Bristol course in Computational Genomics, http://www.computational-genomics.net/
--- a/data/en.wikipedia.org/wiki/Computational_immunology-0.md
+++ b/data/en.wikipedia.org/wiki/Computational_immunology-0.md
@ -0,0 +1,69 @@
+---
+title: "Computational immunology"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Computational_immunology"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:19.916697+00:00"
+instance: "kb-cron"
+---
+
+Computational immunology is a field within immunology that applies computational methods to analyze immune-related data and model immune system processes. Because the immune system consists of highly interconnected cells, molecules, and signaling networks, many of its mechanisms are complex and difficult to study using experimental approaches alone. Computational immunology aims to represent these immunological processes as computational problems that can be examined using algorithms, statistical models, and data-driven techniques.
+
+
+== Introduction ==
+The immune system, which protects the body against infections, harmful pathogens, and other foreign substances, is a complex biological system, and its study is considered one of the more challenging areas in biology and medicine. Immunology research seeks to understand the mechanisms underlying immune responses and to support the development of vaccines and therapies for a wide range of diseases. However, many aspects of the immune system remain difficult to investigate, as interactions between immune cells, molecules, and signaling pathways are highly dynamic and not yet fully understood.
+At the same time, advances in high-throughput experimental and 'omics' technologies have led to a substantial increase in the volume and complexity of immunological data. For example, sequencing of human and model organism genomes has generated large datasets relevant to immunology, while functional and clinical data have been extensively reported in the scientific literature and recorded in clinical settings. To address these challenges and opportunities, computational approaches have been used to organize and analyze these large-scale datasets, contributing to the emergence of computational immunology. The outputs of these analyses can provide new insights into immune system function, disease mechanisms, and the development of vaccines and therapies.
+
+
+== History ==
+Earlier contributions to computational immunology can be traced back to the late 19th century, when Pyotr Dimitrievich En'ko introduced a probabilistic framework to describe the spread of infectious diseases, including measles, representing one of the first stochastic approaches to epidemic modeling. Later, in the early twentieth century, the field advanced through early applications of mathematical modeling to infectious diseases such as malaria, with a focus on understanding patterns of disease transmission.
+During the twentieth century, the use of mathematics, statistics, and computational methods in studies of the immune system and disease processes increased progressively. These developments contributed to the emergence of computational immunology as an interdisciplinary field applying computational approaches to immunological questions.
+
+
+== Immunological database ==
+The rapid growth in the volume and complexity of immunological data has led to the development of specialized databases for data storage, organization, and analysis. These data are highly diverse and are typically organized into databases designed to support different areas of research.
+A wide range of immunological databases have been developed to store and curate such data, supporting the growth of Computational Immunology by enabling the analysis and use of these resources to generate new knowledge. The table below provides selected examples of widely used and well-established immunological databases.
+
+
+== Tools ==
+In computational immunology, a wide range of tools has been developed to support the analysis, modeling, and interpretation of immunological data. These tools, often used in combination with specialized databases, enable the study of complex immune system processes and facilitate advanced simulations of immune responses. Representative examples of such tools are listed in the table below.
+
+
+== Applications ==
+
+
+=== Allergies ===
+Allergies represent an important area of immunology, with significant variability in immune responses among individuals, including those with similar genetic backgrounds. The assessment of protein allergenicity typically involves three main aspects: immunogenicity, cross-reactivity, and clinical manifestations. Immunogenicity is primarily associated with the activation of immunoglobulin E (IgE)-producing B cells and T cells in response to specific allergens. Consequently, many studies focus on identifying B-cell and T-cell epitopes, as well as the structural features of allergens that influence their allergenic potential.
+Computational immunology approaches have been increasingly applied to predict protein allergenicity and to support the evaluation of novel proteins, particularly in food and biotechnology applications. These methods integrate immunological databases with predictive tools to identify potential allergens and assess cross-reactivity.
+
+
+=== Computational Modeling of the Immune System ===
+Computational modeling has played a significant role in improving our understanding of diseases and the dynamics of immune responses. Given the complexity of the immune system, mathematical frameworks, algorithms, and computational techniques can be used to represent it in simplified, interpretable models that facilitate deeper analysis and prediction.
+For example, mathematical models have been used to study within-host dynamics and immune selection in malaria, providing insights into how pathogens evade immune responses and establish persistent infections.
+Another important application of computational modeling in immunoinformatics is the analysis of antigen processing and presentation pathways. In this process, pathogen-derived proteins are degraded into smaller peptide fragments, known as epitopes, which are then transported into the endoplasmic reticulum by proteins such as TAP. Within this compartment, peptides bind to MHC molecules and are subsequently presented on the cell surface for recognition by T cells. This multistep biological pathway can be represented using computational models that incorporate key processes such as proteasomal cleavage, peptide transport efficiency, and MHC binding affinity. By simulating these steps, such models enable the prediction of which peptides are more likely to be presented to the immune system. These approaches have become essential tools in immunological research and are widely applied in epitope prediction, immunogenicity assessment, and rational vaccine design.
+
+
+=== Epitope Mapping ===
+Epitopes are immune-recognized regions that correspond to specific parts of an antigen identified by the immune system. The identification of these regions plays an important role in vaccine design and antibody development. With the advancement of computational methods, the identification and analysis of epitopes have become more efficient and widely accessible.
+In this context, the National Institute of Allergy and Infectious Diseases has supported epitope-related research through funding the development and continued expansion of the Immune Epitope Database (IEDB), a publicly available resource of experimentally validated epitopes, which integrates large-scale experimental data and supports research in vaccine and therapeutic development.
+
+
+=== Vaccine Design ===
+Computational immunology has significantly facilitated the vaccine design process by reducing both the time and cost associated with vaccine development. Immunoinformatics, which integrates immunology with computational sciences, enables the identification of key antigenic components, as well as the design, evaluation, and optimization of vaccine candidates.
+In recent years, numerous studies have applied immunoinformatics approaches to vaccine design against a wide range of pathogens, including Leptospira (leptospirosis), Brucella, human rhinovirus C, human metapneumovirus, African swine fever virus, Helicobacter pylori, SARS-CoV-2, and Leishmania species associated with visceral leishmaniasis. These examples are illustrative, and a large number of studies have been conducted across diverse infectious diseases.
+
+
+== See also ==
+Bioinformatics
+Immunology
+
+
+== References ==
+
+
+== External links ==
+Boston University Center for Computational Immunology
+York Computational Immunology Lab
+Immunoinformatics Immunological Software and Web Services from Gajendra Pal Singh Raghava group
+VacTarBac A web based platform for predicted vaccine candidates against major pathogens.
--- a/data/en.wikipedia.org/wiki/Computer_Atlas_of_Surface_Topography_of_Proteins-0.md
+++ b/data/en.wikipedia.org/wiki/Computer_Atlas_of_Surface_Topography_of_Proteins-0.md
@ -0,0 +1,52 @@
+---
+title: "Computer Atlas of Surface Topography of Proteins"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Computer_Atlas_of_Surface_Topography_of_Proteins"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:21.130879+00:00"
+instance: "kb-cron"
+---
+
+Computer Atlas of Surface Topography of Proteins (CASTp) aims to provide comprehensive and detailed quantitative characterization of the topographic features of proteins, and has been updated to version 3.0. Since its release in 2006, the CASTp server has received ≈45000 visits and fulfilled ≈33000 calculation requests annually. CASTp has proven to be a reliable tool for a wide range of research, including investigations of signaling receptors, discoveries of cancer therapeutics, understanding of the mechanisms of drug action, studies of immune disorder diseases, analysis of protein–nanoparticle interactions, inference of protein functions, and development of high-throughput computational tools. The server is maintained by Jie Liang's lab at the University of Illinois at Chicago.
+
+
+== Geometric Modeling Principles ==
+CASTp applies the alpha-shape and discrete-flow methods to protein binding sites and to the measurement of pocket size, following the CAST program of Liang et al. (1998), which was updated by Tian et al. in 2018. CAST first identifies the atoms that form the protein pocket, then calculates the volume and area, identifies the atoms forming the rims of the pocket mouth, counts the mouth openings of each pocket, computes the area and circumference of the mouth openings, and finally locates cavities and calculates their size. Secondary structures are calculated by DSSP. Single amino acid annotations are fetched from the UniProt database and then mapped to PDB structures using residue-level information from the SIFTS database.
+
+
+== Instructions of Protein Pocket Calculation ==
+Input
+Protein structures in PDB format, and a probe radius.
+Searching
+Users can either search for pre-computed results by 4-letter PDB ID, or upload their own PDB file for a customized computation. The core algorithm finds pockets or cavities capable of housing a solvent probe, using the default or a user-adjusted probe radius.
+Output
+CASTp identifies all surface pockets, interior cavities, and cross channels, and provides a detailed delineation of all atoms participating in their formation. It reports the area and volume of each pocket or void, as well as the number of mouth openings of a particular pocket ID, using both the solvent accessible surface model (Richards' surface) and the molecular surface model (Connolly surface), all calculated analytically. The core algorithm finds pockets or cavities capable of housing a solvent probe with a radius of 1.4 Å. The online tool also supports PyMOL and UCSF Chimera plugins for molecular visualization.
+
+
+== Why CASTp is Useful? ==
+
+Protein science, from an amino acid to sequences and structures
+Proteins are large, complex molecules that play critical roles in maintaining the normal functioning of the human body. They are essential for the structure, function, and regulation of the body's tissues and organs. Proteins are made up of hundreds of smaller units called amino acids that are attached to one another by peptide bonds, forming a long chain.
+Protein active sites
+The active site of a protein is usually located at its center of action and is the key to its function. The first step is to detect active sites on the protein surface and to describe their features and boundaries exactly. These specifications are vital inputs for subsequent target druggability prediction or target comparison. Most algorithms for active site detection are based on geometric modeling or on calculations of energetic features.
+The role of protein pockets
+The shape and properties of the protein surface determine what interactions are possible with ligands and other macromolecules. Pockets are an important yet ambiguous feature of this surface. In drug discovery, the first step in screening for lead compounds and potential drug molecules is usually a selection based on the shape of the binding pocket. Shape plays a role in many computational pharmacological methods. Existing results indicate that most features important for predicting drug binding depend on the size and shape of the binding pocket, with chemical properties of secondary importance. The surface shape is also important for interactions between protein and water. However, how to define discrete pockets or possible interaction sites remains unclear, because the shape and location of nearby pockets affect the promiscuity and diversity of binding sites. Since most pockets are open to solvent, defining the border of a pocket is the primary difficulty. Pockets closed to solvent are referred to as buried cavities. Because their extent, area, and volume are well defined, buried cavities are more straightforward to locate. In contrast, the border of an open pocket defines its mouth and provides the cut-off for determining the surface area and volume. Even defining the pocket as a set of residues does not define the volume or the mouth of the pocket.
+Druggability role prediction
+In the pharmaceutical industry, the current priority strategy for target assessment is high-throughput screening (HTS). NMR screenings are applied against large compound datasets. Chemical characteristics of compounds binding against specific targets are measured, so how well the compound sets bind to the chemical space will decide the binding efficiency. The success rates of virtually docking drug-like ligands into the active sites of the target proteins are used for prioritization, and most active sites are located in pockets.
+Thanks to the large amount of structural data available, computational methods for druggability prediction from different perspectives have been introduced over the last 30 years with positive results, serving as a vital instrument for accelerating prediction. Many candidates have since been integrated into the drug discovery pipeline.
+
+
+== New Features in CASTp 3.0 ==
+
+Pre-computed results for biological assemblies
+For many proteins deposited in the Protein Data Bank, the asymmetric unit may differ from the biological unit, which can make the computed result biologically irrelevant. CASTp 3.0 therefore computes the topological features for biological assemblies, overcoming the gap between the asymmetric unit and biological assemblies.
+Imprints of negative volumes of topological features
+The first release of the CASTp server in 2006 covered only the geometric and topological features of the surface atoms that participate in forming protein pockets, cavities, and channels. The new CASTp adds the "negative volume" of the space, that is, the space encompassed by the atoms that form these geometric and topological features.
+Comprehensive annotation on single amino-acid polymorphism
+The latest CASTp integrates protein annotations aligned with the sequence, including the brief feature, positions, description, and reference of the domains, motifs, and single amino-acid polymorphisms.
+Improved user interface & convenient visualization
+The new CASTp incorporates 3Dmol.js for structural visualization, allowing users to browse and interact with the protein 3D model and to examine the computational results in current web browsers, including Chrome, Firefox, and Safari. Users can pick their own representation style for the atoms that form each topographic feature and edit the colors to their own preferences.
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Consensus_sequence-0.md
+++ b/data/en.wikipedia.org/wiki/Consensus_sequence-0.md
@ -0,0 +1,44 @@
+---
+title: "Consensus sequence"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Consensus_sequence"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:22.388873+00:00"
+instance: "kb-cron"
+---
+
+In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated sequence of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated. Such information is important when considering sequence-dependent enzymes such as RNA polymerase.
+To address the limitations of consensus sequences—which reduce variability to a single residue per position—sequence logos provide a richer visual representation of aligned sequences. Logos display each position as a stack of letters (nucleotides or amino acids), where the height of a letter corresponds to its frequency in the alignment, and the total stack height reflects the information content (measured in bits). The most frequent residue appears at the top of the stack, preserving the consensus while also revealing subtle patterns, such as functionally important but less frequent residues (e.g., alternative start codons or transcription factor binding sites). 
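+As a minimal Python sketch of the two ideas above, the consensus can be computed as the most frequent residue per alignment column, and the total stack height of a logo as the per-column information content in bits. The toy alignment, the function names, the alphabetical tie-break and the omission of any small-sample correction are assumptions made for illustration.
+
```python
import math
from collections import Counter


def column_counts(alignment):
    """Count residues in each column of equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent residue per column (ties broken alphabetically)."""
    result = []
    for counts in column_counts(alignment):
        best = max(counts.values())
        result.append(min(r for r, c in counts.items() if c == best))
    return "".join(result)


def information_content(alignment, alphabet_size=4):
    """Per-column information in bits: log2(alphabet size) minus entropy."""
    bits = []
    for counts in column_counts(alignment):
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


# Toy alignment of four hypothetical 6-base sites (not real data).
aligned = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
print(consensus(aligned))
print([round(b, 2) for b in information_content(aligned)])
```
+
+For nucleotides a fully conserved column carries 2 bits, while a column split 3:1 between two bases carries less, which is the variability a single consensus letter hides.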
+
+
+== Biological significance ==
+A protein binding site, represented by a consensus sequence, may be a short sequence of nucleotides which is found several times in the genome and is thought to play the same role in its different locations. For example, many transcription factors recognize particular patterns in the promoters of the genes they regulate. In the same way, restriction enzymes usually have palindromic consensus sequences, usually corresponding to the site where they cut the DNA. Transposons act in much the same manner in their identification of target sequences for transposition. Finally, splice sites (sequences immediately surrounding the exon-intron boundaries) can also be considered as consensus sequences.
+Thus a consensus sequence is a model for a putative DNA binding site: it is obtained by aligning all known examples of a certain recognition site and defined as the idealized sequence that represents the predominant base at each position. All the actual examples shouldn't differ from the consensus by more than a few substitutions, but counting mismatches in this way can lead to inconsistencies.
+Any mutation allowing a mutated nucleotide in the core promoter sequence to look more like the consensus sequence is known as an up mutation. This kind of mutation will generally make the promoter stronger, and thus the RNA polymerase forms a tighter bind to the DNA it wishes to transcribe and transcription is up-regulated. On the contrary, mutations that destroy conserved nucleotides in the consensus sequence are known as down mutations. These types of mutations down-regulate transcription since RNA polymerase can no longer bind as tightly to the core promoter sequence.
+
+
+== Sequence analysis ==
+Developing software for pattern recognition is a major topic in genetics, molecular biology, and bioinformatics. Specific sequence motifs can function as regulatory sequences controlling biosynthesis, or as signal sequences that direct a molecule to a specific site within the cell or regulate its maturation. Since the regulatory function of these sequences is important, they are thought to be conserved across long periods of evolution.  In some cases, evolutionary relatedness can be estimated by the amount of conservation of these sites.
+
+
+=== Notation ===
+The conserved sequence motifs are called consensus sequences and they show which residues are conserved and which residues are variable. Consider the following example DNA sequence:
+
+A[CT]N{A}YR
+In this notation, A means that an A is always found in that position; [CT] stands for either C or T; N stands for any base; and {A} means any base except A. Y represents any pyrimidine, and R indicates any purine.
+In this example, the notation [CT] does not give any indication of the relative frequency of C or T occurring at that position, and the pattern cannot be written as a single consensus sequence such as ACNCCA. An alternative method of representing a consensus sequence uses a sequence logo. This is a graphical representation of the consensus sequence, in which the size of a symbol is related to the frequency with which a given nucleotide (or amino acid) occurs at a certain position. In sequence logos, the more conserved the residue, the larger the symbol for that residue is drawn; the less frequent, the smaller the symbol. Sequence logos can be generated using WebLogo, or using the Gestalt Workbench, a publicly available visualization tool written by Gustavo Glusman at the Institute for Systems Biology.
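+The bracket notation maps closely onto regular expressions. The following Python sketch translates a pattern such as A[CT]N{A}YR into a regex, assuming a DNA alphabet of A, C, G and T; the helper name `pattern_to_regex` and the small code table are illustrative, covering only the symbols used in this example.
+
```python
import re

# Ambiguity codes used in the example pattern (DNA alphabet assumed).
CODES = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}


def pattern_to_regex(pattern):
    """Translate a pattern such as A[CT]N{A}YR into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":    # [CT]: any one of the listed bases
            end = pattern.index("]", i)
            parts.append("[" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == "{":  # {A}: any base except those listed
            end = pattern.index("}", i)
            excluded = pattern[i + 1:end]
            allowed = "".join(b for b in "ACGT" if b not in excluded)
            parts.append("[" + allowed + "]")
            i = end + 1
        else:            # literal base or ambiguity code
            parts.append(CODES.get(ch, ch))
            i += 1
    return re.compile("".join(parts))


motif = pattern_to_regex("A[CT]N{A}YR")
print(motif.pattern)
print(bool(motif.fullmatch("ACGTCA")), bool(motif.fullmatch("ACGACA")))
```
+
+Like the notation itself, the resulting regex records which bases are allowed at each position but not how often each one occurs.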
+
+
+== Software ==
+Bioinformatics tools can calculate and visualize consensus sequences. Examples of such tools are Jalview and UGENE.
+
+
+== See also ==
+Position-specific scoring matrix
+Regular expression — denoting multiple sequences of symbols in formal language theory
+Sequence motif
+Sequence logo
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Contact_order-0.md
+++ b/data/en.wikipedia.org/wiki/Contact_order-0.md
@ -0,0 +1,60 @@
+---
+title: "Contact order"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Contact_order"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:23.616584+00:00"
+instance: "kb-cron"
+---
+
+The contact order of a protein is a measure of the locality of the inter-amino acid contacts in the protein's native state tertiary structure. It is calculated as the average sequence distance between residues that form native contacts in the folded protein divided by the total length of the protein. Higher contact orders indicate longer folding times, and low contact order has been suggested as a predictor of potential downhill folding, or protein folding that occurs without a free energy barrier. This effect is thought to be due to the lower loss of conformational entropy associated with the formation of local as opposed to nonlocal contacts.
+Relative contact order (CO) is formally defined as:
+
+$$CO = \frac{1}{L \cdot N} \sum^{N} \Delta S_{i,j}$$
+
+where $N$ is the total number of contacts, $\Delta S_{i,j}$ is the sequence separation, in residues, between contacting residues $i$ and $j$, and $L$ is the total number of residues in the protein. The value of contact order typically ranges from 5% to 25% for single-domain proteins, with lower contact order belonging to mainly helical proteins, and higher contact order belonging to proteins with a high beta-sheet content.
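+The definition can be evaluated directly once a list of contacting residue pairs is available. The Python sketch below assumes the contacts are already given as index pairs (how contacts are detected, for example by a distance cut-off, is not specified here), and the contact list is hypothetical rather than taken from a real structure.
+
```python
def relative_contact_order(contacts, length):
    """Relative contact order: sum of |i - j| over contacts, over L * N.

    contacts: iterable of (i, j) residue-index pairs that are in contact.
    length:   total number of residues L in the protein.
    """
    contacts = list(contacts)
    if not contacts or length <= 0:
        raise ValueError("need at least one contact and a positive length")
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (length * len(contacts))


# Hypothetical contact list for a 50-residue chain.
toy_contacts = [(1, 5), (2, 6), (10, 40), (12, 38)]
print(relative_contact_order(toy_contacts, length=50))
```
+
+In this toy case the two long-range contacts dominate the sum, showing how nonlocal contacts raise the contact order.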
+Protein structure prediction methods are more accurate in predicting the structures of proteins with low contact orders. This may be partly because low contact order proteins tend to be small, but is likely to be explained by the smaller number of possible long-range residue-residue interactions to be considered during global optimization procedures that minimize an energy function. Even successful structure prediction methods such as the Rosetta method overproduce low-contact-order structure predictions compared to the distributions observed in experimentally determined protein structures.
+The percentage of the natively folded contact order can also be used as a measure of the "nativeness" of folding transition states. Phi value analysis in concert with molecular dynamics has produced transition-state models whose contact order is close to that of the folded state in proteins that are small and fast-folding. Further, contact orders in transition states as well as those in native states are highly correlated with overall folding time.
+In addition to their role in structure prediction, contact orders can themselves be predicted based on a sequence alignment, which can be useful in classifying the fold of a novel sequence with some degree of homology to known sequences.
+
+
+== See also ==
+Circuit topology: topological arrangement of contacts
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Critical_Assessment_of_Function_Annotation-0.md
+++ b/data/en.wikipedia.org/wiki/Critical_Assessment_of_Function_Annotation-0.md
@ -0,0 +1,51 @@
+---
+title: "Critical Assessment of Function Annotation"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Critical_Assessment_of_Function_Annotation"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:25.969439+00:00"
+instance: "kb-cron"
+---
+
+The Critical Assessment of Function Annotation (CAFA) is an ongoing community-driven experiment designed to evaluate computational methods for protein function prediction. Organized as a recurring challenge since 2010, CAFA aims to improve the accuracy, transparency, and benchmarking of algorithms that predict the biological function of proteins, using ontologies such as the Gene Ontology (GO). By fostering open and rigorous assessments, CAFA has become a central benchmark in computational biology and bioinformatics.
+
+
+== Overview ==
+CAFA assesses methods by comparing predictions made by participating teams against experimentally determined annotations that accumulate over time in public protein databases. Predictions are submitted blindly before a predefined target accumulation period, during which newly curated experimental data becomes available. This approach enables objective evaluation of methods without bias from known annotations. The goal is to assign current labels from the Gene Ontology, a structured vocabulary describing protein function.
+Over the years, CAFA has included additional subchallenges such as phenotype prediction and the prediction of disease-associated genes.
+
+
+== History ==
+
+
+=== CAFA1 (2010–2011) ===
+CAFA1 was the inaugural challenge, launched in 2010 with results published in Nature Methods in 2013. It established the baseline for method performance and popularized the use of time-delayed evaluation in function prediction. CAFA1 demonstrated that state-of-the-art methods outperformed basic sequence similarity-based methods (like BLAST) but also highlighted that overall performance still lagged behind curated annotations.
+
+
+=== CAFA2 (2013–2014) ===
+Building on CAFA1, CAFA2 increased the scale and diversity of target proteins, and required its participants to submit predictions for a large number of target proteins regardless of whether they have previous annotations or not. It introduced improved metrics including customized semantic-precision recall based scores. This round demonstrated that ensemble methods and domain-specific predictors had improved considerably. Associated papers relating to benchmarking and validation were published in a linked thematic series in GigaScience, and the results were published in Genome Biology in 2016.
+
+
+=== CAFA3 (2016–2017) ===
+CAFA3 marked a major milestone by incorporating large-scale experimental validation into the assessment pipeline. Collaborating with experimental labs, the CAFA3 organizers tested top predictions in Candida albicans, Pseudomonas aeruginosa, and Drosophila melanogaster. This direct validation approach provided biological insights and uncovered novel gene functions. Results were published in Genome Biology in 2019.
+
+
+=== CAFA4 (2019–2020) ===
+CAFA4 expanded its experimental reach further and introduced new model organisms. It featured more extensive phenotype prediction tasks and incorporated community-driven annotations from various resources. Methodologies involving deep learning and protein language models began to gain prominence. CAFA4 also laid the groundwork for integrative approaches combining sequence, structure, and network data.
+
+
+=== CAFA5 (2023) ===
+CAFA5, the most recent iteration, was held as a challenge on the Kaggle website, which dramatically increased the number of participants. The challenge saw significant performance gains across multiple function prediction categories. It also introduced new benchmarking tasks for pathogens and environmental samples. Preliminary results were presented in 2024, with a comprehensive publication expected in 2025.
+
+
+== See also ==
+CASP: Critical Assessment of protein Structure Prediction 
+CAGI: Critical Assessment of Genome Interpretation 
+
+
+== References ==
+
+
+== External links ==
+Automated Function Prediction Special Interest Group - CAFA Challenge participation information
--- a/data/en.wikipedia.org/wiki/Critical_Assessment_of_Genome_Interpretation-0.md
+++ b/data/en.wikipedia.org/wiki/Critical_Assessment_of_Genome_Interpretation-0.md
@ -0,0 +1,14 @@
+---
+title: "Critical Assessment of Genome Interpretation"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Critical_Assessment_of_Genome_Interpretation"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:27.177446+00:00"
+instance: "kb-cron"
+---
+
+The Critical Assessment of Genome Interpretation (CAGI) is an annual bioinformatics competition focused on interpretation of genome variation. CAGI experiments are modeled on the protocols developed in the Critical Assessment of Structure Prediction (CASP) program, adapted to the genomics domain. Over a period of a decade CAGI has conducted five rounds of challenges, attracting 738 submissions from around the world. The results have been published in the journal Human Mutation.
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/Darwin_Core-0.md
+++ b/data/en.wikipedia.org/wiki/Darwin_Core-0.md
@ -0,0 +1,56 @@
+---
+title: "Darwin Core"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Darwin_Core"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:28.394699+00:00"
+instance: "kb-cron"
+---
+
+Darwin Core (often abbreviated to DwC) is an extension of Dublin Core for biodiversity informatics. It is meant to provide a stable standard reference for sharing information on biological diversity (biodiversity). The terms described in this standard are a part of a larger set of vocabularies and technical specifications under development and maintained by Biodiversity Information Standards (TDWG) (formerly the Taxonomic Databases Working Group).
+
+
+== Description ==
+The Darwin Core is a body of standards intended to facilitate the sharing of information about biological diversity. The DwC includes a glossary of terms, and documentation providing reference definitions, examples, and commentary. An overview of the currently adopted terms and concepts can be found in the Darwin Core quick reference guide maintained by TDWG.
+The DwC operational unit is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples, and related information. Included in the standard are documents describing how these terms are managed, how the set of terms can be extended for new purposes, and how the terms can be used.
+Each DwC term includes a definition and discussions meant to promote the consistent use of the terms across applications and disciplines. In other contexts, such terms might be called properties, elements, fields, columns, attributes, or concepts. Though the data types and constraints are not provided in the term definitions, recommendations are made about how to restrict the values where appropriate, for instance by suggesting the use of controlled vocabularies.
+DwC standards are versioned and are constantly evolving, and working groups frequently add to the documentation practical examples that discuss, refine, and expand the normative definitions of each term. This approach to documentation allows the standard to adapt to new purposes without disrupting existing applications.
+In practice, Darwin Core decouples the definition and semantics of individual terms from application of these terms in different technologies. Darwin Core provides separate guidelines on how to encode the terms as RDF, XML or text files.
+The Simple Darwin Core is a specification for one particular way to use the terms and to share data about taxa and their occurrences in a simply structured way. It is likely what is meant if someone were to suggest "formatting your data according to the Darwin Core".
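+
+As a rough illustration of such a simply structured record, the sketch below writes one occurrence as a flat CSV row whose column names are Darwin Core terms. The record values and the identifier `example:occ:1` are invented for the example, and the choice of columns is an assumption; the quick reference guide lists the terms actually available.
+

```python
import csv
import io

# Column names are Darwin Core terms; the record values are invented.
FIELDS = [
    "occurrenceID", "basisOfRecord", "scientificName",
    "eventDate", "decimalLatitude", "decimalLongitude",
]

records = [
    {
        "occurrenceID": "example:occ:1",
        "basisOfRecord": "HumanObservation",
        "scientificName": "Puma concolor",
        "eventDate": "2020-05-17",
        "decimalLatitude": "-41.09",
        "decimalLongitude": "-71.42",
    },
]

# Write a header row followed by one row per occurrence record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(records)
simple_dwc_csv = buffer.getvalue()
print(simple_dwc_csv)
```
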
+
+
+== History ==
+Darwin Core was originally created as a Z39.50 profile by the Z39.50 Biology Implementers Group (ZBIG), supported by funding from a US National Science Foundation award. The name "Darwin Core" was first coined by Allen Allison at the first meeting of the ZBIG held at the University of Kansas in 1998 while commenting on the profile's conceptual similarity with Dublin Core. The Darwin Core profile was later expressed as an XML Schema document for use by the Distributed Generic Information Retrieval (DiGIR) protocol. A TDWG task group was created to revise the Darwin Core, and a ratified metadata standard was officially released on 9 October 2009.
+Though ratified as a standard by Biodiversity Information Standards (TDWG) since then, Darwin Core has had numerous previous versions in production usage. The published standard contains a normative term list with the complete history of the versions of terms leading to the current standard.
+
+
+== Key projects using Darwin Core ==
+The Global Biodiversity Information Facility (GBIF)
+The Ocean Biogeographic Information System (OBIS)
+The Atlas of Living Australia (ALA)
+Online Zoological Collections of Australian Museums (OZCAM)
+Mammal Networked Information System (MaNIS)
+Ornithological Information System (ORNIS)
+FishNet 2
+VertNet
+Canadensys
+Sistema Nature 3.0
+Encyclopedia of Life
+Integrated Digitized Biocollections (iDigBio)
+
+
+== See also ==
+Darwin Core Archive
+Data Curation Network Simple Darwin Core for Non-Biologists Primer
+
+
+== References ==
+
+
+== External links ==
+Darwin Core Quick Reference Guide
+Darwin Core Development Site
+Official Darwin Core Website
+Executive Summary of Darwin Core
+Darwin Core Standard Specifications - GitHub repository where DwC is actively maintained
--- a/data/en.wikipedia.org/wiki/Darwin_Core_Archive-0.md
+++ b/data/en.wikipedia.org/wiki/Darwin_Core_Archive-0.md
@ -0,0 +1,42 @@
+---
+title: "Darwin Core Archive"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Darwin_Core_Archive"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:29.607433+00:00"
+instance: "kb-cron"
+---
+
+Darwin Core Archive (DwC-A) is a biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset for species occurrence, checklist, sampling event or material sample data. Essentially it is a set of text (CSV) files with a simple descriptor (meta.xml) to inform others how your files are organized. The format is defined in the Darwin Core Text Guidelines. It is the preferred format for publishing data to the GBIF network.
+
+
+== Darwin Core ==
+The Darwin Core standard has been used to mobilize the vast majority of specimen occurrence and observational records within the GBIF network. The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatio-temporal occurrence, and their supporting evidence housed in collections (physical or digital).
+The Darwin Core today is broader in scope. It aims to provide a stable, standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core provides stable semantic definitions with the goal of being maximally reusable in a variety of contexts. This means that Darwin Core may still be used in the same way it has historically been used, but may also serve as the basis for building more complex exchange formats, while still ensuring interoperability through a common set of terms.
+
+
+== Archive format ==
+
+The central idea of an archive is that its data files are logically arranged in a star-like manner, with one core data file surrounded by any number of 'extensions'. Each extension record (or 'extension file row') points to a record in the core file; in this way, zero to many extension records can exist for each single core record, a more space-efficient method for data transfer than the alternative of including all the data within a single table which could otherwise contain many empty cells.
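+
+The star-like arrangement can be sketched with two small in-memory tables. The file contents and the column names `id` and `coreid` are hypothetical stand-ins chosen for this example; in a real archive the column layout is declared in the descriptor file rather than assumed.
+

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents standing in for a core file and one extension.
core_csv = """id,scientificName
1,Puma concolor
2,Lama guanicoe
"""
extension_csv = """coreid,vernacularName,language
1,cougar,en
1,puma,es
"""

core_rows = list(csv.DictReader(io.StringIO(core_csv)))

# Group extension rows by the core record they point to.
extensions = defaultdict(list)
for row in csv.DictReader(io.StringIO(extension_csv)):
    extensions[row["coreid"]].append(row)

# Each core record has zero to many extension records.
for record in core_rows:
    names = [ext["vernacularName"] for ext in extensions.get(record["id"], [])]
    print(record["scientificName"], names)
```

+
+The second core record has no extension rows at all, which is the case that would otherwise show up as empty cells in a single wide table.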
+Details about recommended extensions can be found in their respective subsections and will be extensively documented in the GBIF registry, which will catalogue all available extensions.
+Sharing entire datasets instead of using pageable web services like DiGIR and TAPIR allows much simpler and more efficient data transfer. For example, retrieving 260,000 records via TAPIR takes about nine hours, issuing 1,300 HTTP requests to transfer 500 MB of XML-formatted data. The exact same dataset, encoded as DwC-A and zipped, becomes a 3 MB file. Therefore, GBIF highly recommends compressing an archive using ZIP or GZIP when generating a DwC-A.
+An archive requires stable identifiers for core records, but not for extensions. For any kind of shared data it is therefore necessary to have some sort of local record identifiers. It's good practice to maintain, with the original data, identifiers that are stable over time and are not reused after the record is deleted. If you can, please provide globally unique identifiers instead of local ones.
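+
+The figures in this example can be put in perspective with a few lines of arithmetic. The variable names are illustrative only, and the inputs are exactly the numbers quoted above.
+

```python
# Figures quoted for the TAPIR transfer and the zipped archive.
records = 260_000
requests = 1_300
hours = 9
xml_mb = 500
zipped_mb = 3

records_per_request = records / requests        # page size of each request
records_per_second = records / (hours * 3600)   # effective throughput
size_ratio = xml_mb / zipped_mb                 # XML payload vs zipped archive

print(f"{records_per_request:.0f} records per request")
print(f"{records_per_second:.1f} records per second")
print(f"zipped archive is about {size_ratio:.0f} times smaller")
```
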
+
+
+=== Archive descriptor ===
+To be completed.
+
+
+=== Dataset metadata ===
+A Darwin Core Archive should contain a file containing metadata describing the whole dataset. The Ecological Metadata Language (EML) is the most common format for this, but simple Dublin Core files are being used too.
+
+
+== References ==
+
+
+== External links ==
+Darwin Core Quick Reference Guide
+Biodiversity Information Standards (TDWG)
+Global Biodiversity Information Facility (GBIF)
+Biodiversity informatics
--- a/data/en.wikipedia.org/wiki/DeLano_Award_for_Computational_Biosciences-0.md
+++ b/data/en.wikipedia.org/wiki/DeLano_Award_for_Computational_Biosciences-0.md
@ -0,0 +1,40 @@
+---
+title: "DeLano Award for Computational Biosciences"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/DeLano_Award_for_Computational_Biosciences"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:33.300687+00:00"
+instance: "kb-cron"
+---
+
+The DeLano Award for Computational Biosciences is a prize in the field of computational biology. It is awarded annually for "the most accessible and innovative development or application of computer technology to enhance research in the life sciences at the molecular level".
+The prize was established by the American Society for Biochemistry and Molecular Biology (ASBMB) in memory of Warren Lyford DeLano, an American bioinformatician. DeLano developed the PyMOL open source molecular viewer software and was an advocate for the increased adoption of open source practices in the sciences. DeLano died unexpectedly in 2009.
+Laureates include the Nobel Prize winner Michael Levitt, who was given the DeLano Award in 2014 for his work in computational bioscience, and Helen M. Berman, former Director of the Protein Data Bank, who was given the DeLano Award for her leadership in standardizing protein structural data, leading to such developments as AlphaFold.
+
+
+== Laureates ==
+2026 - Roland Dunbrack
+2025 - Rohit Pappu
+2024 - Eytan Ruppin
+2023 - no award
+2022 - Tatyana Sharpee
+2021 - no award
+2020 - Yang Zhang
+2019 - Brian Kuhlman
+2018 - Chris Sander
+2017 - Brian K. Shoichet
+2016 - Todd O. Yeates
+2015 - Vijay S. Pande
+2014 - Michael Levitt
+2013 - Helen M. Berman
+2012 - Barry Honig
+2011 - Axel T. Brunger
+
+
+== See also ==
+List of biology awards
+List of awards in bioinformatics and computational biology
+
+
+== References ==
--- a/data/en.wikipedia.org/wiki/De_novo_protein_structure_prediction-0.md
+++ b/data/en.wikipedia.org/wiki/De_novo_protein_structure_prediction-0.md
@ -0,0 +1,21 @@
+---
+title: "De novo protein structure prediction"
+chunk: 1/3
+source: "https://en.wikipedia.org/wiki/De_novo_protein_structure_prediction"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:30.830807+00:00"
+instance: "kb-cron"
+---
+
+In computational biology, de novo protein structure prediction refers to an algorithmic process by which protein tertiary structure is predicted from its amino acid primary sequence. The problem itself has occupied leading scientists for decades while still remaining unsolved. According to Science, the problem remains one of the top 125 outstanding issues in modern science. At present, some of the most successful methods have a reasonable probability of predicting the folds of small, single-domain proteins within 1.5 angstroms over the entire structure.
+De novo methods, a term first coined by William DeGrado, tend to require vast computational resources, and have thus only been carried out for relatively small proteins. De novo protein structure modeling is distinguished from Template-based modeling (TBM) by the fact that no solved homologue to the protein of interest is used, making efforts to predict protein structure from amino acid sequence exceedingly difficult. Prediction of protein structure de novo for larger proteins will require better algorithms and larger computational resources such as those afforded by either powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing projects (such as Folding@home, Rosetta@home, the Human Proteome Folding Project, or Nutritious Rice for the World). Although computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) to fields such as medicine and drug design make de novo structure prediction an active research field.
+
+== Background ==
+Currently, the gap between known protein sequences and confirmed protein structures is immense. At the beginning of 2008, only about 1% of the sequences listed in the UniProtKB database corresponded to structures in the Protein Data Bank (PDB), leaving a gap between sequence and structure of approximately five million. Experimental techniques for determining tertiary structure have faced serious bottlenecks in their ability to determine structures for particular proteins. For example, whereas X-ray crystallography has been successful in crystallizing approximately 80,000 cytosolic proteins, it has been far less successful in crystallizing membrane proteins – approximately 280. In light of experimental limitations, devising efficient computer programs to close the gap between known sequence and structure is believed to be the only feasible option.
+De novo protein structure prediction methods attempt to predict tertiary structures from sequences based on general principles that govern protein folding energetics and/or statistical tendencies of conformational features that native structures acquire, without the use of explicit templates. Research into de novo structure prediction has been primarily focused into three areas: alternate lower-resolution representations of proteins, accurate energy functions, and efficient sampling methods.
+A general paradigm for de novo prediction involves sampling conformation space, guided by scoring functions and other sequence-dependent biases such that a large set of candidate ("decoy") structures are generated. Native-like conformations are then selected from these decoys using scoring functions as well as conformer clustering. High-resolution refinement is sometimes used as a final step to fine-tune native-like structures. There are two major classes of scoring functions. Physics-based functions are based on mathematical models describing aspects of the known physics of molecular interaction. Knowledge-based functions are formed with statistical models capturing aspects of the properties of native protein conformations.
+
+== Amino acid sequence determines tertiary protein structure ==
+Several lines of evidence have been presented in favor of the notion that primary protein sequence contains all the information required for overall three-dimensional protein structure, making the idea of a de novo protein prediction possible. First, proteins with different functions usually have different amino acid sequences. Second, several different human diseases, such as Duchenne muscular dystrophy, can be linked to loss of protein function resulting from a change in just a single amino acid in the primary sequence. Third, proteins with similar functions across many different species often have similar amino acid sequences. Ubiquitin, for example, is a protein involved in regulating the degradation of other proteins; its amino acid sequence is nearly identical in species as far separated as Drosophila melanogaster and Homo sapiens. Fourth, by thought experiment, one can deduce that protein folding must not be a completely random process and that information necessary for folding must be encoded within the primary structure. For example, suppose that each of 100 amino acid residues within a small polypeptide could take up 10 different conformations on average, giving 10^100 different conformations for the polypeptide. If one possible conformation were tested every 10^-13 second, then it would take about 10^79 years to sample all possible conformations. However, proteins are properly folded within the body on short timescales all the time, meaning that the process cannot be random and, thus, can potentially be modeled.
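+
+The thought experiment can be reproduced directly from its stated assumptions (100 residues, 10 conformations each, one conformation tested every 10^-13 second). The variable names below are illustrative, and the length of a year is taken as 365.25 days; with these inputs the result comes out on the order of 10^79 years.
+

```python
# Inputs taken from the thought experiment in the text.
residues = 100
conformations_per_residue = 10
seconds_per_test = 1e-13

# Assumption for the unit conversion: a year of 365.25 days.
seconds_per_year = 365.25 * 24 * 3600

total_conformations = conformations_per_residue ** residues  # 10^100
total_seconds = total_conformations * seconds_per_test
years = total_seconds / seconds_per_year

print(f"conformations: 10^{len(str(total_conformations)) - 1}")
print(f"exhaustive search time: {years:.1e} years")
```
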
+One of the strongest lines of evidence for the supposition that all the relevant information needed to encode protein tertiary structure is found in the primary sequence was demonstrated in the 1950s by Christian Anfinsen. In a classic experiment, he showed that ribonuclease A could be entirely denatured by being submerged in a solution of urea (to disrupt stabilizing hydrophobic bonds) in the presence of a reducing agent (to cleave stabilizing disulfide bonds). Upon removal of the protein from this environment, the denatured and functionless ribonuclease protein spontaneously refolded and regained function, demonstrating that protein tertiary structure is encoded in the primary amino acid sequence. Had the protein reformed randomly, over one hundred different combinations of four disulfide bonds could have formed. However, in the majority of cases proteins will require the presence of molecular chaperones within the cell for proper folding. The overall shape of a protein may be encoded in its amino acid sequence, but its folding may depend on chaperones to assist in folding.
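+
+The count of possible disulfide arrangements can be checked with a short calculation. It assumes eight cysteine residues, the number needed to form four disulfide bonds, and counts the ways of pairing all of them; the function name `count_pairings` is hypothetical.
+

```python
def count_pairings(n_cysteines):
    """Number of ways to pair all cysteines into disulfide bonds."""
    if n_cysteines % 2:
        raise ValueError("need an even number of cysteines")
    count = 1
    # The first unpaired residue can bond with any of the remaining n-1,
    # the next unpaired residue with any of the remaining n-3, and so on.
    for choices in range(n_cysteines - 1, 0, -2):
        count *= choices
    return count


pairings = count_pairings(8)  # 7 * 5 * 3 * 1
print(pairings)
print(f"chance of the native pairing at random: {1 / pairings:.3%}")
```
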
--- a/data/en.wikipedia.org/wiki/De_novo_protein_structure_prediction-1.md
+++ b/data/en.wikipedia.org/wiki/De_novo_protein_structure_prediction-1.md
@ -0,0 +1,31 @@
+---
+title: "De novo protein structure prediction"
+chunk: 2/3
+source: "https://en.wikipedia.org/wiki/De_novo_protein_structure_prediction"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:30.830807+00:00"
+instance: "kb-cron"
+---
+
+== Successful de novo modeling requirements ==
+De novo conformation predictors usually function by producing candidate conformations (decoys) and then choosing amongst them based on their thermodynamic stability and energy state. Most successful predictors will have the following three factors in common:
+1) An accurate energy function in which the most thermodynamically stable state corresponds to the native structure of a protein
+2) An efficient search method capable of quickly identifying low-energy states through conformational search
+3) The ability to select native-like models from a collection of decoy structures 
+De novo programs will search three-dimensional space and, in the process, produce candidate protein conformations. As a protein approaches its correctly folded, native state, entropy and free energy will decrease. Using this information, de novo predictors can discriminate amongst decoys. Specifically, de novo programs will select possible conformations with lower free energies – which are more likely to be correct than those structures with higher free energies. As stated by David A. Baker regarding how his de novo Rosetta predictor works, “during folding, each local segment of the chain flickers between a different subset of local conformations…folding to the native structure occurs when the conformations adopted by the local segments and their relative orientations allow…low energy features of native protein structures. In the Rosetta algorithm…the program then searches for the combination of these local conformations that has the lowest overall energy.”
+However, some de novo methods work by first enumerating the entire conformational space using a simplified representation of a protein structure, and then selecting the conformations that are most likely to be native-like. An example of this approach is one based on representing protein folds using tetrahedral lattices and building all-atom models on top of all possible conformations obtained using the tetrahedral representation. This approach was used successfully at CASP3 by Michael Levitt's team to predict a protein fold whose topology had not been observed before.
+By developing the QUARK program, Xu and Zhang showed that the ab initio structure of some proteins can be successfully constructed through a knowledge-based force field.
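+
+The three requirements (an energy function, a search method, and selection among decoys) can be shown together in a deliberately simplified sketch. Everything in it is an assumption made for illustration: the conformation is reduced to six torsion angles, the energy function is a toy with its minimum at a made-up `NATIVE` state, and the search is a basic Metropolis Monte Carlo walk. It is not the Rosetta algorithm or any published method.
+

```python
import math
import random

# Made-up "native" torsion angles in degrees; purely illustrative.
NATIVE = [-60.0, -45.0, -60.0, -45.0, -60.0, -45.0]


def energy(angles):
    """Toy energy: zero at NATIVE, rising as angles deviate from it."""
    return sum(1.0 - math.cos(math.radians(a - n)) for a, n in zip(angles, NATIVE))


def metropolis_search(rng, steps=2000, temperature=0.1):
    """Random walk that always accepts downhill moves and sometimes uphill ones."""
    current = [rng.uniform(-180.0, 180.0) for _ in NATIVE]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.gauss(0.0, 20.0)  # perturb one angle
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis criterion: uphill moves pass with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


rng = random.Random(0)
# Each independent run contributes one candidate conformation (decoy).
decoys = [metropolis_search(rng) for _ in range(10)]
# Selection step: keep the decoy with the lowest energy.
selected, selected_e = min(decoys, key=lambda d: d[1])
print(f"lowest-energy decoy: E = {selected_e:.3f}")
```

+
+In this toy the energy function has a single minimum at the native state by construction, so low energy and correctness coincide; the difficulty in practice is that real energy functions only approximate that property.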
+
+== Prediction strategies ==
+If a protein of known tertiary structure shares at least 30% of its sequence with a potential homolog of undetermined structure, comparative methods that overlay the putative unknown structure with the known can be utilized to predict the likely structure of the unknown. However, below this threshold three other classes of strategy are used to determine possible structure from an initial model: ab initio protein prediction, fold recognition, and threading.
+
+Ab Initio Methods: In ab initio methods, an initial effort to elucidate secondary structures (alpha helix, beta sheet, beta turn, etc.) from primary structure is made by utilization of physicochemical parameters and neural net algorithms. From that point, algorithms predict tertiary folding. One drawback to this strategy is that it is not yet capable of incorporating the locations and orientation of amino acid side chains.
+Fold Recognition: In fold recognition strategies, a prediction of secondary structure is first made and then compared to either a library of known protein folds, such as CATH or SCOP, or what is known as a "periodic table" of possible secondary structure forms. A confidence score is then assigned to likely matches.
+Threading: In threading strategies, the fold recognition technique is expanded further. In this process, empirically based energy functions for the interaction of residue pairs are used to place the unknown protein onto a putative backbone as a best fit, accommodating gaps where appropriate. The best interactions are then accentuated in order to discriminate amongst potential decoys and to predict the most likely conformation.
+The goal of both fold recognition and threading strategies is to ascertain whether a fold in an unknown protein is similar to a domain in a known one deposited in a database, such as the Protein Data Bank (PDB). This is in contrast to de novo (ab initio) methods, where structure is determined using a physics-based approach in lieu of comparing folds in the protein to structures in a database.
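+
+The 30% criterion that separates comparative methods from the other strategies can be sketched as a simple decision rule. The sketch assumes the two sequences are already aligned, counts identity over ungapped columns only (other definitions of percent identity exist), and uses hypothetical function names and example sequences.
+

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical positions between two pre-aligned sequences ('-' = gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def choose_strategy(identity, threshold=30.0):
    """Pick a broad class of method from the sequence identity to a known structure."""
    if identity >= threshold:
        return "comparative modeling"
    return "ab initio, fold recognition, or threading"


# Invented example sequences, already aligned.
identity = percent_identity("MKTAYIAK", "MKSAYLAK")
print(f"{identity:.0f}% identity -> {choose_strategy(identity)}")
```
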
+
+== Limitations of de novo prediction methods ==
+A major limitation of de novo protein prediction methods is the extraordinary amount of computer time required to successfully solve for the native conformation of a protein. Distributed methods, such as Rosetta@home, have attempted to ameliorate this by recruiting individuals who then volunteer idle home computer time in order to process data. Even these methods face challenges, however. For example, a distributed method was utilized by a team of researchers at the University of Washington and the Howard Hughes Medical Institute to predict the tertiary structure of the protein T0283 from its amino acid sequence. In a blind test comparing the accuracy of this distributed technique with the experimentally confirmed structure deposited within the Protein Data Bank (PDB), the predictor produced excellent agreement with the deposited structure. However, the time and number of computers required for this feat were enormous – almost two years and approximately 70,000 home computers, respectively.
+One method proposed to overcome such limitations involves the use of Markov models (see Markov chain Monte Carlo). One possibility is that such models could be constructed in order to assist with free energy computation and protein structure prediction, perhaps by refining computational simulations. Another way of circumventing the computational power limitations is using coarse-grained modeling. Coarse-grained protein models allow for de novo structure prediction of small proteins, or large protein fragments, in a short computational time.
--- a/data/en.wikipedia.org/wiki/De_novo_protein_structure_prediction-2.md
+++ b/data/en.wikipedia.org/wiki/De_novo_protein_structure_prediction-2.md
@ -0,0 +1,38 @@
+---
+title: "De novo protein structure prediction"
+chunk: 3/3
+source: "https://en.wikipedia.org/wiki/De_novo_protein_structure_prediction"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:30.830807+00:00"
+instance: "kb-cron"
+---
+
+=== Structure prediction of de novo proteins ===
+Another limitation of protein structure prediction software concerns a specific class of proteins, namely de novo proteins. Structure prediction software such as AlphaFold relies on co-evolutionary data derived from multiple sequence alignment (MSA) and homologous protein sequences to predict structures of proteins. However, by definition, de novo proteins lack homologous sequences, as they are evolutionarily new. Thus, structure prediction software that relies on such homology can be expected to perform poorly in predicting structures of de novo proteins. To improve accuracy of structure prediction for de novo proteins, new software has been developed. For example, ESMFold is a newly developed large language model (LLM) for the prediction of protein structures based solely on their amino acid sequences. It can predict a 3D structure of a protein with atomic-level resolution with an input of a single amino acid sequence.
+
+== Critical assessment of protein structure prediction ==
+“Progress for all variants of computational protein structure prediction methods is assessed in the biannual, community wide Critical Assessment of Protein Structure Prediction (CASP) experiments. In the CASP experiments, research groups are invited to apply their prediction methods to amino acid sequences for which the native structure is not known but to be determined and to be published soon. Even though the number of amino acid sequences provided by the CASP experiments is small, these competitions provide a good measure to benchmark methods and progress in the field in an arguably unbiased manner.”
+
+== Notes ==
+Samudrala, R, Xia, Y, Huang, E.S., Levitt, M. Ab initio prediction of protein structure using a combined hierarchical approach. (1999). Proteins Suppl 3: 194-198.
+Bradley, P.; Malmstrom, L.; Qian, B.; Schonbrun, J.; Chivian, D.; Kim, D. E.; Meiler, J.; Misura, K. M.; Baker, D. (2005). "Free modeling with Rosetta in CASP6". Proteins. 61 (Suppl 7): 128–34. doi:10.1002/prot.20729. PMID 16187354. S2CID 36366681.
+Bonneau; Baker, D (2001). "Ab Initio Protein Structure Prediction: Progress and Prospects". Annu. Rev. Biophys. Biomol. Struct. 30: 173–89. doi:10.1146/annurev.biophys.30.1.173. PMID 11340057.
+J. Skolnick, Y. Zhang and A. Kolinski. Ab Initio modeling. Structural genomics and high throughput structural biology. M. Sundsrom, M. Norin and A. Edwards, eds. 2006: 137-162.
+J Lee, S Wu, Y Zhang. Ab initio protein structure prediction. From Protein Structure to Function with Bioinformatics, Chapter 1, Edited by D. J. Rigden, (Springer-London, 2009), P. 1-26.
+
+== See also ==
+Protein structure prediction
+Protein structure prediction software
+Protein design
+
+== References ==
+
+== External links ==
+CASP
+Folding@Home
+HPF project
+Foldit
+UniProtKB
+Protein Data Bank (PDB)
+Expert Protein Analysis System - links to protein prediction tools
--- a/data/en.wikipedia.org/wiki/De_novo_transcriptome_assembly-0.md
+++ b/data/en.wikipedia.org/wiki/De_novo_transcriptome_assembly-0.md
@ -0,0 +1,34 @@
+---
+title: "De novo transcriptome assembly"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/De_novo_transcriptome_assembly"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T14:02:32.097657+00:00"
+instance: "kb-cron"
+---
+
+De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.
+
+== Introduction ==
+As a result of the development of novel sequencing technologies, the years between 2008 and 2012 saw a large drop in the cost of sequencing. Per megabase and per genome, the cost dropped to 1/100,000th and 1/10,000th of the earlier price, respectively. Prior to this, only transcriptomes of organisms that were of broad interest and utility to scientific research were sequenced; however, the high-throughput sequencing (also called next-generation sequencing) technologies developed in the 2010s are both cost- and labor-effective, and the range of organisms studied via these methods is expanding. Transcriptomes have subsequently been created for chickpea, planarians, Parhyale hawaiensis, as well as the brains of the Nile crocodile, the corn snake, the bearded dragon, and the red-eared slider, to name just a few.
+Examining non-model organisms can provide novel insights into the mechanisms underlying the "diversity of fascinating morphological innovations" that have enabled the abundance of life on planet Earth. In animals and plants, the "innovations" that cannot be examined in common model organisms include mimicry, mutualism, parasitism, and asexual reproduction. De novo transcriptome assembly is often the preferred method for studying non-model organisms, since it is cheaper and easier than building a genome, and reference-based methods are not possible without an existing genome. However, this process can be computationally challenging for species with large or complex genomes such as many amphibians, often requiring specialized bioinformatics pipelines to ensure that the assembly is accurate. After de novo assembly, the transcriptomes of these organisms can reveal novel proteins and their isoforms that are implicated in such unique biological phenomena.
+
+=== De novo vs. reference-based assembly ===
+A set of assembled transcripts allows for initial gene expression studies. Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping onto a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is disadvantaged by its inability to account for instances of structural alterations of mRNA transcripts, such as alternative splicing. Since a genome contains the sum of all introns and exons that may be present in a transcript, spliced variants that do not align continuously along the genome may be discounted as actual protein isoforms. Even if a reference genome is available, de novo assembly should be performed, as it can recover transcripts that are transcribed from segments of the genome that are missing from the reference genome assembly.
+
+=== Transcriptome vs. genome assembly ===
+Unlike genome sequence coverage levels, which can vary randomly as a result of repeat content in non-coding intron regions of DNA, transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced isoforms or to minor variation among members of a gene family. Genome assemblers cannot be used directly for transcriptome assembly for several reasons. First, genome sequencing depth is usually the same across a genome, but the depth of transcripts can vary. Second, both strands are always sequenced in genome sequencing, but RNA-seq can be strand-specific. Third, transcriptome assembly is more challenging because transcript variants from the same gene can share exons and are difficult to resolve unambiguously.
+
+== Method ==
+
+=== RNA-seq ===
+
+Once RNA is extracted and purified from cells, it is sent to a high-throughput sequencing facility, where it is first reverse transcribed to create a cDNA library. This cDNA can then be fragmented into various lengths depending on the platform used for sequencing. Each of the following platforms utilizes a different type of technology to sequence millions of short reads: 454 Sequencing, Illumina, and SOLiD.
+
+=== Assembly algorithms ===
+
+The cDNA sequence reads are assembled into transcripts via a short read transcript assembly program. Most likely, some amino acid variations among transcripts that are otherwise similar reflect different protein isoforms. It is also possible that they represent different genes within the same gene family, or even genes that share only a conserved domain, depending on the degree of variation.
+A number of assembly programs are available (see Assemblers). Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges. Whereas high sequence coverage for a genome may indicate the presence of repetitive sequences (which are therefore masked), for a transcriptome it may indicate abundance. In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts. Finally, it can be difficult to reconstruct and tease apart all splicing isoforms.
+Short read assemblers generally use one of two basic algorithms: overlap graphs and de Bruijn graphs. Overlap graphs are utilized for most assemblers designed for Sanger sequenced reads. The overlap between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This approach is more computationally intensive than the de Bruijn graph approach, and is most effective in assembling fewer reads with a high degree of overlap.
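The overlap-graph idea can be illustrated with a minimal Python sketch. It is not taken from any particular assembler: the function names, the toy reads, and the `min_overlap` threshold are all hypothetical, and the brute-force comparison of every ordered pair of reads is exactly what makes the approach expensive for large read sets.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Build a directed graph: one node per read, one edge per suffix-prefix overlap."""
    graph = {read: {} for read in reads}
    # Every ordered pair of reads is compared, so the cost grows quadratically.
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[a][b] = length
    return graph


# Toy reads (hypothetical) that tile a short sequence.
reads = ["ATGGCGT", "GCGTACC", "TACCGGA"]
overlap_graph = build_overlap_graph(reads)
for read, edges in overlap_graph.items():
    print(read, "->", edges)
```

Real assemblers avoid the all-against-all comparison with indexing structures and tolerate sequencing errors in the overlaps; this sketch requires exact matches.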
+De Bruijn graphs align k-mers (usually 25-50 bp) based on an overlap of k-1 bases between consecutive k-mers to create contigs. Because the k-mers are shorter than the reads, they can be hashed quickly, so operations on de Bruijn graphs are generally less computationally intensive.
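As a rough illustration of the de Bruijn approach, the following Python sketch splits reads into k-mers, links the (k-1)-base prefix of each k-mer to its (k-1)-base suffix, and follows unbranched edges to spell out a contig. The function names and toy reads are hypothetical, k is set to 4 only to keep the example readable (far below the 25-50 bp used in practice), and the walk ignores in-degree, sequencing errors, and reverse complements, all of which a real assembler has to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Prefix and suffix of a k-mer overlap by k-2 bases; the k-mer is the edge.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop rather than loop forever on a cycle
            break
        contig += next_node[-1]  # each step adds one new base
        visited.add(next_node)
        node = next_node
    return contig


# Toy reads (hypothetical) that tile a short sequence.
reads = ["ATGGCGT", "GCGTACC", "TACCGGA"]
de_bruijn_graph = build_de_bruijn(reads, k=4)
contig = walk_contig(de_bruijn_graph, "ATG")
print(contig)
```

Because the k-mers are stored in a hash-based dictionary, the graph is built in a single pass over the reads, with no pairwise read comparison. In a transcriptome, nodes with more than one successor would be where alternative isoforms or gene family members diverge, and this simple walk stops there.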