kb/Biomedical_text_mining-1.md at 981fd04123e07c5ef73d52c75faaff4a733baa12

turtle89431 2e50ba1868 Scrape wikipedia-science: 15617 new, 4054 updated, 20200 total (kb-cron)

2026-05-05 07:02:36 -07:00

7.0 KiB


title	chunk	source	category	tags	date_saved	instance
Biomedical text mining	2/3	https://en.wikipedia.org/wiki/Biomedical_text_mining	reference	science, encyclopedia	2026-05-05T14:01:43.230057+00:00	kb-cron

=== Information extraction ===

Information extraction (IE) is the process of automatically extracting structured information from unstructured or partially structured text. IE can involve several or all of the activities described above, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text into a more structured form, such as the contents of a template or knowledge base. In the biomedical domain, IE is used to generate links between concepts described in text, such as "gene A inhibits gene B" and "gene C is involved in disease G". Biomedical knowledge bases containing this type of information are generally the product of extensive manual curation, so replacing manual effort with automated methods remains a compelling area of research.
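As a rough illustration of turning free text into structured triples, the following sketch extracts statements of the form "gene A inhibits gene B" with a hand-written pattern. The `Relation` type, the `extract_relations` function, and the small verb list are assumptions made for this example; practical systems generally rely on trained named entity recognition and relation extraction models rather than regular expressions.

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    """One (subject, predicate, object) triple pulled from text."""
    subject: str
    predicate: str
    obj: str


# Hypothetical, deliberately tiny pattern set for illustration only.
# An "entity" here is just the word "gene" or "disease" plus a capitalized label.
ENTITY = r"(?:gene|disease)\s+[A-Z]\w*"
PATTERN = re.compile(
    rf"(?P<subject>{ENTITY})\s+"
    r"(?P<predicate>inhibits|activates|is involved in)\s+"
    rf"(?P<obj>{ENTITY})"
)


def extract_relations(text: str) -> list[Relation]:
    """Return every triple in `text` that matches the toy pattern."""
    return [
        Relation(m["subject"], m["predicate"], m["obj"])
        for m in PATTERN.finditer(text)
    ]


sentence = "We show that gene A inhibits gene B and gene C is involved in disease G."
for relation in extract_relations(sentence):
    print(relation)
```

The resulting triples could be written to a template or loaded into a knowledge base, which is the "more structured form" the paragraph above refers to.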

=== Information retrieval and question answering ===

Biomedical text mining supports applications for identifying documents and concepts that match search queries. Search engines such as PubMed allow users to query literature databases with words or phrases present in document contents, metadata, or indices such as MeSH. Similar approaches may be used for medical literature retrieval. For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.

On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset (CORD-19) to enable text mining of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the Allen Institute for AI. Other participants include Google, Microsoft Research, the Center for Security and Emerging Technology, and the Chan Zuckerberg Initiative.
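The idea of matching a query against both document text and index terms can be sketched with a small inverted index. This is a toy under stated assumptions, not how PubMed is implemented: the three documents are invented, the index terms are only MeSH-style labels attached by hand, and `build_index` and `search` are hypothetical helper names.

```python
from collections import defaultdict

# Invented mini-collection for illustration; these are not real PubMed records.
DOCS = {
    "doc1": {"text": "BRCA1 mutations and breast cancer risk",
             "index_terms": ["Breast Neoplasms", "Genes, BRCA1"]},
    "doc2": {"text": "Statin therapy lowers cholesterol",
             "index_terms": ["Hypercholesterolemia"]},
    "doc3": {"text": "Screening for breast tumours in older women",
             "index_terms": ["Breast Neoplasms", "Mass Screening"]},
}


def tokenize(s):
    """Lowercase, split on whitespace, and strip commas and periods."""
    return [t.strip(",.").lower() for t in s.split() if t.strip(",.")]


def build_index(docs):
    """Map each token (from text or index terms) to the set of document IDs."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field in [doc["text"], *doc["index_terms"]]:
            for token in tokenize(field):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of documents containing every query token (boolean AND)."""
    hits = [index.get(token, set()) for token in tokenize(query)]
    if not hits:
        return []
    return sorted(set.intersection(*hits))


index = build_index(DOCS)
print(search(index, "breast neoplasms"))
print(search(index, "breast cancer"))
```

In this sketch the query "breast neoplasms" also retrieves the document whose text says "tumours", because the match comes from its index terms rather than its wording, which is one reason controlled vocabularies are useful for retrieval.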

== Resources ==

=== Corpora ===

The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as MeSH. Items marked "Yes" under "Freely Available" can be downloaded from a publicly accessible location.

=== Word embeddings ===

Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as word vectors or word embeddings. Sources of pre-trained embeddings specific to biomedical vocabulary are listed in the table below. The majority are results of the word2vec model developed by Mikolov et al. or of word2vec variants.
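To make "vocabulary mapped to vectors of real numbers" concrete, the sketch below compares words by cosine similarity. The three-dimensional vectors are made up for this example and carry no real meaning; actual pre-trained embeddings typically have hundreds of dimensions and are loaded from a file. The names `cosine` and `nearest` are illustrative.

```python
import math

# Made-up 3-dimensional vectors for illustration only.
EMBEDDINGS = {
    "tumor": [0.9, 0.1, 0.0],
    "neoplasm": [0.8, 0.2, 0.1],
    "aspirin": [0.0, 0.9, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, embeddings):
    """Return the other vocabulary word whose vector is most similar to `word`."""
    query = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(query, embeddings[w]))


print(nearest("tumor", EMBEDDINGS))
```

With these toy values, "tumor" lies closest to "neoplasm", mirroring the way trained embeddings tend to place related terms near each other.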

== Applications ==

Text mining applications in the biomedical field include computational approaches that assist studies of protein docking, protein interactions, and protein-disease associations. Text mining techniques have several advantages over traditional manual curation for identifying associations. Text mining algorithms can identify and extract information from a vast body of literature more efficiently than manual curation can. They can also integrate data from different sources, including literature, databases, and experimental results. These algorithms have changed how novel genes and previously overlooked gene-disease associations are identified and prioritized.

These methods underpin systematic searches of overlooked scientific and biomedical literature that may contain significant associations between lines of research. Combining information, especially by integrating datasets, can lead to new discoveries and hypotheses. The quality of a database is as important as its size. Promising text mining resources such as iProLINK (integrated Protein Literature Information and Knowledge) have been developed to curate data sources that can aid text mining research in bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. Curated databases such as UniProt can speed access to targeted information, not only for genetic sequences but also for literature and phylogeny.

=== Gene cluster identification ===

Methods have been developed for determining the association between gene clusters obtained from microarray experiments and the biological context provided by the corresponding literature.

=== Protein interactions ===

Automatic extraction of protein interactions and of associations between proteins and functional concepts (e.g. Gene Ontology terms) has been explored. The search engine PIE was developed to identify and return protein-protein interaction mentions from MEDLINE-indexed articles. The extraction of kinetic parameters and of the subcellular location of proteins from text has also been addressed by information extraction and text mining technology.

=== Gene-disease associations ===

Computational gene prioritization is an essential step in understanding the genetic basis of diseases, particularly within genetic linkage analysis. Text mining and other computational tools extract relevant information, including gene-disease associations, from numerous data sources, then apply ranking algorithms to prioritize genes by their relevance to a specific disease. This lets researchers focus their efforts on the most promising candidates for further study. Computational tools for gene prioritization continue to be developed and analyzed. One group studied the performance of various text mining techniques for disease gene prioritization, investigating different domain vocabularies, text representation schemes, and ranking algorithms in order to establish a benchmark and find the best approach for identifying disease-causing genes.
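The extract-then-rank workflow described above can be sketched with a very simple co-occurrence score. This is only one possible ranking scheme, chosen here for brevity, and is not the method used in the study mentioned: the abstracts and gene symbols are invented placeholders, and `prioritize` is a hypothetical function that scores each candidate gene by the fraction of its mentions that appear alongside the disease term.

```python
import re
from collections import Counter

# Invented abstracts and placeholder gene symbols, for illustration only.
ABSTRACTS = [
    "GENE1 variants are associated with cardiomyopathy in two cohorts.",
    "Loss of GENE1 and GENE2 was observed in cardiomyopathy patients.",
    "GENE3 regulates hair colour in mice.",
    "GENE2 expression is unchanged in cardiomyopathy.",
    "GENE1 is highly expressed in liver tissue.",
]
CANDIDATES = ["GENE1", "GENE2", "GENE3"]


def prioritize(abstracts, candidates, disease):
    """Rank genes by the share of their mentions that co-occur with `disease`."""
    mentions = Counter()
    co_mentions = Counter()
    for abstract in abstracts:
        words = set(re.findall(r"\w+", abstract))
        has_disease = disease.lower() in abstract.lower()
        for gene in candidates:
            if gene in words:
                mentions[gene] += 1
                if has_disease:
                    co_mentions[gene] += 1
    scores = {
        g: (co_mentions[g] / mentions[g]) if mentions[g] else 0.0
        for g in candidates
    }
    # Highest score first; ties broken by co-mention count, then by name.
    ranked = sorted(candidates, key=lambda g: (-scores[g], -co_mentions[g], g))
    return [(g, scores[g]) for g in ranked]


for gene, score in prioritize(ABSTRACTS, CANDIDATES, "cardiomyopathy"):
    print(gene, round(score, 2))
```

Note that the fourth toy abstract reports an unchanged expression level yet still counts as a co-mention. Plain co-occurrence cannot tell a positive finding from a negative one, which is part of why different vocabularies, text representations, and ranking algorithms are compared in benchmarking work.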

=== Gene-trait associations ===

An agricultural genomics group identified genes related to bovine reproductive traits using text mining, among other approaches.

==== Applications of phrase mining to disease associations ====

A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted a phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), to semantically score all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The study validated existing relationships and pointed to previously unrecognized biological processes in cardiovascular pathophysiology.

== Software tools ==
