Scrape wikipedia-science: 6185 new, 3208 updated, 9666 total (kb-cron)
parent 712b063c02
commit 5f7a7ab0ae
@@ -4,7 +4,7 @@ chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Academic_Torrents"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:31:28.240761+00:00"
|
||||
date_saved: "2026-05-05T09:52:40.258053+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
34
data/en.wikipedia.org/wiki/Academic_journal-0.md
Normal file
@@ -0,0 +1,34 @@
|
||||
---
|
||||
title: "Academic journal"
|
||||
chunk: 1/5
|
||||
source: "https://en.wikipedia.org/wiki/Academic_journal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:46.563930+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
An academic journal or scholarly journal is a periodical publication in which scholarship relating to a particular academic discipline is published. Such journals serve as permanent and transparent forums for the dissemination, scrutiny, and discussion of research. Unlike professional magazines or trade magazines, their articles are mostly written by researchers rather than staff writers employed by the journal. They nearly universally require peer review for research articles or other scrutiny from contemporaries competent and established in their respective fields. Academic journals trace their origins back to the 17th century, with the Philosophical Transactions of the Royal Society being established in 1665 as the first scientific journal.
|
||||
As of 2012, it is estimated that over 28,100 active academic journals are in publication, with scopes ranging from the general sciences, as seen in journals like Science and Nature, to highly specialized fields. These journals publish a variety of articles including original research, review articles, and perspectives. The advent of electronic publishing has made academic journals more accessible.
|
||||
|
||||
== Content ==
|
||||
|
||||
Content usually takes the form of articles presenting original research, review articles, or book reviews. The purpose of an academic journal, according to Henry Oldenburg (the first editor of Philosophical Transactions of the Royal Society), is to give researchers a venue to "impart their knowledge to one another, and contribute what they can to the Grand design of improving natural knowledge, and perfecting all Philosophical Arts, and Sciences."
|
||||
The term academic journal applies to scholarly publications in all fields; this includes journals that cover formal sciences, natural sciences, social sciences, and humanities, which differ somewhat from each other in form and function. Academic journals in the formal and natural sciences are often called scientific journals. Most journals are highly specialized, although some of the oldest journals such as Science and Nature publish articles and scientific papers across a wide range of scientific fields.
|
||||
Although academic journals are superficially similar to professional magazines (or trade journals), they are quite different. Articles in academic journals are written by active researchers such as students, scientists, and professors. Their intended audience is others in the field, meaning their content is highly technical. Academic articles also deal with research, and are peer reviewed. Meanwhile, trade journals are aimed at people in different fields, focusing on how people in those fields can do their jobs better.
|
||||
Active academic researchers are expected to publish their work in academic journals, and public funding bodies often require research results to be published in academic journals. Academic credentials for promotion into academic ranks are established in large part by the number and impact of scientific articles published. This places pressure on researchers to publish articles frequently – an environment known as publish or perish.
|
||||
|
||||
== History ==
|
||||
|
||||
In the 17th century, scientists wrote letters to each other, and included scientific ideas with them. Then, in the mid-17th century, scientists began to hold meetings and share their scientific ideas. Eventually, these meetings led to the founding of organizations such as the Royal Society (1660) and the French Academy of Sciences (1666).
|
||||
The idea of a published journal with the purpose of "[letting] people know what is happening in the Republic of Letters" was first conceived by François Eudes de Mézeray in 1663. A publication titled Journal littéraire général was supposed to be published to fulfill that goal, but never was. Humanist scholar Denis de Sallo (under the pseudonym "Sieur de Hédouville") and printer Jean Cusson took Mézeray's idea, and obtained a royal privilege from King Louis XIV on 8 August 1664 to establish the Journal des sçavans. The journal's first issue was published on 5 January 1665. It was aimed at people of letters, and had four main objectives:
|
||||
|
||||
review newly published major European books,
|
||||
publish the obituaries of famous people,
|
||||
report on discoveries in arts and science, and
|
||||
report on the proceedings and censures of both secular and ecclesiastical courts, as well as those of universities both in France and outside.
|
||||
Soon after, the Royal Society established Philosophical Transactions of the Royal Society in March 1665, and the Académie des Sciences established the Mémoires de l'Académie des Sciences in 1666, which focused on scientific communications. By the end of the 18th century, nearly 500 such periodicals had been published, the vast majority coming from Germany (304 periodicals), France (53), and England (34). Several of those publications, in particular the German journals, tended to be short-lived (under five years). A. J. Meadows estimated that the number of journals reached 10,000 in 1950 and 71,000 in 1987. Michael Mabe wrote that the estimates vary depending on the definition of what exactly counts as a scholarly publication, but that the growth rate has been "remarkably consistent over time", with an average rate of 3.46% per year from 1800 to 2003.
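To make the quoted 3.46% average annual growth rate more tangible, the following is a minimal sketch that converts it into a doubling time and a cumulative growth factor. It assumes constant compound growth across the whole 1800 to 2003 period, which is a simplification of the cited estimate; the function names are illustrative and do not come from the source.

```python
import math


def doubling_time(annual_rate: float) -> float:
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


def growth_factor(annual_rate: float, years: int) -> float:
    """Cumulative multiplier after compounding the annual rate for `years` years."""
    return (1 + annual_rate) ** years


# 3.46% per year is the average rate quoted in the text for 1800-2003.
RATE = 0.0346

if __name__ == "__main__":
    print(f"doubling time: {doubling_time(RATE):.1f} years")
    print(f"growth factor 1800-2003: {growth_factor(RATE, 2003 - 1800):.0f}x")
```

Under that assumption, the number of journals doubles about every 20 years and grows roughly a thousandfold over the two centuries.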
|
||||
In 1733, Medical Essays and Observations was established by the Medical Society of Edinburgh as the first fully peer-reviewed journal. Peer review was introduced as an attempt to increase the quality and pertinence of submissions. Other important events in the history of academic journals include the establishment of Nature (1869) and Science (1880), the establishment of Postmodern Culture in 1990 as the first online-only journal, the foundation of arXiv in 1991 for the dissemination of preprints to be discussed prior to publication in a journal, and the establishment of PLOS One in 2006 as the first megajournal.
|
||||
Peer review did not become common practice until the 1970s, when it was seen as a way of enabling lesser-known researchers to have their papers published in more prestigious journals. Though it was originally done by mailing copies of papers to reviewers, it is now done online.
|
||||
|
||||
== Scholarly articles ==
|
||||
40
data/en.wikipedia.org/wiki/Academic_journal-1.md
Normal file
@@ -0,0 +1,40 @@
|
||||
---
|
||||
title: "Academic journal"
|
||||
chunk: 2/5
|
||||
source: "https://en.wikipedia.org/wiki/Academic_journal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:46.563930+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
There are two kinds of article or paper submissions in academia: solicited, where an individual has been invited to submit work either through direct contact or through a general submissions call, and unsolicited, where an individual submits a work for potential publication without directly being asked to do so. Upon receipt of a submitted article, editors at the journal determine whether to reject the submission outright or begin the process of peer review. In the latter case, the submission becomes subject to review by outside scholars of the editor's choosing who typically remain anonymous. The number of these peer reviewers (or "referees") varies according to each journal's editorial practice – typically, no fewer than two, though sometimes three or more, experts in the subject matter of the article produce reports upon the content, style, and other factors, which inform the editors' publication decisions.
|
||||
Though these reports are generally confidential, some journals and publishers also practice public peer review. The editors either choose to reject the article, ask for a revision and resubmission, or accept the article for publication. Even accepted articles are often subjected to further (sometimes considerable) editing by journal editorial staff before they appear in print. The peer review can take from several weeks to several months.
|
||||
Many journal articles are broadly structured according to the IMRAD scheme. Each article has several sections, often including the following:
|
||||
|
||||
The title;
|
||||
Information about the author(s);
|
||||
The abstract, which is a one-paragraph summary of the article;
|
||||
The introduction, including a background, why the research was done, research on this topic that has been done before, and (possibly) a hypothesis;
|
||||
The methodology or method, which includes the way the research was done, details concerning the study's sample, measures for assessment, and the procedure;
|
||||
Findings or results, which summarize what the study found;
|
||||
Conclusion, comments, or discussion, which both explain how the results answered the questions that were posed, as well as areas that could be researched in the future;
|
||||
A list of works that the article's author cited.
|
||||
Reading an article in an academic journal usually entails first reading the title, to see if it is related to the desired topic. If it is, the next step is to read the abstract (or summary or conclusion, if the abstract is missing), to determine if the article is worth reading.
|
||||
Publishing research results is an essential part of helping science to advance. If scientists are describing experiments or calculations, they should also explain how they did them so that an independent researcher could repeat the experiment or calculation to verify the results, or so that they could evaluate whatever the research article's findings were. Each journal article becomes part of the permanent scientific record.
|
||||
|
||||
=== Types of article ===
|
||||
Articles can also be categorized by their purpose. The exact terminology and definitions vary by field and specific journal, but often include:
|
||||
|
||||
Letters (also called communications, and not to be confused with letters to the editor) are short descriptions of important current research findings that are usually fast-tracked for immediate publication because they are considered urgent.
|
||||
Research notes are short descriptions of current research findings that are considered less urgent or important than Letters.
|
||||
Articles are usually between five and twenty pages and are complete descriptions of current original research findings, but there are considerable variations between different fields and journals—80-page articles are not rare in mathematics or theoretical computer science.
|
||||
Supplemental articles contain a large volume of tabular data that is the result of current research and may be dozens or hundreds of pages with mostly numerical data. Some journals now only publish this data electronically on the Internet. Supplemental information also contains other voluminous material not appropriate for the main body of the article, like descriptions of routine procedures, derivations of equations, source code, non-essential data, spectra or other such miscellaneous information.
|
||||
A target article in a journal is one which argues a case, to which other authors submit a commentary or a response. There may be a final response from the author of the target article. See, for example, Alison Gopnik's article How we know our minds: The illusion of first-person knowledge of intentionality in the journal Behavioral and Brain Sciences, Volume 16, Issue 1 (1993), which was one of a pair of "target articles" to which other responses were published in the same volume.
|
||||
Review articles do not cover original research but rather accumulate the results of many different articles on a particular topic into a coherent narrative about the state of the art in that field. Review articles provide information about the topic and also provide journal references to the original research. Reviews may be entirely narrative, or may provide quantitative summary estimates resulting from the application of meta-analytical methods.
|
||||
Data papers are articles dedicated to describing datasets. This type of article is becoming popular, and journals exclusively dedicated to them have been established, e.g. Scientific Data and Earth System Science Data.
|
||||
Video papers are a recent addition to the practice of academic publishing. They most often combine an online video demonstration of a new technique or protocol with a rigorous textual description.
|
||||
|
||||
== Reviewing ==
|
||||
|
||||
=== Review articles ===
|
||||
31
data/en.wikipedia.org/wiki/Academic_journal-2.md
Normal file
@@ -0,0 +1,31 @@
|
||||
---
|
||||
title: "Academic journal"
|
||||
chunk: 3/5
|
||||
source: "https://en.wikipedia.org/wiki/Academic_journal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:46.563930+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Review articles, also called "reviews of progress", are checks on the research published in journals. Some journals are devoted entirely to review articles, some contain a few in each issue, and others do not publish review articles. Such reviews often cover the research from the preceding year, some for longer or shorter terms; some are devoted to specific topics, some to general surveys. Some reviews are enumerative, listing all significant articles in a given subject; others are selective, including only what they think worthwhile. Yet others are evaluative, judging the state of progress in the subject field. Some journals are published in series, each covering a complete subject field year, or covering specific fields through several years.
|
||||
Unlike original research articles, review articles tend to be solicited or "peer-invited" submissions, often planned years in advance, which may themselves go through a peer-review process once received. They are typically relied upon by students beginning a study in a given field, or for current awareness of those already in the field.
|
||||
|
||||
=== Book reviews ===
|
||||
|
||||
Reviews of scholarly books are checks upon the research books published by scholars; unlike articles, book reviews tend to be solicited. Journals typically have a separate book review editor determining which new books to review and by whom. If an outside scholar accepts the book review editor's request for a book review, he or she generally receives a free copy of the book from the journal in exchange for a timely review. Publishers send books to book review editors in the hope that their books will be reviewed. The length and depth of research book reviews vary considerably from journal to journal, as does the extent of textbook and trade book review.
|
||||
|
||||
== Prestige and ranking ==
|
||||
|
||||
An academic journal's prestige is established over time, and can reflect many factors, some but not all of which are expressible quantitatively. In many fields, a formal or informal hierarchy of academic journals exists; the most prestigious journal in a field tends to be the most selective in terms of the articles it will select for publication, and usually will also have the highest impact factor. In some countries, journal rankings can be utilized for funding decisions and even evaluation of individual researchers, although they are poorly suited for that purpose.
|
||||
In each academic discipline, some journals receive a high number of submissions and opt to restrict how many they publish, keeping the acceptance rate low. Neither size nor prestige is a guarantee of reliability.
|
||||
In the natural sciences and in the social sciences, the impact factor is an established proxy, measuring the number of later articles citing articles already published in the journal. There are other quantitative measures of prestige, such as the overall number of citations, how quickly articles are cited, and the average "half-life" of articles. Clarivate Analytics' Journal Citation Reports, which, among other features, computes an impact factor for academic journals, draws its data from the Science Citation Index Expanded (for natural science journals) and from the Social Sciences Citation Index (for social science journals). Several other metrics are also used, including the SCImago Journal Rank, CiteScore, Eigenfactor, and Altmetrics.
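The text does not spell out how an impact factor is calculated. As a rough illustration, the sketch below uses the commonly described two-year definition: citations received in a given year to items the journal published in the two preceding years, divided by the number of citable items published in those two years. The function name, the dictionary layout, and all figures are hypothetical and chosen only for the example; this is not Clarivate's implementation.

```python
from typing import Mapping


def two_year_impact_factor(
    citations_by_pub_year: Mapping[int, int],
    citable_items_by_pub_year: Mapping[int, int],
    year: int,
) -> float:
    """Two-year impact factor for `year`.

    citations_by_pub_year[y]: citations received during `year` to items
        the journal published in year y.
    citable_items_by_pub_year[y]: citable items the journal published in year y.
    """
    window = (year - 1, year - 2)
    citations = sum(citations_by_pub_year.get(y, 0) for y in window)
    items = sum(citable_items_by_pub_year.get(y, 0) for y in window)
    if items == 0:
        raise ValueError("no citable items in the two-year window")
    return citations / items


# Hypothetical journal: all numbers are invented for illustration.
citations_in_2024 = {2023: 300, 2022: 500, 2021: 900}
citable_items = {2023: 150, 2022: 250, 2021: 200}

if __name__ == "__main__":
    # (300 + 500) / (150 + 250) = 2.0; the 2021 entries fall outside the window.
    print(two_year_impact_factor(citations_in_2024, citable_items, 2024))
```

Because the result is an average over a journal's recent output, it says little about any single article, which is one reason the text notes that such rankings are poorly suited to evaluating individual researchers.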
|
||||
In the Anglo-American humanities, there is no tradition (as there is in the sciences) of giving impact factors that could be used in establishing a journal's prestige. Recent moves have been made by the European Science Foundation (ESF) to change the situation, resulting in the publication of preliminary lists for the ranking of academic journals in the humanities. These rankings have been severely criticized, notably by British journals in the history and sociology of science, which published a joint editorial entitled "Journals under Threat". Though the editorial did not prevent ESF and some national organizations from proposing journal rankings, it largely prevented their use as evaluation tools.
|
||||
In some disciplines such as knowledge management/intellectual capital, the lack of a well-established journal ranking system is perceived by academics as "a major obstacle on the way to tenure, promotion and achievement recognition". Conversely, a significant number of scientists and organizations consider the pursuit of impact factor calculations as inimical to the goals of science, and have signed the San Francisco Declaration on Research Assessment to limit its use.
|
||||
Three categories of techniques have developed to assess journal quality and create journal rankings:
|
||||
|
||||
stated preference;
|
||||
revealed preference; and
|
||||
publication power approaches
|
||||
|
||||
== Costs ==
|
||||
26
data/en.wikipedia.org/wiki/Academic_journal-3.md
Normal file
@@ -0,0 +1,26 @@
|
||||
---
|
||||
title: "Academic journal"
|
||||
chunk: 4/5
|
||||
source: "https://en.wikipedia.org/wiki/Academic_journal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:46.563930+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Many academic journals are subsidized by universities or professional organizations, and do not exist to make a profit. They often accept advertising and charge authors page and image fees to pay for production costs. On the other hand, some journals are produced by commercial publishers who do make a profit by charging subscriptions to individuals and libraries. They may also sell all of their journals in discipline-specific collections or a variety of other packages. Many scientists and librarians have long protested these costs, especially as they see these payments going to large for-profit publishing houses. To allow their researchers online access to journals, many universities purchase site licenses, permitting access from anywhere in the university, and, with appropriate authorization, by university-affiliated users at home or elsewhere. These may be much more expensive than a print subscription. Despite the transition to electronic publishing, the costs of site licenses continue to rise relative to universities' budgets. This is known as the serials crisis.
|
||||
Journal editors tend to have other professional responsibilities, most often as teaching professors. In the case of the largest journals, there are paid staff assisting in the editing. The production of the journals is almost always done by publisher-paid staff. Humanities and social science academic journals are usually subsidized by universities or professional organizations.
|
||||
Traditional scientific journals require a paid subscription to access published articles.
|
||||
The cost and value proposition of subscription to academic journals is being continuously re-assessed by institutions worldwide. In the context of the big deal cancellations by several library systems in the world, data analysis tools like Unpaywall Journals are used by libraries to estimate the specific cost and value of the various options: libraries can avoid subscriptions for materials already served by instant open access via open archives like PubMed Central.
|
||||
Concerns about cost and open access have led to the creation of free-access journals such as the Public Library of Science (PLoS) family and partly open or reduced-cost journals such as the Journal of High Energy Physics. However, professional editors still have to be paid, and PLoS still relies heavily on donations from foundations to cover the majority of its operating costs; smaller journals do not often have access to such resources. Open access journals may charge authors a fee for review or publication, rather than charging readers a fee for access.
|
||||
|
||||
== Reproducibility and replicability ==
|
||||
For scientific journals, reproducibility and replicability of scientific results are core concepts: other scientists should be able to check the results by repeating the work under the conditions described in the paper, or under similar or deliberately changed conditions of measurement, and obtain similar results. While the ability to reproduce the results based only on details included in the article is expected, verification of reproducibility by a third party is not generally required for publication. The reproducibility of results presented in an article is therefore judged implicitly by the quality of the procedures reported and agreement with the data provided. However, some journals in the field of chemistry such as Inorganic Syntheses and Organic Syntheses require independent reproduction of the results presented as part of the review process.
|
||||
The inability of independent researchers to reproduce published results is widespread, with 70% of researchers reporting failure to reproduce another scientist's results, including more than half who report failing to reproduce their own experiments. Sources of irreproducibility vary, including publication of falsified or misrepresented data and poor detailing of procedures.
|
||||
|
||||
== Copyright ==
|
||||
|
||||
Traditionally, the author of an article was required to transfer the copyright to the journal publisher. Publishers claimed this was necessary in order to protect authors' rights, and to coordinate permissions for reprints or other use. However, many authors, especially those active in the open access movement, found this unsatisfactory, and have used their influence to effect a gradual move towards a license to publish instead. Under such a system, the publisher has permission to edit, print, and distribute the article commercially, but the authors retain the other rights themselves.
|
||||
Even when journals retain the copyright to an article, most allow their authors certain rights. These rights usually include the ability to reuse parts of the paper in the author's future work and to distribute a limited number of copies. In the print format, such copies are called reprints; in the electronic format, they are called postprints. Some publishers, for example the American Physical Society, also grant the author the right to post and update the article on the author's or employer's website and on free e-print servers, to grant permission to others to use or reuse figures, and even to reprint the article as long as no fee is charged. The rise of open access journals, in which the author retains the copyright but must pay a publication charge, such as the Public Library of Science family of journals, is another recent response to copyright concerns.
|
||||
|
||||
== New developments ==
|
||||
55
data/en.wikipedia.org/wiki/Academic_journal-4.md
Normal file
@@ -0,0 +1,55 @@
|
||||
---
|
||||
title: "Academic journal"
|
||||
chunk: 5/5
|
||||
source: "https://en.wikipedia.org/wiki/Academic_journal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:46.563930+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The Internet has revolutionized the production of, and access to, academic journals, with their contents available online via services subscribed to by academic libraries. Individual articles are subject-indexed in databases such as Google Scholar. Some of the smallest, most specialized journals are prepared in-house, by an academic department, and published only online – this has sometimes been in the blog format, though some, like the open access journal Internet Archaeology, use the medium to embed searchable datasets, 3D models, and interactive mapping.
|
||||
Currently, there is a movement in higher education encouraging open access, either via self archiving, whereby the author deposits a paper in a disciplinary or institutional repository where it can be searched for and read, or via publishing it in a free open access journal, which does not charge for subscriptions, being either subsidized or financed by a publication fee. Given the goal of sharing scientific research to speed advances, open access has affected science journals more than humanities journals. Commercial publishers are experimenting with open access models, but are trying to protect their subscription revenues.
|
||||
Academic journals have been criticized for persistent gender, geographic, and other representation imbalances on editorial boards.
|
||||
|
||||
=== Predatory and junk journals ===
|
||||
|
||||
The much lower entry cost of online publishing has also raised concerns about an increase in the publication of "junk" journals with lower publishing standards. These journals, often with names chosen to resemble those of well-established publications, solicit articles via e-mail and then charge the author to publish an article, often with no sign of actual review. Jeffrey Beall, a research librarian at the University of Colorado, has compiled a list of what he considers to be "potential, possible, or probable predatory scholarly open-access publishers"; the list numbered over 300 journals as of April 2013, but he estimates that there may be thousands. The OMICS Publishing Group, which publishes a number of the journals on this list, threatened to sue Beall in 2013, and Beall stopped publishing the list in 2017, citing pressure from his university. In 2019, a US judge fined OMICS $50 million as the result of an FTC lawsuit.
|
||||
Some academic journals use the registered report format, which aims to counteract issues such as data dredging and hypothesizing after the results are known. For example, Nature Human Behaviour has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them". The European Journal of Personality defines this format: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes."
|
||||
|
||||
=== Electronic journals ===
|
||||
|
||||
Some journals are born digital in that they are solely published on the web and in a digital format. Most electronic journals, however, originated as print journals that later added an electronic version while still maintaining a print component; others eventually became electronic-only.
|
||||
An e-journal closely resembles a print journal in structure: there is a table of contents which lists the articles, and many electronic journals still use a volume/issue model, although some titles now publish on a continuous basis. Online journal articles are a specialized form of electronic document: they have the purpose of providing material for academic research and study, and they are formatted approximately like journal articles in traditional printed journals. Often, a journal article will be available for download in two formats: PDF and HTML, although other electronic file types are often supported for supplementary material. New tools such as JATS and Utopia Documents provide a 'bridge' to the 'web-versions' in that they connect the content in PDF versions directly to the World Wide Web via hyperlinks that are created 'on-the-fly'. The PDF version of an article is usually seen as the version of record, but the matter is subject to some debate. Articles are indexed in bibliographic databases as well as by search engines. E-journals allow new types of content to be included in journals, for example, video material, or the data sets on which research has been based.
|
||||
With the growth and development of the Internet, there has been a growth in the number of new digital-only journals. A subset of these journals exists as Open Access titles, meaning that they are free to access for all, and have Creative Commons licences which permit the reproduction of content in different ways. High quality open access journals are listed in the Directory of Open Access Journals. Most, however, continue to exist as subscription journals, for which libraries, organisations and individuals purchase access.
|
||||
Benefits of electronic publishing include easy availability of supplementary materials (data, graphics and video), lower cost, and availability to more people, especially scientists from non-developed countries. Hence, research results from more developed nations are becoming more accessible to scientists from non-developed countries.
|
||||
|
||||
== Lists ==
|
||||
|
||||
Databases providing detailed information about journals:
|
||||
Ulrich's Global Serials Directory
|
||||
Directory of Periodicals by Modern Language Association
|
||||
JournalSeek by Genamics
|
||||
Web of Science
|
||||
Scopus
|
||||
WorldCat
|
||||
Journal hosting sites that also provide lists. Some sites evaluate journals, providing information such as how long a journal takes to review articles and what types of articles it publishes:
|
||||
Project MUSE
|
||||
JSTOR
|
||||
ScienceDirect
|
||||
Informaworld
|
||||
|
||||
== See also ==
|
||||
|
||||
== Explanatory notes ==
|
||||
|
||||
== References ==
|
||||
|
||||
== Further reading ==
|
||||
Bakkalbasi, N; Bauer, K; Glover, J; Wang, L (2006). "Three options for citation tracking: Google Scholar, Scopus and Web of Science". Biomedical Digital Libraries. 3: 7. doi:10.1186/1742-5581-3-7. PMC 1533854. PMID 16805916.
|
||||
Waller, A.C. (2001). Editorial Peer Review: Its Strengths and Weaknesses. ASIST monograph series. Information Today. ISBN 978-1-57387-100-6.
|
||||
Ware, Mark; Mabe, Michael (2015). The STM Report: An overview of scientific and scholarly journal publishing (PDF) (4th ed.). International Association of Scientific, Technical and Medical Publishers.
|
||||
|
||||
== External links ==
|
||||
|
||||
Journal Collection at the Internet Archive
|
||||
39
data/en.wikipedia.org/wiki/Ads.txt-0.md
Normal file
@ -0,0 +1,39 @@
|
||||
---
|
||||
title: "Ads.txt"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Ads.txt"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:31.690614+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
ads.txt (Authorized Digital Sellers) is an initiative from IAB Technology Laboratory. It specifies a text file that companies can host on their web servers, listing the other companies authorized to sell their products or services. This is designed to allow online buyers to check the validity of the sellers from whom they buy, for the purposes of internet fraud prevention.
|
||||
|
||||
|
||||
== State of adoption ==
|
||||
By November 2017, more than 44% of publishers had ads.txt files. More than 90,000 sites were using ads.txt, up from 3,500 in September 2017, according to Pixalate. Among the top 1,000 sites that sold programmatic ads, 57 percent had ads.txt files, compared to 16 percent in September, per Pixalate.
|
||||
Latest adoption data per FirstImpression.io's Ads.txt Industry Dashboard:
|
||||
|
||||
Google has been an active proponent of ads.txt. From the end of October 2017 Google Display & Video 360 only buys inventory from sources identified as authorized sellers in a publisher's ads.txt file, when a file is available. ads.txt may become a requirement for Display & Video 360.
|
||||
|
||||
|
||||
== File format ==
|
||||
The IAB's ads.txt specification dictates the formatting of ads.txt files, which can contain three types of record: data records, variables and comments. An ads.txt file can include any number of records, each placed on its own line.
|
||||
Since the ads.txt file format must be adhered to, a range of validation, management and collaboration tools have become available to help ensure ads.txt files are created correctly.
|
||||
The original specification, v1.0, was issued in 2017, with version 1.0.1 making small changes based on community feedback later that year. In 2019, version 1.0.2 made further small amendments and recommended using a placeholder record to indicate the intent of an empty ads.txt file: placeholder.example.com, placeholder, DIRECT, placeholder
|
||||
In 2021, version 1.0.3 added the inventorypartnerdomain directive, with version 1.1 adding managerdomain and ownerdomain in 2022.
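The three record types described above can be illustrated with a small parser. This is a minimal sketch, not the IAB reference implementation. It assumes that comments start with `#`, that variables take the form `name=value`, and that data records are comma-separated fields in the order advertising-system domain, publisher account ID, relationship type, and an optional certification authority ID. The names `DataRecord`, `Variable` and `parse_ads_txt` are hypothetical, and lowercasing variable names is a simplification chosen for this example.

```python
from typing import List, NamedTuple, Optional, Union


class DataRecord(NamedTuple):
    """One comma-separated data record (field order is an assumption)."""
    domain: str
    account_id: str
    relationship: str
    cert_authority_id: Optional[str]


class Variable(NamedTuple):
    """One name=value variable record."""
    name: str
    value: str


def parse_ads_txt(text: str) -> List[Union[DataRecord, Variable]]:
    """Parse ads.txt content into data records and variables.

    Comments and blank lines are dropped; lines that fit neither
    record shape are skipped rather than raising an error.
    """
    records: List[Union[DataRecord, Variable]] = []
    for raw_line in text.splitlines():
        # Everything after '#' is treated as a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # A variable has '=' before any comma appears.
        if "=" in line and "," not in line.split("=", 1)[0]:
            name, value = line.split("=", 1)
            records.append(Variable(name.strip().lower(), value.strip()))
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            continue  # malformed line: too few fields for a data record
        cert = fields[3] if len(fields) > 3 and fields[3] else None
        records.append(
            DataRecord(fields[0].lower(), fields[1], fields[2].upper(), cert)
        )
    return records


sample = """# Example file (illustrative values only)
placeholder.example.com, placeholder, DIRECT, placeholder
ownerdomain=example.com  # directive added in version 1.1
broken line
"""

for record in parse_ads_txt(sample):
    print(record)
```

The sample reuses the placeholder record quoted above together with an `ownerdomain` variable. A production validator would also need to check field values against the specification, which this sketch does not attempt.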
|
||||
|
||||
|
||||
== See also ==
|
||||
Online advertising
|
||||
robots.txt
|
||||
security.txt
|
||||
trust.txt
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== External links ==
|
||||
Official website
|
||||
@ -0,0 +1,70 @@
|
||||
---
|
||||
title: "African Peer Review Mechanism"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/African_Peer_Review_Mechanism"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:47.737196+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The African Peer Review Mechanism (APRM) is a mutually agreed instrument voluntarily acceded to by the member states of the African Union (AU) as a self-monitoring mechanism. The APRM was launched on 9 March 2003 by the NEPAD Heads of State and Government Implementation Committee (HSGIC) in Abuja, Nigeria (NEPAD/HSGIC/03-2003/APRM/MOU (9 March 2003), Assembly Decision 198 (XI), Decision 527 (XXIII) and Decision Ext/Assembly/AU/Dec.1-4(XI)).
|
||||
The APRM is an African-owned and African-led platform for self-assessment, peer-learning, and experience-sharing in democracy and good governance, in full respect for democratic principles, human rights, the rule of law, and the acceleration of political, social and economic integration in Africa.
|
||||
|
||||
== The Mandate ==
|
||||
The mandate of the APRM is to encourage conformity with political, economic and corporate governance values, codes and standards among African countries, and with the objectives of socio-economic development, as well as to ensure monitoring and evaluation of AU Agenda 2063 and SDGs 2030.
|
||||
|
||||
The mandate of the APRM is to ensure that policies and practices of participating Member States conform to the agreed political, economic and corporate governance values, codes and standards contained in the African Union Declaration on Democracy, Political, Economic and Corporate Governance. As a voluntary self-monitoring instrument, APRM fosters the adoption of policies, standards and practices that lead to political stability, high economic growth, sustainable development and accelerated regional and continental economic integration through sharing of experiences and best practices, including identifying deficiencies and assessing the needs for capacity building.
|
||||
|
||||
== Expanded Mandate ==
|
||||
In 2018, during the 28th Assembly of Heads of State and Government of the AU, the APR Forum of Heads of State and Government decided to extend the APRM's mandate. This expansion includes the tracking and oversight of key governance initiatives across the continent.
|
||||
Furthermore, the AU Assembly expanded the APRM's responsibilities to encompass monitoring the implementation of the African Union's Agenda 2063 and the United Nations Sustainable Development Goals (SDGs) as part of Agenda 2030. This broadened mandate aims to enhance the APRM's role in promoting governance, development, and accountability in African nations.
|
||||
|
||||
Tracking governance-related aspects of Agenda 2063 and UN SDGs 2030
|
||||
Biennial Africa Governance Report completed in AGR2019, AGR2021, AGR2023 on UCG. AGR2025 (tbc)
|
||||
National Governance Reporting
|
||||
Support to Member States in the area of International Credit Rating Agencies
|
||||
|
||||
== Africa's self-assessment for good governance ==
|
||||
Member countries within the APRM undertake self-monitoring in all aspects of their governance and socio-economic development. African Union (AU) stakeholders participate in the self-assessment of all branches of government – executive, legislative and judicial – as well as the private sector, civil society and the media. The APRM Review Process gives member states a space for national dialogue on governance and socio-economic indicators and an opportunity to build consensus on the way forward.
|
||||
|
||||
== Four types of country reviews ==
|
||||
1. Base Review – carried out immediately after a country becomes a member of the APRM
|
||||
2. Periodic Review – carried out every four years
|
||||
3. Targeted Review – requested by the member country itself outside the framework of mandated reviews
|
||||
4. Crisis Review – commissioned by the APR Forum when there are early signs of a pending political or economic crisis
|
||||
|
||||
== Five Thematic Areas ==
|
||||
1. Democracy and Political Governance (DPG)
|
||||
2. Economic Governance and Management (EGM)
|
||||
3. Corporate Governance (CG)
|
||||
4. Broad-based Sustainable Socio-economic Development (SED)
|
||||
5. State Resilience to Shocks and Disasters
|
||||
The APRM Principles that underpin APRM reviews include
|
||||
(i) national ownership and leadership;
|
||||
(ii) inclusive participation;
|
||||
(iii) technical competence and
|
||||
(iv) freedom from political manipulation.
|
||||
|
||||
== The five stages of a peer review ==
|
||||
|
||||
=== 1. Consultation ===
|
||||
The APR Secretariat and the Country under review consult on the process overview and terms of the Memorandum of Understanding (MoU). The Country under review creates a Focal Point to liaise with the Secretariat and provide it with relevant laws, treaty ratifications, budgets and development plans. The Secretariat prepares a background assessment document. At the same time, the Country under review independently completes the APR Self-Assessment Questionnaire, gathers inputs from civil society and drafts a paper outlining the nation's issues and a National Programme of Action (NPoA) with clear steps and deadlines on how it plans to conform to APRM codes and standards, the African Union Charter, and UN obligations. The Country Review Team that is set up writes a report outlining issues to be focused on during the review mission.
|
||||
=== 2. The review mission ===
|
||||
The review mission visits the Country under review and conducts broad-based consultations with government, officials, political parties, parliamentarians, and representatives of civil society organisations (e.g. media, academia, trade unions, professional bodies), and the private sector. The mission typically lasts two-and-a-half to three weeks.
|
||||
=== 3. Draft report ===
|
||||
The APR Country Review Team drafts a report on the Country under review.
|
||||
=== 4. The peer review ===
|
||||
The peer review takes place at the level of the APR Forum, using the APR Panel's report on the team's findings as a basis. The APR Forum discusses these recommendations with the Reviewed Country's leadership.
|
||||
=== 5. Final report ===
|
||||
Within six months after the peer review, the published Country Review Report must be tabled in sub-regional institutions (Pan-African Parliament, African Commission on Human and Peoples' Rights, AU Peace and Security Council, Economic, Social and Cultural Council of the African Union [AU-ECOSOCC]). The report is then made publicly available.
|
||||
|
||||
== The second generation review ==
|
||||
The objective of the APRM Second Generation Review is to assess progress made in Governance and Socio-economic Development in Member States in the period since the Base Review. The specific objectives are to:
|
||||
|
||||
reinvigorate, rationalize and institutionalize the APRM in governance reforms within Member States;
|
||||
appraise to what extent the National Programme of Action (NPoA) has been implemented and its continued relevance, on the basis of which a new NPoA with a few key actions will be proposed;
|
||||
facilitate the development of a second NPoA with greater focus and based only on key actions; and
|
||||
make the APRM Review process more relevant to citizens' needs, more cost-effective and in tune with the Agenda 2063 priorities and goals.
|
||||
@ -0,0 +1,75 @@
|
||||
---
|
||||
title: "African Peer Review Mechanism"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/African_Peer_Review_Mechanism"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:47.737196+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
== What happens after the country review ==
|
||||
The National Programme of Action (NPoA) is divided into short-term, medium-term and long-term goals and is continuously monitored by the National Governance Commission/Governing Council, or a smaller body of state and non-state representatives. Progress Reports on implementation are presented annually to the APR Forum. The APR Secretariat follows up on commitments made, holds regional workshops to share best practices identified in the reviews, and offers technical support to fulfill APRM plans.
|
||||
|
||||
== APRM Structures ==
|
||||
|
||||
=== APR FORUM ===
|
||||
(Committee of Participating Heads of State and Government)
|
||||
Highest decision-making authority.
|
||||
=== APR PANEL ===
|
||||
(Panel of Eminent Persons)
|
||||
Oversees the review process to ensure its independence, professionalism and credibility, and reports to the Forum. The APR Panel is also responsible for selecting and appointing the Review Teams.
|
||||
=== COMMITTEE OF FOCAL POINTS ===
|
||||
Committee of representatives of Heads of State and Government
|
||||
Manages the budgetary process, resource mobilisation through Member States, Strategic and Development Partners, and the APRM Trust Fund and Audit.
|
||||
=== NATIONAL GOVERNING COUNCIL (NGC) ===
|
||||
The National Governance Commission/National Governing Council (NGC) is the body that oversees implementation of the APRM process at the Member State level. In addition to providing guidance in terms of policy direction, the NGC ensures professionalism, credibility and independence of the national APRM self-assessment and review processes. The NGC is composed of key stakeholder groups from government, civil society and the private sector, in line with the APRM principle of broad-based participation.
|
||||
|
||||
=== MANAGEMENT OF THE APRM CONTINENTAL SECRETARIAT ===
|
||||
The APRM Secretariat is currently managed by H.E. Ambassador Marie-Antoinette Rose Quatre, Chief Executive Officer (CEO). The continental structure works in collaboration with the National Focal Points and the National Commissions / National Governing Councils.
|
||||
=== APRM SECRETARIAT ===
|
||||
Provides technical, coordinating and administrative support services. It must have sufficient capacity for the analytical work that underpins the peer review process.
|
||||
|
||||
== Membership of the APRM ==
|
||||
Membership of the APRM is voluntary and open to all African Union (AU) countries. Accession begins with an expression of interest in membership followed by the signing of a Memorandum of Understanding (MoU) between the country and the APR Forum.
|
||||
As of 2024, the African Peer Review Mechanism (APRM) comprises 44 member states, with the Central African Republic (CAR) acceding during the 33rd APR Forum on February 6, 2024. Among these members, 26 countries have completed their first-generation peer reviews, 5 have undergone second-generation reviews, and 12 have participated in targeted peer reviews.
|
||||
|
||||
== Strategic partners ==
|
||||
The APRM has entered into special support agreements with partner institutions designated by the Forum as Strategic Partners. These are: the African Capacity Building Foundation (ACBF); the African Development Bank (AfDB); the Mo Ibrahim Foundation; the United Nations Economic Commission for Africa (UNECA); the Office of the Special Advisor on Africa (OSAA); and the United Nations Development Programme (UNDP) Regional Bureau for Africa.
|
||||
|
||||
== See also ==
|
||||
New Partnership for Africa's Development
|
||||
African Union
|
||||
|
||||
== Bibliography ==
|
||||
|
||||
=== APRM documents ===
|
||||
APRM Base document The APRM base document
|
||||
Memorandum of Understanding on the APRM The Memorandum of Understanding (MOU)
|
||||
Guidelines Guidelines for Countries to prepare for and participate in the APRM
|
||||
Organisation and Processes
|
||||
The objectives, standards, criteria and indicators for the APRM. Retrieved 27 May 2006.
|
||||
The Africa Governance Report 2019
|
||||
|
||||
=== Critiques and studies of the APRM process ===
|
||||
An Analysis of the Implementation of the African Peer Review Mechanism in Ghana, Kenya and Mauritius Grant Masterson, EISA, February 2005
|
||||
NEPAD’s APRM: A Progress Report, Practical Limitations and Challenges Paper by Ayesha Kajee on the APRM, 2004, retrieved 27 May 2006.
|
||||
Becoming my brother's keeper eAfrica October 2003 by Ross Herbert, retrieved 27 May 2006.
|
||||
The APRM process in Kenya: A pathway to a new state? Report by OSIEA/AfriMAP, April 2007
|
||||
Critical review of the African Peer Review Mechanism Process in Rwanda Report by Ligue des Droits de la Personne dans la Région des Grands Lacs (LDGL), January 2007
|
||||
Between Hope and Scepticism: Civil Society and the African Peer Review Mechanism, Partnership Africa Canada, October 2005
|
||||
Strategies for promoting effective stakeholder participation in the African Peer Review Mechanism, UNECA, 2005
|
||||
Inadequately Self-Critical: Rwanda’s Self-Assessment for the African Peer Review Mechanism Eduard Jordaan, African Affairs, April 2006
|
||||
Effective Stakeholder Participation in the APRM Process for the Promotion of Democratic Governance: A Case Study of Ghana, Eric Opoku, UNDP, December 2006
|
||||
Ghana and the APRM: A Critical Assessment, Adotey Bing-Pappoe, AfriMAP & OSIWA, June 2007
|
||||
The African Peer Review Mechanism in Mauritius: Lessons from Phase I Sheila Bunwaree, AfriMAP & OSISA, August 2007
|
||||
Further analysis of the African Peer Review Mechanism and the ECA/OECD-DAC Mutual Review of Development Effectiveness in the context of NEPAD Report to the UN High Level Task Force on the Implementation of the Right to Development, January 2008
|
||||
The African Peer Review Mechanism: Lessons from the Pioneers, Ross Herbert and Steven Gruzd, South African Institute of International Affairs, March 2008
|
||||
Civil Society Participation in Uganda’s APRM Process Juliet Nakato Odoi, SAIIA, June 2008
|
||||
Assessing South Africa’s APRM: An NGO Perspective, Nick Hutchings, Mukelani Dimba, and Alison Tilley, SAIIA, June 2008
|
||||
Addressing the African Peer Review Mechanism's Programmes of Action, Faten Aggad, SAIIA, June 2008
|
||||
Understanding APRM Research: Planning, Process and Politics. A Practical Handbook for Peer Review Research, George Katito, SAIIA, June 2008
|
||||
The African Peer Review Process in Nigeria, Adele Jinadu, AfriMAP, August 2008
|
||||
Benin and the African Peer Review Mechanism: Consolidating Democratic Achievements Gilles Badet, AfriMAP, August 2008
|
||||
Making the News: Why the APRM Didn't, Brendan Boyle, SAIIA, September 2008
|
||||
Media Freedom, Transparency and Governance, Raymond Louw, SAIIA, September 2008
|
||||
42
data/en.wikipedia.org/wiki/African_Theological_Studies-0.md
Normal file
@ -0,0 +1,42 @@
|
||||
---
|
||||
title: "African Theological Studies"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/African_Theological_Studies"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:48.901352+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
African Theological Studies (Études Théologiques Africaines) is a book series published by Peter Lang Verlagsgruppe under ISSN 2196-0615. As of 2021, it includes a total of 22 volumes written in English and French. Some editors have changed since the series was founded in 2013, but the University of Würzburg professor Gerhard Droesser was a founding editor and still serves as an editor. The co-editors are currently Esther Hornung and Frédéric Fungula Kwilu. Chibueze Clement Udeani has served as the editor of some volumes. Many of the monographs were originally submitted as dissertations at the Catholic Theology Faculty of the University of Würzburg. According to the title pages of each volume, the series is peer-reviewed.
|
||||
|
||||
== Scope of the Series ==
|
||||
The editors of the series seek to promote "the characteristics of African theology and its independence". Many of the authors are from Nigeria; almost all of them are from Africa. Bernhard Dinkelaker's monograph was a successful book in the series, since it was a European reception of the work of the African theologian Kwame Bediako, who is important in Africa but not well known in Europe and North America. The book was well received by reviewer Kees van Dam.
|
||||
|
||||
== Criticism of the Series' Academic Integrity ==
|
||||
|
||||
=== Misleading Book Titles ===
|
||||
In 2014, Blaise Okachibe Okpanachi published his Würzburg dissertation on missionary history in modern Nigeria in the first volume of a subseries of ATS devoted to church history. Edmund Hogan stated in his book review published in Catholic Historical Review that: "The problem with this book, however, is its title. Nigerian-Vatican Diplomatic Relations purports to be a study of diplomatic relations between Nigeria and the Vatican for the period 1884–1950. This does not reflect well on the general editorship of the series, for in the 300 pages of this book only twenty-nine pages have any sort of relevance to this title (pp. 191–212, 221–29). Moreover, in these few pages, there are serious problems."
|
||||
|
||||
=== Reliance on Secondary Sources that are not Peer-Reviewed or Vetted ===
|
||||
Chimaka's 2014 monograph (volume 6 in the series) is devoted to Nigerian education and nation-building.
|
||||
Text on p. 11 about artisans, leaders and wives in Aristotelian and Platonic notions of education is taken from p. 10 of Rao's History of Education. Rao's book was published in New Delhi in 2005 and is not to be found in a single online catalogue for German-speaking Europe (Chimaka submitted his dissertation at Würzburg). Any number of authoritative histories of education would have been available in European libraries.
|
||||
Much of p. 17 is plagiarized as well, such as the following sentence, taken verbatim from a 2010 source: "Education encompasses both the teaching and learning of knowledge, proper conduct, and technical competency. It thus focuses on the cultivation of skills, trades or professions, as well as mental, moral and aesthetic development."
|
||||
|
||||
=== Plagiarizing Academic and Non-Academic Sources ===
|
||||
While Nwaigwe's 2013 study of canonical law and marriage in Nigeria (volume 4 in the series) references some academic publications, it does not always give credit where necessary. On p. 90, almost all of the third paragraph is copied verbatim from Glenn Olsen's study of the history of marriage. In another section, the passage "civil divorce cannot dissolve the marriage bond, and a divorced person cannot enter into a new valid marriage while the first spouse is still alive" (p. 98) is taken verbatim from a Liberian source that he does not cite. If made aware of the source, many would question why such introductory facts are being taken from a Liberian newsletter. The second paragraph on p. 127 is taken verbatim from Gerard Sheehy's practical guide to canon law, but without attributing Sheehy.
|
||||
The Nigerian doctoral candidate also copies from inappropriate sources. At the end of the second paragraph on p. 317, there is verbatim text overlap from a handbook for marriage preparation issued for parish use by the Archdiocese of Newark. Such texts are not usually considered valid sources for research on the doctoral level.
|
||||
Author published the 20th monograph in the series in 2020. He teaches canon law. The book on participative structures in the AMECEA plagiarizes from a book by Makarios Tillyrides, the Eastern Orthodox Archbishop of Nairobi. He uses the following sentence on pp. 46–47 without attributing its author: "It is especially the concern of the local church and entrusted to the guide of the local bishop to coordinate the commitment to evangelization, by gathering the faithful together, confirming them in the Faith through the priests and supporting them in the fulfillment of their respective tasks." It was originally published on p. 389 of Tillyrides's 2011 book.
|
||||
|
||||
=== Lack of English Proficiency ===
|
||||
|
||||
==== Volume 4, Marriage Preparation in the Igbo Tradition ====
|
||||
Nwaigwe's book on marriage includes a section on the Nigerian fattening ceremony and describes the bride "before she finally leaved her home" (p. 232). Another paragraph begins as follows: "Also such theme like sexuality has to be made clear to them that sex is intimately connected with our personal identity" (p. 299).
|
||||
|
||||
==== Volume 8, Religious Education ====
|
||||
Sister Chizurum Ugbor's dissertation, Living the Future in Dialogue, was accepted by the University of Leuven and printed as the eighth volume in the series in 2015. The following examples are evidence of insufficient linguistic skills:
|
||||
|
||||
"Political leaders, as well, use youths to get their political powers, which often lead to bloody shedding of human beings." (p. 24)
|
||||
"Education in Nigeria is as old as the country itself and is considered as a formative process that enables the citizens to meet up their challenging and changing situations to live and fit in a dynamic society." (p. 33)
|
||||
|
||||
=== Erroneous Bibliography ===
|
||||
21
data/en.wikipedia.org/wiki/African_Theological_Studies-1.md
Normal file
@ -0,0 +1,21 @@
|
||||
---
|
||||
title: "African Theological Studies"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/African_Theological_Studies"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:48.901352+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
==== Volume 2, Solving the Problems of Poverty ====
|
||||
Kwazu's book on African poverty was the second monograph in the series. His bibliography lists "Hillman, K., Wörterbuch der Soziologie, Kröner, Stuttgart, Bd. 410, 1994" on p. 252, but Hillmann should be spelled with two n's and the dictionary does not contain 410 volumes, as Kwazu's entry suggests, but is volume 410 in the series called "Kröners Taschenausgabe". The 1994 printing is, furthermore, the fourth, revised edition.
|
||||
The bibliography cites E. Leuninger's "Take (sic) to Catholic Workers' Movement, (KAB) Diocese of Limburg, Germany, 1 April 2001" (p. 255). The talk (not "take") was certainly held in German, and Kwazu's bibliography lists several German titles, but here, Kwazu gives no specific title, making the reference inaccessible.
|
||||
|
||||
==== Volume 13, Women's Emancipation in Africa ====
|
||||
Mutume's monograph on women's rights (volume 13 in the series) cites the Vatican's 1976 "Declaration on the Question of Admission of Women to the Ministerial Priesthood" and lists it twice in the bibliography, once under A for Austin Flannery, and once under F. Both are incorrect, since it was signed by Franjo Cardinal Seper, then Prefect of the Congregation for the Doctrine of the Faith. The Dominican priest Austin Flannery was the translator; listing him as its author is confusing.
|
||||
The bibliography also mistakenly cites Tamale's 2003 article "Out of the Closet" as having been published in 2005. Furthermore, Mutume identifies no page numbers for the Tamale publication. Since it has been republished several times, giving page numbers would be one way of helping readers confirm they have the right version at hand.
|
||||
Mutume cites from Together, a journal published by World Vision International, which has often been criticized for unethical fundraising techniques and corruption. The bibliography lists the article under the author's first name (Graeme), thus suggesting that Mutume was unable to distinguish it from the last (Irvine). In any case, magazines distributed by World Vision to their donors are unusual sources for doctoral dissertations.
|
||||
Blind references to publications unseen by Mutume appear regularly in confusing dead ends. One bibliographical entry cites "Schrijvers, quoted in Stromquist", but gives no title or page numbers for the Schrijvers publication (p. 219). The same page in the bibliography lists two entries under T because they begin with "The". This page of the bibliography ends with the following puzzling entry: "Women; Definitions. UNICEF. Retrieved on 20th February 2012". Without citing the URL or the bibliographical container, this entry is useless.
|
||||
|
||||
== References ==
|
||||
31
data/en.wikipedia.org/wiki/Al-Ruhawi-0.md
Normal file
@ -0,0 +1,31 @@
|
||||
---
|
||||
title: "Al-Ruhawi"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Al-Ruhawi"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:16.940835+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Ishāq bin Ali al-Rohawi (Arabic: إسحاق بن علي الرهاوي) was a 9th-century author of the first medical ethics book in Arabic medicine.
|
||||
His Ethics of the Physician contains the first documented description of a peer review process. It set out the fundamentals of a process in which practising Arab physicians were reviewed by their peers on the medical treatment they provided to their patients. If a patient's treatment received negative peer reviews, the physician could face a lawsuit.
|
||||
Al-Ruhawi was probably an Iranian-Nestorian from Al-Ruha, modern-day Şanlıurfa, often known as Urfa. Not much is known about Al-Ruhawi, including his beliefs. The author Levey stated in his book that Al-Ruhawi was a Nestorian Christian, while Johann Christoph Bürgel wrote that Al-Ruhawi was Jewish. However, neither author gave evidence to support his claim; both based it on their own interpretations.
|
||||
Although these two views conflict, there is evidence that Al-Ruhawi held Islamic beliefs. Al-Ruhawi began his book with the words "In the name of Allah," which is a traditional opening for Muslims. Additionally, Al-Ruhawi uses the word "Allah" hundreds of times in his work, which is associated with Islam. Further support comes from another part of his writing, where he hints at the five pillars of Islam. In the introduction to the first chapter of one of his books, Al-Ruhawi writes the following: "The first thing in which a physician must believe is that all in this world has only one able creator who performs all deeds wilfully. The second article of faith in which a physician must believe is that he have credence in the great Allah with a firm affection, and is devoted to Him with all his reason, soul, and free will. The third faith which a physician must possess is that Allah sent His messengers to mankind to teach them what is good since the mind alone is not sufficient. Thus, without His apostles, it is not enough for man...... In all these matters, the physician must truly believe since all the holy books and ancients affirm them. No believer can deny them."
|
||||
|
||||
|
||||
== Works ==
|
||||
Al-Ruhawi's most celebrated work is Adab al-Tabib ("Practical Ethics of the Physician" or "Practical Medical Deontology"), the earliest surviving Arabic work on medical ethics. In it, Al-Ruhawi regarded physicians as "guardians of souls and bodies". The work was based on Hippocrates and Galen and consisted of twenty chapters on various topics related to medical ethics.
|
||||
Al-Ruhawi also wrote the following books:
|
||||
|
||||
A compilation of first four books of Alexandrian Canons
|
||||
Introduction to Dialectics for Beginners
|
||||
On Examination of Physicians
|
||||
He compiled two works based on Galen.
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== External links ==
|
||||
lib.eshia.ir (in Persian)
|
||||
@ -0,0 +1,85 @@
|
||||
---
|
||||
title: "American Institute of Biological Sciences"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/American_Institute_of_Biological_Sciences"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:50.014686+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The American Institute of Biological Sciences (AIBS) is a nonprofit scientific public charitable organization. The organization's mission is to promote the use of science to inform decision-making and advance biology for the benefit of science and society.
|
||||
|
||||
|
||||
== Overview ==
|
||||
AIBS serves as a society of societies. It has 98 member organizations and is headquartered in Herndon, VA. Its staff work to achieve its mission by publishing the peer-reviewed journal BioScience, providing peer review and advisory support services for funding organizations, providing professional development for scientists and students, advocating for science policy and educating the public about biology. AIBS works with like-minded organizations, funding agencies, and nonprofit and for-profit entities to promote the use of science to inform decision-making.
|
||||
AIBS is governed by an esteemed Board of Directors and a Council of representatives of its member organizations.
|
||||
|
||||
|
||||
== Background and history ==
|
||||
AIBS was established in 1947 as a part of the National Academy of Sciences. The overarching goal was to unify the individuals and organizations that collectively represent the biological sciences, so that the community could address matters of common concern. In the 1950s, AIBS became an independent, member-governed, nonprofit 501(c)(3) public charity scientific organization. In 1962, the National Science Foundation audited the books of AIBS and froze its assets, eventually leading to AIBS having to pay back nearly $500,000 and institute a program of austerity. By the 1970s it appeared to have largely recovered.
|
||||
|
||||
|
||||
== Strategic priorities ==
|
||||
AIBS works toward overarching outcomes through three strategic priorities:
|
||||
|
||||
Scientific Peer Advisory and Review Services for research proposals and programs sponsored by funding organizations, including the federal government, state agencies, private research foundations, and other non-government organizations, together with educating the community about the science of peer review.
|
||||
Publications and Communications, including reliable reports, analyses, and the peer-reviewed journal BioScience, which is a forum for integrating the life sciences and educating the public about biological sciences.
|
||||
Community Programs that advance the field and profession of biology while promoting and providing leadership, with a particular emphasis on public policy and advocacy, education and professional development, as well as public awareness of science.
|
||||
|
||||
|
||||
== Core activities ==
|
||||
The American Institute of Biological Sciences promotes the use of science to inform decision-making and advances biology for the benefit of science and society through:
|
||||
|
||||
Assessment. Coordinating and facilitating expert merit-based evaluation of research proposals, ongoing research programs, and completed research efforts, as well as analyzing merit-based peer-review activities sponsored by a diverse group of research-funding organizations.
|
||||
Advocacy. Advancing biology through programs, partnerships, and advocacy with a particular emphasis on public policy that benefits and promotes the interests and diversity of the life sciences community and informed decision-making.
|
||||
Training. Providing educational training programs to scientists that enhance their professional skills and opportunities, as well as improve their ability to engage and inform diverse audiences.
|
||||
Communication. Convening communities, publishing, and sharing scientific information (through a variety of channels) with the breadth of the scientific community, decision-makers, and the public.
|
||||
|
||||
|
||||
=== Assessment ===
|
||||
|
||||
|
||||
==== Scientific Peer Advisory and Review Services (SPARS) ====
|
||||
The Scientific Peer Advisory and Review Services (SPARS) department is focused on coordinating, facilitating, and promoting independent, equitable evaluation processes to inform decision-making.
|
||||
AIBS SPARS partners with a diverse group of organizations that are focused on a wide variety of research and funding efforts. AIBS SPARS facilitates review/advisory support services to ensure thorough and structured assessment processes are performed by vetted, qualified experts.
|
||||
AIBS SPARS staff are experts in review processes and advisory services, which are often provided for proposed research grants and progress reports, ongoing funded research programs/portfolios, retrospective impact analyses, and for advisory boards and committees.
|
||||
|
||||
|
||||
==== Science of Peer Review ====
|
||||
The empirical basis upon which peer review rests is limited. To build up this literature base, AIBS SPARS staff perform in-house research and meta-analyses on the peer review process and share results with the scientific community through publications and presentations.
|
||||
|
||||
|
||||
=== Advocacy ===
|
||||
|
||||
|
||||
==== Public Policy Office ====
|
||||
The Public Policy Office works to educate policymakers about the importance of investing in biology and advocates for policies that serve the needs of researchers, educators, and other biological science professionals. Issues addressed by the office include but are not limited to: funding for the biological sciences; funding for research infrastructure, including scientific collections and field stations; strengthening science education policy; investing in the scientific workforce; and promoting scientific integrity and transparency.
|
||||
|
||||
|
||||
=== Training ===
|
||||
With over 2,600 people trained, AIBS provides training programs to scientists that enhance their professional skills and opportunities, as well as improve their ability to engage and inform diverse audiences. Workshops include:
|
||||
|
||||
Employment Acquisition Skills Boot Camp for Scientists: Helping scientists hone and practice the skills needed to secure employment
|
||||
Writing for Impact and Influence: Helping scientists and graduate students hone their written communication skills to increase the impact and influence of their message
|
||||
Enabling Interdisciplinary and Team Science: Providing participants with the knowledge and skills required to become productive and effective members of scientific teams
|
||||
Communication Bootcamp: Helping the science community learn how to engage and inform decision-makers and become effective and engaged communicators
|
||||
|
||||
|
||||
=== Communications ===
|
||||
|
||||
|
||||
==== Faces of Biology Photo Contest ====
|
||||
AIBS conducts a yearly photo contest that helps communicate science through imagery. Photographs entered into the contest must depict a person, such as a scientist, researcher, collections curator, technician, or student, engaging in biological research. The depicted research may occur outside, in a lab, with a natural history collection, on a computer, in a classroom, or elsewhere.
|
||||
|
||||
|
||||
==== Publications ====
|
||||
AIBS publishes BioScience, a peer-reviewed monthly journal with content written and edited for accessibility to researchers, educators, and students. The journal is heavily cited, with a 2021 Impact Factor of 11.687. AIBS also publishes BioScience Talks, a companion podcast to the journal.
|
||||
BioScience includes articles about research findings and techniques, advances in biology education, professionally written feature articles about new developments in biology, discussions of professional issues, book reviews, news about AIBS, and a policy column (Washington Watch). Roundtables, forums, and viewpoint articles offer the perspectives of opinion leaders and invite further commentary.
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== See also ==
|
||||
BioScience
|
||||
41
data/en.wikipedia.org/wiki/Augmented_Analytics-0.md
Normal file
@ -0,0 +1,41 @@
|
||||
---
|
||||
title: "Augmented Analytics"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Augmented_Analytics"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:32.830056+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Augmented Analytics is an approach to data analytics that uses machine learning and natural language processing to automate analysis processes normally done by a specialist or data scientist. The term was introduced in 2017 by Rita Sallam, Cindi Howson, and Carlie Idoine in a Gartner research paper.
|
||||
Augmented analytics is based on business intelligence and analytics. In the graph extraction step, data from different sources are investigated.
|
||||
|
||||
|
||||
== Defining Augmented Analytics ==
|
||||
Machine Learning – a systematic computing method that uses algorithms to sift through data to identify relationships, trends, and patterns. It is a process that allows algorithms to dynamically learn from data instead of having a set base of programmed rules.
|
||||
Natural language generation (NLG) – a software capability that takes unstructured data and translates it into readable, plain-English language.
|
||||
Automating Insights – using machine learning algorithms to automate data analysis processes.
|
||||
Natural Language Query – enabling users to query data using business terms that are either typed into a search box or spoken.
|
||||
|
||||
|
||||
== Data Democratization ==
|
||||
Data democratization is the practice of democratizing data access in order to relieve data congestion and remove any sense of data "gatekeepers". This process must be implemented alongside a method for users to make sense of the data. It is used in the hope of speeding up company decision-making and uncovering opportunities hidden in data.
|
||||
There are three aspects to democratising data:
|
||||
|
||||
Data Parameterisation and Characterisation.
|
||||
Data Decentralisation using an OS of blockchain and DLT technologies, as well as an independently governed secure data exchange to enable trust.
|
||||
Consent Market-driven Data Monetisation.
|
||||
When it comes to connecting assets, there are two features that will accelerate the adoption and usage of data democratisation: decentralized identity management and business data object monetization of data ownership. Decentralized identity management enables multiple individuals and organizations to identify, authenticate, and authorize participants and organizations, enabling them to access services, data or systems across multiple networks, organizations, environments, and use cases. It empowers users and enables a personalized, self-service digital onboarding system so that users can self-authenticate without relying on a central administration function to process their information. Simultaneously, decentralized identity management ensures the user is authorized to perform actions subject to the system’s policies based on their attributes (role, department, organization, etc.) and/or physical location.
|
||||
|
||||
|
||||
== Use cases ==
|
||||
Agriculture – Farmers collect data on water use, soil temperature, moisture content and crop growth. Augmented analytics can be used to make sense of this data and possibly identify insights that the user can then use to make business decisions.
|
||||
Smart Cities – Many cities across the United States, known as smart cities, collect large amounts of data on a daily basis. Augmented analytics can be used to simplify this data in order to increase effectiveness in city management (transportation, natural disasters, etc.).
|
||||
Analytic Dashboards – Augmented analytics has the ability to take large data sets and create highly interactive and informative analytical dashboards that assist in many organizational decisions.
|
||||
Augmented Data Discovery – Using an augmented analytics process can assist organizations in automatically finding, visualizing and narrating potentially important data correlations and trends.
|
||||
Data Preparation – Augmented analytics platforms have the ability to take large amounts of data and organize and "clean" the data in order for it to be usable for future analyses.
|
||||
Business – Businesses collect large amounts of data daily. Examples of the data collected in business operations include sales data, consumer behavior data, and distribution data. An augmented analytics platform provides access to analysis of this data, which could be used in making business decisions.
|
||||
|
||||
|
||||
== References ==
|
||||
36
data/en.wikipedia.org/wiki/BADIR-0.md
Normal file
@ -0,0 +1,36 @@
|
||||
---
|
||||
title: "BADIR"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/BADIR"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:34.122536+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
BADIR (pronounced /ˈbaːdɪr/) is a structured data science and data analytics process designed to enhance data-driven decision-making within organizations by addressing both analytical output and usefulness to management. It was developed by Piyanka Jain and Puneet Sharma and first published in the 2014 book “Behind Every Good Decision”.
|
||||
|
||||
|
||||
== Overview ==
|
||||
The BADIR Framework employs a hypothesis-driven approach that involves understanding business objectives and challenges, analysis planning before acquiring and ensuring the quality of relevant data, applying analytics to derive insights and the potential impact on the business challenge, developing actionable recommendations aligned with strategic goals, and implementing decisions while monitoring outcomes and making necessary adjustments.
|
||||
|
||||
|
||||
== Key Components of BADIR ==
|
||||
The BADIR Framework consists of interrelated components essential for decision-making. Its main assertion is that if data analytics does not drive business impact, then it is just statistics, not analytics. The acronym in the framework stands for the following five steps:
|
||||
|
||||
B = Business Question: The first step in the framework is to define the real business question. Market trends, customer feedback, competitor actions, etc. are data sources commonly used by businesses. However, these data alone do not help teams to make decisions.
|
||||
A = Analysis Plan: Once the fundamental business question is defined, the next phase involves generating and testing hypotheses to explore potential strategic directions. This phase ensures that decisions are not based on assumptions but on validated data insights.
|
||||
D = Data Collection: With a clear plan and methodology in place, the next step is to gather the required data. This stage is crucial as the quality of data directly affects the reliability of insights derived from it.
|
||||
I = Insights Derivation: This phase involves analyzing the data to identify patterns, trends, and outcomes that either support or challenge the proposed hypotheses.
|
||||
R = Recommendations: The final step is to synthesize the insights gained through data analysis into actionable recommendations that can guide strategic business decisions.
|
||||
|
||||
|
||||
== Origin ==
|
||||
Piyanka Jain, a data science expert with experience at Adobe and PayPal, developed the BADIR framework to address the need for more structured data analytics methodologies. Observing that data-driven decision-making was often seen as complex and inaccessible, Jain designed BADIR as a five-step process to streamline and simplify data analysis in business environments.
|
||||
|
||||
|
||||
== Impact and Recognition ==
|
||||
According to Jain, the framework's focus on defining the business question early optimizes up to 80% of a business leader's workflow by reducing guesswork and concentrating efforts on strategic goals. IMA's Strategic Finance magazine compared the framework to the scientific process in terms of its approach to data analysis, and noted its ability to improve cross-departmental efficiency. The framework has been recommended for a variety of business leaders, from customer service managers to records and information management professionals.
|
||||
|
||||
|
||||
== References ==
|
||||
35
data/en.wikipedia.org/wiki/Base_effect-0.md
Normal file
@ -0,0 +1,35 @@
|
||||
---
|
||||
title: "Base effect"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Base_effect"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:35.330547+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The base effect is a mathematical effect that originates from the fact that a given percentage of a reference value is not the same, in absolute terms, as the same percentage of a much larger or smaller reference value. For example, 1% of a GDP of US$1 million is not equal to 1% of a GDP of US$1 billion in terms of absolute difference.
|
||||
The reference value is commonly called a base year in economics.
|
||||
A low base effect is the tendency of an absolute change from a low initial amount to be translated into a larger percentage change, while a high base effect would be the tendency of an absolute change from a high initial amount to be translated into a smaller percentage change.
|
||||
Because of the base effect, percentages in time series analysis can be misleading, in particular when percentages are compounded annually over a period of many years.
|
||||
A high base effect can mislead because of decreasing percentages even while the underlying absolute difference is increasing over time as much as the measured value or population increases over time. Also, a low base effect can mislead because of increasing percentages even while the absolute difference is decreasing over time as much as the measured value or population decreases over time.
|
||||
A base effect is closely related to a base year, which serves as a reference point to normalize rates of change, similar to the function of a denominator in comparisons.
|
||||
When inflation is measured with a price index, different formulas (for example, Paasche versus Laspeyres price indices) produce different results because of the base effect. Because of the problems that arise from volatility in prices, in particular with headline inflation, core inflation is used as an additional indicator of the development of inflation.
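As a rough illustration of how the two index formulas can diverge, the sketch below computes a Laspeyres index (weighted by base-period quantities) and a Paasche index (weighted by current-period quantities) for the same price change. The two-good basket, the prices, and the quantities are hypothetical values chosen only for the example.

```python
def laspeyres(p0, pt, q0):
    """Price index weighted by base-period quantities q0 (base = 100)."""
    cost_now = sum(p * q for p, q in zip(pt, q0))
    cost_base = sum(p * q for p, q in zip(p0, q0))
    return 100.0 * cost_now / cost_base


def paasche(p0, pt, qt):
    """Price index weighted by current-period quantities qt (base = 100)."""
    cost_now = sum(p * q for p, q in zip(pt, qt))
    cost_base = sum(p * q for p, q in zip(p0, qt))
    return 100.0 * cost_now / cost_base


# Hypothetical basket: the first good doubles in price and buyers shift
# consumption toward the second good, whose price is unchanged.
base_prices, base_quantities = [1.0, 2.0], [10, 5]
current_prices, current_quantities = [2.0, 2.0], [5, 10]

laspeyres_index = laspeyres(base_prices, current_prices, base_quantities)
paasche_index = paasche(base_prices, current_prices, current_quantities)
print(laspeyres_index, paasche_index)  # 150.0 120.0
```

With these assumed numbers, the same price movement reads as a 50% increase under one formula and a 20% increase under the other, purely because of which period supplies the weights.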
|
||||
A lot of different methodologies can be used to correct for the base effect in computations of economic growth, and are often used in financial market analysis.
|
||||
A base effect relates to inflation in the corresponding period of the previous year. If the inflation rate was low in the corresponding period of the previous year, even a small rise in the price index will arithmetically give a high rate of inflation now. On the other hand, if the price index had risen at a high rate in the corresponding period of the previous year and recorded a high inflation rate, a similar absolute increase in the price index will show a lower inflation rate now.
|
||||
An example of the base effect:
|
||||
|
||||
The Price Index is 100, 150, and 200 in each of three consecutive periods, called 1, 2, and 3, respectively. The increase of 50 from period 1 to period 2 gives a percentage increase of 50%, but the increase from period 2 to period 3, despite being the same as the previous increase in absolute terms, gives a percentage increase of only 33.33%. This is due to the relatively large difference in the bases on which the percentages are calculated (100 vs 150).
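The arithmetic in this example can be checked with a short sketch. The helper name `percent_changes` is an illustrative choice, not something taken from the article.

```python
def percent_changes(index_values):
    """Return the period-over-period percentage change for each step."""
    return [
        (current - previous) / previous * 100.0
        for previous, current in zip(index_values, index_values[1:])
    ]


price_index = [100, 150, 200]  # periods 1, 2 and 3 from the example
changes = percent_changes(price_index)

for period, change in enumerate(changes, start=2):
    print(f"Period {period - 1} -> {period}: +{change:.2f}%")
# Period 1 -> 2: +50.00%
# Period 2 -> 3: +33.33%
```

The absolute increase is 50 in both steps; only the base used as the denominator changes.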
|
||||
|
||||
|
||||
== See also ==
|
||||
Compound annual growth rate
|
||||
Path dependence
|
||||
Volatility tax
|
||||
Low base effect
|
||||
Relative change
|
||||
Percentage#Percentage increase and decrease
|
||||
Time series analysis
|
||||
|
||||
|
||||
== References ==
|
||||
31
data/en.wikipedia.org/wiki/Big_data-0.md
Normal file
@ -0,0 +1,31 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 1/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing software. Data with many entries (rows) offers greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
|
||||
Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. The analysis of big data that have only volume, velocity, and variety can pose challenges in sampling. A fourth concept, veracity, which refers to the level of reliability of data, was thus added. Without sufficient investment in expertise to ensure big data veracity, the volume and variety of data can produce costs and risks that exceed an organization's capacity to create and capture value from big data.
|
||||
Current usage of the term big data tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from big data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem."
|
||||
Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on". Scientists, business executives, medical practitioners, advertising and governments alike regularly meet difficulties with large datasets in areas including Internet searches, fintech, healthcare analytics, geographic information systems, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research.
|
||||
The size and number of available data sets have grown rapidly as data is collected by devices such as mobile devices, cheap and numerous information-sensing Internet of things devices, aerial (remote sensing) equipment, software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.17×2^60 bytes) of data are generated. An IDC report predicted that the global data volume would grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. According to IDC, global spending on big data and business analytics (BDA) solutions is estimated to reach $215.7 billion in 2021. Statista reported that the global big data market is forecasted to grow to $103 billion by 2027. In 2011, McKinsey & Company reported that if US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data. And users of services enabled by personal-location data could capture $600 billion in consumer surplus. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.
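To put two of these growth figures on a common footing, the following sketch converts them into implied annual growth rates. It assumes smooth compound growth, which is a simplification, and the variable names are illustrative.

```python
# Storage capacity doubling every 40 months, expressed as a yearly rate.
doubling_months = 40
annual_rate_from_doubling = 2 ** (12 / doubling_months) - 1

# IDC forecast: 4.4 ZB in 2013 growing to 44 ZB in 2020 (7 years).
start_zb, end_zb, years = 4.4, 44.0, 2020 - 2013
implied_cagr = (end_zb / start_zb) ** (1 / years) - 1

print(f"Doubling every 40 months: about {annual_rate_from_doubling:.1%} per year")
print(f"4.4 ZB to 44 ZB in 7 years: about {implied_cagr:.1%} per year")
```

Under these assumptions the forecast corresponds to a noticeably faster yearly rate than the historical doubling of storage capacity.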
|
||||
Relational database management systems and desktop statistical software packages used to visualize data often have difficulty processing and analyzing big data. The processing and analysis of big data may require "massively parallel software running on tens, hundreds, or even thousands of servers". What qualifies as "big data" varies depending on the capabilities of those analyzing it and their tools. Furthermore, expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
|
||||
|
||||
== Definition ==
|
||||
The term big data has been in use since the 1990s, with some giving credit to John Mashey for popularizing the term. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data. Big data "size" is a constantly moving target; as of 2012 ranging from a few dozen terabytes to many zettabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale. Variability is often included as an additional quality of big data.
|
||||
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd's relational model."
|
||||
In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly considered characteristics of big data appear consistently across all of the analyzed cases. For this reason, other studies identified the redefinition of power dynamics in knowledge discovery as the defining trait. Instead of focusing on the intrinsic characteristics of big data, this alternative perspective pushes forward a relational understanding of the object claiming that what matters is the way in which data is collected, stored, made available and analyzed.
|
||||
|
||||
=== Big data vs. business intelligence ===
|
||||
The growing maturity of the concept more starkly delineates the difference between "big data" and "business intelligence":
|
||||
|
||||
Business intelligence uses applied mathematics tools and descriptive statistics with data with high information density to measure things, detect trends, etc.
|
||||
Big data uses mathematical analysis, optimization, inductive statistics, and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors.
|
||||
|
||||
== Characteristics ==
|
||||
|
||||
Big data can be described by the following characteristics:
|
||||
43
data/en.wikipedia.org/wiki/Big_data-1.md
Normal file
@ -0,0 +1,43 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 2/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Volume
|
||||
The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not. The size of big data is usually larger than terabytes and petabytes.
|
||||
Variety
|
||||
The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies. Big data technologies evolved with the prime intention to capture, store, and process the semi-structured and unstructured (variety) data generated at high speed (velocity) and in huge size (volume). Later, these tools and technologies were also explored and used for handling structured data, though preferably for storage. Eventually, the processing of structured data was still kept as optional, either using big data or traditional RDBMSs. This helps in analyzing data towards effective usage of the hidden insights exposed from the data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion.
|
||||
Velocity
|
||||
The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.
|
||||
Veracity
|
||||
The truthfulness or reliability of the data, which refers to the data quality and the data value. Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it. The data quality of captured data can vary greatly, affecting an accurate analysis.
|
||||
Value
|
||||
The worth in information that can be achieved by the processing and analysis of large datasets. Value also can be measured by an assessment of the other qualities of big data. Value may also represent the profitability of information that is retrieved from the analysis of big data.
|
||||
Variability
|
||||
The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data.
|
||||
Other possible characteristics of big data are:
|
||||
|
||||
Exhaustive
|
||||
Whether the entire system (i.e., $n$ = all) is captured or recorded or not. Big data may or may not include all the available data from sources.
|
||||
Fine-grained and uniquely lexical
|
||||
Respectively, the proportion of specific data of each element per element collected and if the element and its characteristics are properly indexed or identified.
|
||||
Relational
|
||||
If the data collected contains common fields that would enable a conjoining, or meta-analysis, of different data sets.
|
||||
Extensional
|
||||
If new fields in each element of the data collected can be added or changed easily.
|
||||
Scalability
|
||||
If the size of the big data storage system can expand rapidly.
|
||||
34
data/en.wikipedia.org/wiki/Big_data-2.md
Normal file
@ -0,0 +1,34 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 3/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
== Architecture ==
|
||||
Big data repositories have existed in many forms, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s. For many years, WinterCorp published the largest database report.
|
||||
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves. Teradata installed the first petabyte-class RDBMS-based system in 2007. As of 2017, there are a few dozen petabyte-class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added semi-structured data types including XML, JSON, and Avro.
|
||||
In 2000, Seisint Inc. (now LexisNexis Risk Solutions) developed a C++-based distributed platform for data processing and querying known as the HPCC Systems platform. This system automatically partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across multiple commodity servers. Users can write data processing pipelines and queries in a declarative dataflow programming language called ECL. Data analysts working in ECL are not required to define data schemas upfront and can rather focus on the particular problem at hand, reshaping data in the best possible manner as they develop the solution. In 2004, LexisNexis acquired Seisint Inc. and their high-speed parallel processing platform and successfully used this platform to integrate the data systems of Choicepoint Inc. when they acquired that company in 2008. In 2011, the HPCC systems platform was open-sourced under the Apache v2.0 License.
|
||||
CERN and other physics experiments have collected big data sets for many decades, usually analyzed via high-throughput computing rather than the map-reduce architectures usually meant by the current "big data" movement.
|
||||
In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the "map" step). The results are then gathered and delivered (the "reduce" step). The framework was very successful, so others wanted to replicate the algorithm. An implementation of the MapReduce framework was therefore adopted by an Apache open-source project named "Hadoop". Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds in-memory processing and the ability to set up many operations (not just a map step followed by a reduce step).
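
The map and reduce steps described above can be illustrated with a minimal single-machine sketch. It is not Google's or Hadoop's implementation: the function names (`map_step`, `shuffle`, `reduce_step`, `map_reduce`) and the word-count task are assumptions chosen for illustration, and a thread pool stands in for the parallel nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: combine all values gathered for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks, workers=4):
    """Run the map step on each chunk in parallel, then gather and reduce."""
    # The thread pool is a stand-in for cluster nodes (illustrative only).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_step, chunks))
    groups = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical input: each string plays the role of one data partition.
    chunks = ["big data big clusters", "data lakes and big data"]
    print(map_reduce(chunks))
```

In a real deployment the chunks would live on different machines and the grouping step would move data over the network; the sketch only shows how the work is divided between the two steps.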
|
||||
MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled "Big Data Solution Offering". The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.
|
||||
Studies in 2012 showed that a multiple-layer architecture was one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end-user by using a front-end application server.
|
||||
The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.
|
||||
|
||||
== Technologies ==
|
||||
A 2011 McKinsey Global Institute report characterizes the main components and ecosystem of big data as follows:
|
||||
|
||||
Techniques for analyzing data, such as A/B testing, machine learning, and natural language processing
|
||||
Big data technologies, like business intelligence, cloud computing, and databases
|
||||
Visualization, such as charts, graphs, and other displays of the data
|
||||
Multidimensional big data can also be represented as OLAP data cubes or, mathematically, tensors. Array database systems have set out to provide storage and high-level query support on this data type.
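
As a rough illustration of the data cube idea, the sketch below stores a small cube as a mapping from coordinate tuples to a measure and then aggregates along chosen dimensions. The dimension names, the sample records, and the helper functions (`build_cube`, `roll_up`, `slice_cube`) are all hypothetical; an array database or OLAP engine would use far more efficient storage.

```python
from collections import defaultdict

# Hypothetical dimensions and fact records: (region, product, quarter, sales).
DIMENSIONS = ("region", "product", "quarter")
facts = [
    ("EU", "widget", "Q1", 10),
    ("EU", "widget", "Q2", 15),
    ("EU", "gadget", "Q1", 7),
    ("US", "widget", "Q1", 20),
    ("US", "gadget", "Q2", 5),
]


def build_cube(records):
    """Store the cube sparsely: one cell per coordinate tuple, summed measure."""
    cube = defaultdict(int)
    for *coords, measure in records:
        cube[tuple(coords)] += measure
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate over every dimension that is not named in `keep`."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for coords, measure in cube.items():
        result[tuple(coords[i] for i in indexes)] += measure
    return dict(result)


def slice_cube(cube, dimension, value):
    """Keep only the cells whose coordinate on `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {coords: m for coords, m in cube.items() if coords[i] == value}


if __name__ == "__main__":
    cube = build_cube(facts)
    print(roll_up(cube, ["region"]))
    print(slice_cube(cube, "quarter", "Q1"))
```

Viewed mathematically, the same cube is an order-3 tensor indexed by region, product, and quarter, and a roll-up is a sum over one or more of its modes.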
|
||||
Additional technologies being applied to big data include efficient tensor-based computation, such as multilinear subspace learning, massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed cache (e.g., burst buffer and Memcached), distributed databases, cloud and HPC-based infrastructure (applications, storage and computing resources), and the Internet. Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data.
|
||||
Some MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.
|
||||
DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called "Ayasdi".
|
||||
Practitioners of big data analytics are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, namely storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
|
||||
Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in direct-attached memory or disk is good—data on memory or disk at the other end of an FC SAN connection is not. The cost of an SAN at the scale needed for analytics applications is much higher than other storage techniques.
|
||||
|
||||
== Applications ==
|
||||
40
data/en.wikipedia.org/wiki/Big_data-3.md
Normal file
@ -0,0 +1,40 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 4/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Big data has increased the demand for information management specialists so much that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP, and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.
|
||||
Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007, and predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data, which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e., in the form of video and audio content).
|
||||
While many vendors offer off-the-shelf products for big data, experts promote the development of in-house custom-tailored systems if the company has sufficient technical capabilities.
|
||||
|
||||
=== Government ===
|
||||
|
||||
The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation, but comes with flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome. A common government organization that makes use of big data is the National Security Agency (NSA), which monitors the activities of the Internet constantly in search of potential patterns of suspicious or illegal activities its system may pick up.
|
||||
Civil registration and vital statistics (CRVS) collects all certificate statuses from birth to death. CRVS is a source of big data for governments.
|
||||
|
||||
=== International development ===
|
||||
Research on the effective usage of information and communication technologies for development (also known as "ICT4D") suggests that big data technology can make important contributions but also presents unique challenges to international development. Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management. Additionally, user-generated data offers new opportunities to give the unheard a voice. However, longstanding challenges for developing regions such as inadequate technological infrastructure and economic and human resource scarcity exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues. The challenge of "big data for development" is currently evolving toward the application of this data through machine learning, known as "artificial intelligence for development" (AI4D).
|
||||
|
||||
==== Benefits ====
|
||||
A major practical application of big data for development has been "fighting poverty with data". In 2015, Blumenstock and colleagues predicted poverty and wealth from mobile phone metadata, and in 2016 Jean and colleagues combined satellite imagery and machine learning to predict poverty. Using digital trace data to study the labor market and the digital economy in Latin America, Hilbert and colleagues argue that digital trace data has several benefits such as:
|
||||
|
||||
Thematic coverage: including areas that were previously difficult or impossible to measure
|
||||
Geographical coverage: providing sizable and comparable data for almost all countries, including many small countries that usually are not included in international inventories
|
||||
Level of detail: providing fine-grained data with many interrelated variables, and new aspects, like network connections
|
||||
Timeliness and timeseries: graphs can be produced within days of being collected
|
||||
|
||||
==== Challenges ====
|
||||
At the same time, working with digital trace data instead of traditional survey data does not eliminate the traditional challenges involved when working in the field of international quantitative analysis. Priorities change, but the basic discussions remain the same. Among the main challenges are:
|
||||
|
||||
Representativeness. While traditional development statistics is mainly concerned with the representativeness of random survey samples, digital trace data is never a random sample.
|
||||
Generalizability. While observational data always represents this source very well, it only represents what it represents, and nothing more. While it is tempting to generalize from specific observations of one platform to broader settings, this is often very deceptive.
|
||||
Harmonization. Digital trace data still requires international harmonization of indicators. It adds the challenge of so-called "data-fusion", the harmonization of different sources.
|
||||
Data overload. Analysts and institutions are not used to effectively deal with a large number of variables, which is efficiently done with interactive dashboards. Practitioners still lack a standard workflow that would allow researchers, users and policymakers to efficiently and effectively deal with data.
|
||||
|
||||
=== Finance ===
|
||||
Big data is being rapidly adopted in finance to 1) speed up processing and 2) deliver better, more informed inferences, both internally and to the clients of financial institutions. The financial applications of big data range from investing decisions and trading (processing volumes of available price data, limit order books, economic data and more, all at the same time) to portfolio management (optimizing over an increasingly large array of financial instruments, potentially selected from different asset classes), risk management (credit rating based on extended information), and any other aspect where the data inputs are large. Big data has also been a typical concept within the field of alternative financial services. Some of the major areas involve crowdfunding platforms and cryptocurrency exchanges.
|
||||
37
data/en.wikipedia.org/wiki/Big_data-4.md
Normal file
@ -0,0 +1,37 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 5/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Healthcare ===
|
||||
Big data analytics has been used in healthcare in providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms and patient registries. Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the added adoption of mHealth, eHealth and wearable technologies the volume of data will continue to increase. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of difficult-to-process data. There is now an even greater need for such environments to pay greater attention to data and information quality. "Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth." Human inspection at the big data scale is impossible and there is a desperate need in health services for intelligent tools for accuracy and believability control and for handling of missed information. While extensive information in healthcare is now electronic, it fits under the big data umbrella as most is unstructured and difficult to use. The use of big data in healthcare has raised significant ethical challenges ranging from risks for individual rights, privacy and autonomy, to transparency and trust.
|
||||
Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research. Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow up biological research and eventually clinical research.
|
||||
A related application sub-area within the healthcare field that relies heavily on big data is computer-aided diagnosis in medicine. For instance, for epilepsy monitoring it is customary to create 5 to 10 GB of data daily. Similarly, a single uncompressed image of breast tomosynthesis averages 450 MB of data. These are just a few of the many examples where computer-aided diagnosis uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome in order to reach the next level of performance.
|
||||
|
||||
=== Education ===
|
||||
A McKinsey Global Institute study found a shortage of 1.5 million highly trained data professionals and managers, and a number of universities, including the University of Tennessee and UC Berkeley, have created master's programs to meet this demand. Private boot camps have also developed programs to meet that demand, including paid programs like The Data Incubator or General Assembly. In the specific field of marketing, one of the problems stressed by Wedel and Kannan is that marketing has several sub domains (e.g., advertising, promotions, product development, branding) that all use different types of data.
|
||||
|
||||
=== Media ===
|
||||
To understand how the media uses big data, it is first necessary to provide some context into the mechanisms used in media processes. It has been suggested by Nick Couldry and Joseph Turow that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve, or convey, a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers, based on insights gleaned exclusively through various data-mining activities.
|
||||
|
||||
Targeting of consumers (for advertising by marketers)
|
||||
Data capture
|
||||
Data journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics.
|
||||
Channel 4, the British public-service television broadcaster, is a leader in the field of big data and data analysis.
|
||||
|
||||
=== Insurance ===
|
||||
Health insurance providers are collecting data on social "determinants of health" such as food and TV consumption, marital status, clothing size, and purchasing habits, from which they make predictions on health costs, in order to spot health issues in their clients. It is controversial whether these predictions are currently being used for pricing.
|
||||
|
||||
=== Internet of things (IoT) ===
|
||||
|
||||
Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device inter-connectivity. Such mappings have been used by the media industry, companies, and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical, manufacturing and transportation contexts.
|
||||
Kevin Ashton, the digital innovation expert who is credited with coining the term, defines the Internet of things in this quote: "If we had computers that knew everything there was to know about things—using data they gathered without any help from us—we would be able to track and count everything, and greatly reduce waste, loss, and cost. We would know when things needed replacing, repairing, or recalling, and whether they were fresh or past their best."
|
||||
|
||||
=== Information technology ===
|
||||
Especially since 2015, big data has come to prominence within business operations as a tool to help employees work more efficiently and streamline the collection and distribution of information technology (IT). The use of big data to resolve IT and data collection issues within an enterprise is called IT operations analytics (ITOA). By applying big data principles into the concepts of machine intelligence and deep computing, IT departments can predict potential issues and prevent them. ITOA businesses offer platforms for systems management that bring data silos together and generate insights from the whole of the system rather than from isolated pockets of data.
|
||||
53
data/en.wikipedia.org/wiki/Big_data-5.md
Normal file
@ -0,0 +1,53 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 6/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Survey science ===
|
||||
Compared to survey-based data collection, big data has low cost per data point, applies analysis techniques via machine learning and data mining, and includes diverse and new data sources, e.g., registers, social media, apps, and other forms of digital data. Since 2018, survey scientists have started to examine how big data and survey science can complement each other to allow researchers and practitioners to improve the production of statistics and its quality. There have been three Big Data Meets Survey Science (BigSurv) conferences, in 2018, 2020 (virtual), and 2023, and as of 2023 one conference forthcoming in 2025, a special issue in the Social Science Computer Review, a special issue in the Journal of the Royal Statistical Society, a special issue in EPJ Data Science, and a book called Big Data Meets Social Sciences edited by Craig Hill and five other Fellows of the American Statistical Association. In 2021, the founding members of BigSurv received the Warren J. Mitofsky Innovators Award from the American Association for Public Opinion Research.
|
||||
|
||||
=== Marketing ===
|
||||
Big data is notable in marketing due to the constant "datafication" of everyday consumers of the internet, in which all forms of data are tracked. The datafication of consumers can be defined as quantifying many or all human behaviors for the purpose of marketing. The increasingly digital world of rapid datafication makes this idea relevant to marketing because the amount of data constantly grows exponentially. It is predicted to increase from 44 to 163 zettabytes within the span of five years. The size of big data can often be difficult to navigate for marketers. As a result, adopters of big data may find themselves at a disadvantage. Algorithmic findings can be difficult to achieve with such large datasets. Big data in marketing is a highly lucrative tool that can be used by large corporations; its value stems from the possibility of predicting significant trends, interests, or statistical outcomes in a consumer-based manner.
|
||||
There are three significant factors in the use of big data in marketing:
|
||||
|
||||
Big data provides customer behavior pattern spotting for marketers, since all human actions are being quantified into readable numbers for marketers to analyze and use for their research. In addition, big data can also be seen as a customized product recommendation tool. Specifically, since big data is effective in analyzing customers' purchase behaviors and browsing patterns, this technology can assist companies in promoting specific personalized products to specific customers.
|
||||
Real-time market responsiveness is important for marketers because of the ability to shift marketing efforts and correct to current trends, which is helpful in maintaining relevance to consumers. This can supply corporations with the information necessary to predict the wants and needs of consumers in advance.
|
||||
Data-driven market ambidexterity is being highly fueled by big data. New models and algorithms are being developed to make significant predictions about certain economic and social situations.
|
||||
|
||||
== Case studies ==
|
||||
|
||||
=== Government ===
|
||||
|
||||
==== China ====
|
||||
The Integrated Joint Operations Platform (IJOP, 一体化联合作战平台) is used by the government to monitor the population, particularly Uyghurs. Biometrics, including DNA samples, are gathered through a program of free physicals.
|
||||
By 2020, China plans to give all its citizens a personal "social credit" score based on how they behave. The Social Credit System, now being piloted in a number of Chinese cities, is considered a form of mass surveillance which uses big data analysis technology.
|
||||
|
||||
==== India ====
|
||||
Big data analysis was tried out for the BJP to win the 2014 Indian General Election.
|
||||
The Indian government uses numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.
|
||||
|
||||
==== Israel ====
|
||||
Personalized diabetic treatments can be created through GlucoMe's big data solution.
|
||||
|
||||
==== United Kingdom ====
|
||||
Examples of uses of big data in public services:
|
||||
|
||||
Data on prescription drugs: by connecting origin, location and the time of each prescription, a research unit was able to exemplify and examine the considerable delay between the release of any given drug, and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient.
|
||||
Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as Meals on Wheels. The connection of data allowed the local authority to avoid any weather-related delay.
|
||||
|
||||
==== United States ====
|
||||
In 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments.
|
||||
Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.
|
||||
The United States Federal Government owns four of the ten most powerful supercomputers in the world.
|
||||
The Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes. This has posed security concerns regarding the anonymity of the data collected.
|
||||
|
||||
=== Retail ===
|
||||
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data—the equivalent of 167 times the information contained in all the books in the US Library of Congress.
|
||||
Windermere Real Estate uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
|
||||
FICO Card Detection System protects accounts worldwide.
|
||||
Omnichannel retailing leverages online big data to improve offline experiences.
|
||||
38
data/en.wikipedia.org/wiki/Big_data-6.md
Normal file
@ -0,0 +1,38 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 7/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Science ===
|
||||
The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 1,000 collisions of interest per second.
|
||||
As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
|
||||
If all sensor data were recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.
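
The unit conversions behind these figures can be checked with a short calculation. This is a sketch under stated assumptions: decimal (SI) units and a 365-day year are assumed, and the variable names are illustrative. With those assumptions, 150 million petabytes per year works out to roughly 411 exabytes per day, the same order of magnitude as the rounded figure of nearly 500 exabytes quoted above.

```python
# Assumption: decimal (SI) prefixes, i.e. 1 PB = 10**15 bytes, 1 EB = 10**18 bytes.
PB = 10**15
EB = 10**18
DAYS_PER_YEAR = 365  # assumption: non-leap year

# Figures taken from the text above.
unfiltered_pb_per_year = 150_000_000   # "150 million petabytes annual rate"
recorded_pb_per_year = 25              # "25 petabytes annual rate" (as of 2012)
quoted_eb_per_day = 500                # "nearly 500 exabytes per day"

# Convert the unfiltered annual rate to exabytes per day.
unfiltered_bytes_per_year = unfiltered_pb_per_year * PB
unfiltered_eb_per_day = unfiltered_bytes_per_year / DAYS_PER_YEAR / EB

# 500 exabytes expressed in bytes, to compare with "500 quintillion (5×10^20)".
quoted_bytes_per_day = quoted_eb_per_day * EB

# Share of the unfiltered stream that is actually recorded, in percent.
recorded_percent = recorded_pb_per_year / unfiltered_pb_per_year * 100

if __name__ == "__main__":
    print(f"Unfiltered flow: {unfiltered_eb_per_day:.0f} EB per day")
    print(f"500 EB in bytes: {quoted_bytes_per_day:.1e}")
    print(f"Recorded share:  {recorded_percent:.7f} %")
```

The recorded share computed this way is far below the "less than 0.001%" bound given in the text, so the two statements are consistent.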
|
||||
The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.
|
||||
When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days.
|
||||
Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day. The DNA sequencers have divided the sequencing cost by 10,000 in the last ten years, which is 100 times less expensive than the reduction in cost predicted by Moore's law.
|
||||
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
|
||||
Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any "friction points", or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to scale social experiments that would usually take years, instantly.
|
||||
23andMe's DNA database contains the genetic information of over 1,000,000 people worldwide. The company explores selling the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes if patients give their consent. Ahmad Hariri, professor of psychology and neuroscience at Duke University, who has been using 23andMe in his research since 2009, states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists. A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demands to access the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.
|
||||
Computational fluid dynamics (CFD) and hydrodynamic turbulence research generate massive data sets. The Johns Hopkins Turbulence Databases (JHTDB) contain over 350 terabytes of spatiotemporal fields from direct numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using "virtual sensors" with various access modes ranging from direct web-browser queries, through access via Matlab, Python, Fortran and C programs executing on clients' platforms, to cut-out services to download raw data. The data have been used in over 150 scientific publications.
|
||||
|
||||
=== Sports ===
|
||||
Big data can be used to improve training and understanding competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics.
|
||||
Future performance of players could be predicted as well. Thus, players' value and salary are determined by data collected throughout the season.
|
||||
In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency.
|
||||
Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, race teams use big data to try to predict in advance the time in which they will finish the race, based on simulations using data collected over the season.
|
||||
|
||||
=== Technology ===
|
||||
As of 2013, eBay.com uses two data warehouses at 7.5 PB and 40 PB as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
|
||||
Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
|
||||
Facebook handles 50 billion photos from its user base. As of June 2017, Facebook reached 2 billion monthly active users.
|
||||
Google was handling roughly 100 billion searches per month as of August 2012.
|
||||
A well-known example of the practice of big data is Amazon. Amazon uses data analytics to drive its recommendation system. A significant portion of its sales derives from the "recommended" section, which presents shoppers with personalized recommended items.
|
||||
|
||||
=== COVID-19 ===
|
||||
During the COVID-19 pandemic, big data was raised as a way to minimise the impact of the disease. Significant applications of big data included minimizing the spread of the virus, case identification and development of medical treatment.
|
||||
Governments used big data to track infected people to minimise spread. Early adopters included China, Taiwan, South Korea, and Israel.
|
||||
28
data/en.wikipedia.org/wiki/Big_data-7.md
Normal file
@ -0,0 +1,28 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 8/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
== Research activities ==
|
||||
Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach, who took part in Tackling the Challenges of Big Data at the MIT Computer Science and Artificial Intelligence Laboratory, and Amir Esmailpour of the UNH Research Group investigated the key features of big data as the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in an encrypted form at the cloud interface, providing the raw definitions and real-time examples within the technology. Moreover, they proposed an approach for identifying the encoding technique in order to advance towards an expedited search over encrypted text, leading to security enhancements in big data.
|
||||
In March 2012, The White House announced a national "Big Data Initiative" that consisted of six federal departments and agencies committing more than $200 million to big data research projects.
|
||||
The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over five years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems, from predicting traffic congestion to fighting cancer.
|
||||
The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over five years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the department's supercomputers.
|
||||
The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.
|
||||
The European Commission is funding the two-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, their next framework program.
|
||||
The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets.
|
||||
At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world.
|
||||
Computational social sciences – Anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences. Often these APIs are provided for free. Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviors and real-world economic indicators. The authors of the study examined Google query logs and computed the ratio of the volume of searches for the coming year (2011) to the volume of searches for the previous year (2009), which they call the "future orientation index". They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP.
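The future orientation index described above is a simple ratio, and the comparison with GDP can be sketched as a correlation. The following is a minimal illustration only: the country labels, search volumes, and GDP figures are invented placeholders rather than data from the study, and the function names are hypothetical.

```python
from math import sqrt


def future_orientation_index(searches_next_year, searches_previous_year):
    """Ratio of search volume for the coming year to that for the previous year."""
    return searches_next_year / searches_previous_year


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented placeholder values: (searches for "2011", searches for "2009").
SEARCH_VOLUMES = {"A": (150, 100), "B": (90, 100), "C": (120, 100)}
# Invented placeholder per capita GDP values.
GDP_PER_CAPITA = {"A": 45000, "B": 12000, "C": 30000}

if __name__ == "__main__":
    countries = sorted(SEARCH_VOLUMES)
    indices = [future_orientation_index(*SEARCH_VOLUMES[c]) for c in countries]
    gdps = [GDP_PER_CAPITA[c] for c in countries]
    for country, index in zip(countries, indices):
        print(country, round(index, 2))
    print("correlation with GDP:", round(pearson_correlation(indices, gdps), 3))
```

An index above 1 means more searches for the coming year than for the previous one. A correlation computed this way shows association only and says nothing about the direction of any causal link.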
|
||||
Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends. Their analysis of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports, suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.
|
||||
Big data sets come with algorithmic challenges that previously did not exist. Hence, some see a need to fundamentally change the ways in which data are processed.
|
||||
|
||||
=== Sampling big data ===
|
||||
A research question that is asked about big data sets is whether it is necessary to look at the full data to draw certain conclusions about the properties of the data or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. But sampling enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage, and controller data are available at short time intervals. To predict downtime, it may not be necessary to look at all the data; a sample may be sufficient. Big data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data. With large sets of data points, marketers are able to create and use more customized segments of consumers for more strategic targeting.
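The idea of estimating a property of the whole data set from a sample can be sketched as follows. This is a minimal illustration under simple assumptions: it uses simple random sampling without replacement, the "sensor readings" are synthetic, and the function name is hypothetical. The standard error shown ignores the finite population correction.

```python
import random
from math import sqrt
from statistics import mean, stdev


def estimate_mean_from_sample(population, sample_size, rng):
    """Estimate the population mean from a simple random sample.

    Returns (estimate, standard_error). Requires sample_size >= 2.
    """
    sample = rng.sample(population, sample_size)
    estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(sample_size)
    return estimate, standard_error


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a large set of vibration readings.
    readings = [rng.gauss(5.0, 1.0) for _ in range(100_000)]
    estimate, standard_error = estimate_mean_from_sample(readings, 1_000, rng)
    print("full-data mean:", mean(readings))
    print("sample estimate:", estimate, "+/-", standard_error)
```

Whether a sample is good enough depends on the question: averages are usually well estimated from modest samples, while rare events such as isolated faults may be missed entirely.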
|
||||
|
||||
== Critique ==
|
||||
Critiques of the big data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done. One approach to this criticism is the field of critical data studies.
|
||||
27
data/en.wikipedia.org/wiki/Big_data-8.md
Normal file
@ -0,0 +1,27 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 9/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Critiques of the big data paradigm ===
|
||||
"A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data." In their critique, Snijders, Matzat, and Reips point out that often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson's assertion that big data will spell the end of theory: focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts. Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analyzed, must be complemented by "big judgment", according to an article in the Harvard Business Review.
|
||||
Much in the same line, it has been pointed out that the decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is". Fed by a large amount of data on past experiences, algorithms can predict future development if the future is similar to the past. If the system's dynamics change in the future (if it is not a stationary process), the past can say little about the future. In order to make predictions in changing environments, it would be necessary to have a thorough understanding of the system's dynamics, which requires theory. As a response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge". Additionally, it has been suggested to combine big data approaches with computer simulations, such as agent-based models and complex systems. Agent-based models are getting increasingly better at predicting the outcome of social complexities of even unknown future scenarios through computer simulations that are based on a collection of mutually interdependent algorithms. Finally, the use of multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, has proven useful as an analytic approach that goes well beyond the bi-variate approaches (e.g. contingency tables) typically employed with smaller data sets.
|
||||
In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis. A new postulate is accepted now in biosciences: the information provided by the data in huge volumes (omics) without prior hypothesis is complementary and sometimes necessary to conventional approaches based on experimentation. In the massive approaches it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. The search logic is reversed and the limits of induction ("Glory of Science and Philosophy scandal", C. D. Broad, 1926) are to be considered.
|
||||
Privacy advocates are concerned about the threat to privacy represented by increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to conform practice to expectations of privacy. The misuse of big data in several cases by media, companies, and even the government has led to a loss of trust in almost every fundamental institution holding up society.
|
||||
Barocas and Nissenbaum argue that one way of protecting individual users is by being informed about the types of information being collected, with whom it is shared, under what constraints and for what purposes.
|
||||
|
||||
=== Critiques of the "V" model ===
|
||||
The "V" model of big data is a concern because it centers on computational scalability and neglects the perceptibility and understandability of information. This led to the framework of cognitive big data, which characterizes big data applications according to:
|
||||
|
||||
Data completeness: understanding of the non-obvious from data
|
||||
Data correlation, causation, and predictability: causality as a non-essential requirement for achieving predictability
|
||||
Explainability and interpretability: humans desire to understand and accept what they understand, where algorithms do not cope with this
|
||||
Level of automated decision-making: algorithms that support automated decision making and algorithmic self-learning
|
||||
|
||||
=== Critiques of novelty ===
|
||||
Large data sets have been analyzed by computing machines for well over a century, including the US census analytics performed by IBM's punch-card machines which computed statistics including means and variances of populations across the whole continent. In more recent decades, science experiments such as CERN have produced data on similar scales to current commercial "big data". However, science experiments have tended to analyze their data using specialized custom-built high-performance computing (super-computing) clusters and grids, rather than clouds of cheap commodity computers as in the current commercial wave, implying a difference in both culture and technology stack.
|
||||
39
data/en.wikipedia.org/wiki/Big_data-9.md
Normal file
@ -0,0 +1,39 @@
|
||||
---
|
||||
title: "Big data"
|
||||
chunk: 10/10
|
||||
source: "https://en.wikipedia.org/wiki/Big_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:36.506913+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Critiques of big data execution ===
|
||||
Ulf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research. Researcher danah boyd has raised concerns about the use of big data in science neglecting principles such as choosing a representative sample by being too concerned about handling the huge amounts of data. This approach may lead to results that have a bias in one way or another. Integration across heterogeneous data resources—some that might be considered big data and others not—presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.
|
||||
In the provocative article "Critical Questions for Big Data", the authors call big data a part of mythology: "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy". Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth". Recent developments in the BI domain, such as pro-active reporting, especially target improvements in the usability of big data through automated filtering of non-useful data and correlations. Big structures are full of spurious correlations, either because of non-causal coincidences (law of truly large numbers), solely the nature of big randomness (Ramsey theory), or the existence of non-included factors, so the hope of early experimenters to make large databases of numbers "speak for themselves" and revolutionize the scientific method is questioned. Catherine Tucker has pointed to "hype" around big data, writing "By itself, big data is unlikely to be valuable." The article explains: "The many contexts where data is cheap relative to the cost of retaining talent to process it, suggests that processing skills are more important than data itself in creating value for a firm."
|
||||
Big data analysis is often shallow compared to analysis of smaller data sets. In many big data projects, there is no large data analysis happening, but the challenge is the extract, transform, load part of data pre-processing.
|
||||
Big data is a buzzword and a "vague term", but at the same time an "obsession" with entrepreneurs, consultants, scientists, and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, Academy Awards and election predictions solely based on Twitter were more often off than on target.
|
||||
Big data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. In particular, data sources such as Twitter are not representative of the overall population, and results drawn from such sources may then lead to wrong conclusions. Google Translate, which is based on big data statistical analysis of text, does a good job at translating web pages. However, results from specialized domains may be dramatically skewed.
|
||||
On the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant.
|
||||
Ioannidis argued that "most published research findings are false" due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a large amount of scientific data, although not with big data technology), the likelihood of a "significant" result being false grows fast – even more so when only positive results are published.
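The multiple comparisons problem can be made concrete with a short calculation. The sketch below assumes independent tests in which every null hypothesis is true, so each p-value is uniformly distributed between 0 and 1. The function names are illustrative, and the Bonferroni correction is shown as one common remedy that the text above does not itself discuss.

```python
import random


def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false positive among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def count_false_positives(num_tests, alpha, rng):
    """Simulate p-values under true null hypotheses and count the 'significant' ones."""
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    rng = random.Random(42)
    alpha, num_tests = 0.05, 1000
    print("chance of at least one false positive:", family_wise_error_rate(alpha, num_tests))
    print("false positives, uncorrected:", count_false_positives(num_tests, alpha, rng))
    corrected = bonferroni_threshold(alpha, num_tests)
    print("false positives, Bonferroni:", count_false_positives(num_tests, corrected, rng))
```

With 100 independent tests at a threshold of 0.05, the chance of at least one spurious "significant" result already exceeds 99%, which is why large numbers of simultaneous hypotheses call for a correction of this kind.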
|
||||
Furthermore, big data analytics results are only as good as the model on which they are predicated. For example, big data was used in attempts to predict the results of the 2016 U.S. presidential election, with varying degrees of success.
|
||||
|
||||
=== Critiques of big data policing and surveillance ===
|
||||
Big data has been used in policing and surveillance by institutions like law enforcement and corporations (see: corporate surveillance and surveillance capitalism). Due to the less visible nature of data-based surveillance as compared to traditional methods of policing, objections to big data policing are less likely to arise. According to Sarah Brayne's Big Data Surveillance: The Case of Policing, big data policing can reproduce existing societal inequalities in three ways:
|
||||
|
||||
Placing people under increased surveillance by using the justification of a mathematical and therefore unbiased algorithm
|
||||
Increasing the scope and number of people that are subject to law enforcement tracking and exacerbating existing racial overrepresentation in the criminal justice system
|
||||
Encouraging members of society to abandon interactions with institutions that would create a digital trace, thus creating obstacles to social inclusion
|
||||
If these potential problems are not corrected or regulated, the effects of big data policing may continue to shape societal hierarchies. Brayne also notes that conscientious use of big data policing could prevent individual-level biases from becoming institutional biases.
|
||||
|
||||
== See also ==
|
||||
|
||||
== References ==
|
||||
|
||||
== Bibliography ==
|
||||
|
||||
== Further reading ==
|
||||
|
||||
== External links ==
|
||||
Media related to Big data at Wikimedia Commons
|
||||
The dictionary definition of big data at Wiktionary
|
||||
60
data/en.wikipedia.org/wiki/Causal_analysis-0.md
Normal file
@ -0,0 +1,60 @@
|
||||
---
|
||||
title: "Causal analysis"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Causal_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:30.564050+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Causal analysis is the field of experimental design and statistics pertaining to establishing cause and effect. Typically it involves establishing four elements: correlation, sequence in time (that is, causes must occur before their proposed effect), a plausible physical or information-theoretical mechanism for an observed effect to follow from a possible cause, and eliminating the possibility of common and alternative ("special") causes. Such analysis usually involves one or more controlled or natural experiments.
|
||||
|
||||
|
||||
== Motivation ==
|
||||
Data analysis is primarily concerned with causal questions. For example, did the fertilizer cause the crops to grow? Or, can a given sickness be prevented? Or, why is my friend depressed? The potential outcomes and regression analysis techniques handle such queries when data is collected using designed experiments. Data collected in observational studies require different techniques for causal inference (because, for example, of issues such as confounding). Causal inference techniques used with experimental data require additional assumptions to produce reasonable inferences with observational data. The difficulty of causal inference under such circumstances is often summed up as "correlation does not imply causation".
|
||||
|
||||
|
||||
== In philosophy and physics ==
|
||||
|
||||
The nature of causality is systematically investigated in several academic disciplines, including philosophy and physics. In academia, there are a significant number of theories on causality; The Oxford Handbook of Causation (Beebee, Hitchcock & Menzies 2009) encompasses 770 pages. Among the more influential theories within philosophy are Aristotle's Four causes and Al-Ghazali's occasionalism. David Hume argued that beliefs about causality are based on experience, and experience similarly based on the assumption that the future models the past, which in turn can only be based on experience – leading to circular logic. In conclusion, he asserted that causality is not based on actual reasoning: only correlation can actually be perceived. Immanuel Kant, according to Beebee, Hitchcock & Menzies (2009), held that "a causal principle according to which every event has a cause, or follows according to a causal law, cannot be established through induction as a purely empirical claim, since it would then lack strict universality, or necessity".
|
||||
Outside the field of philosophy, theories of causation can be identified in classical mechanics, statistical mechanics, quantum mechanics, spacetime theories, biology, social sciences, and law. To establish a correlation as causal within physics, it is normally understood that the cause and the effect must connect through a local mechanism (cf. for instance the concept of impact) or a nonlocal mechanism (cf. the concept of field), in accordance with known laws of nature. From the point of view of thermodynamics, universal properties of causes as compared to effects have been identified through the Second law of thermodynamics, confirming the ancient, medieval and Cartesian view that "the cause is greater than the effect" for the particular case of thermodynamic free energy. This, in turn, is challenged by popular interpretations of the concepts of nonlinear systems and the butterfly effect, in which small events cause large effects due to, respectively, unpredictability and an unlikely triggering of large amounts of potential energy.
|
||||
|
||||
|
||||
=== Causality construed from counterfactual states ===
|
||||
|
||||
Intuitively, causation seems to require not just a correlation, but a counterfactual dependence. Suppose that a student performed poorly on a test and guesses that the cause was his not studying. To prove this, one thinks of the counterfactual – the same student writing the same test under the same circumstances but having studied the night before. If one could rewind history, and change only one small thing (making the student study for the exam), then causation could be observed (by comparing version 1 to version 2). Because one cannot rewind history and replay events after making small controlled changes, causation can only be inferred, never exactly known. This is referred to as the Fundamental Problem of Causal Inference – it is impossible to directly observe causal effects.
|
||||
A major goal of scientific experiments and statistical methods is to approximate as best possible the counterfactual state of the world. For example, one could run an experiment on identical twins who were known to consistently get the same grades on their tests. One twin is sent to study for six hours while the other is sent to the amusement park. If their test scores suddenly diverged by a large degree, this would be strong evidence that studying (or going to the amusement park) had a causal effect on test scores. In this case, correlation between studying and test scores would almost certainly imply causation.
|
||||
Well-designed experimental studies replace equality of individuals as in the previous example by equality of groups. The objective is to construct two groups that are similar except for the treatment that the groups receive. This is achieved by selecting subjects from a single population and randomly assigning them to two or more groups. The likelihood of the groups behaving similarly to one another (on average) rises with the number of subjects in each group. If the groups are essentially equivalent except for the treatment they receive, and a difference in the outcome for the groups is observed, then this constitutes evidence that the treatment is responsible for the outcome, or in other words the treatment causes the observed effect. However, an observed effect could also be caused "by chance", for example as a result of random perturbations in the population. Statistical tests exist to quantify the likelihood of erroneously concluding that an observed difference exists when in fact it does not (for example see P-value).
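The two steps described above, random assignment to groups and a statistical test of the observed difference, can be sketched as follows. This is a minimal illustration with hypothetical function names. It uses an exact permutation test, chosen here for simplicity rather than taken from the text, and it enumerates every possible split, so it is only practical for small groups.

```python
import random
from itertools import combinations
from statistics import mean


def randomly_assign(subjects, rng):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(treated, control):
    """Exact two-sided permutation test for a difference in group means.

    Returns the fraction of all possible regroupings whose absolute difference
    in means is at least as large as the observed one.
    """
    pooled = list(treated) + list(control)
    observed = abs(mean(treated) - mean(control))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance so that floating-point ties count as "at least as large".
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    rng = random.Random(7)
    treatment_ids, control_ids = randomly_assign(range(10), rng)
    print("treatment group:", treatment_ids, "control group:", control_ids)
    # Invented outcome scores for illustration only.
    print("p-value:", permutation_p_value([78, 85, 90, 88, 84], [70, 75, 72, 80, 68]))
```

A small p-value indicates that a difference this large would rarely arise from the random grouping alone; it does not by itself measure the size of the effect.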
|
||||
|
||||
|
||||
== Operational definitions of causality ==
|
||||
Clive Granger created the first operational definition of causality in 1969. Granger made the definition of probabilistic causality proposed by Norbert Wiener operational as a comparison of variances.
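The comparison of variances can be illustrated with a deliberately simplified sketch: a series X is said to Granger-cause a series Y if forecasts of Y from the past of both series have a smaller error variance than forecasts from the past of Y alone. The code below assumes a single lag, zero-mean series, and least-squares fits without an intercept. It only reports the two variances and omits the significance test used in practice. The function name and the synthetic data are hypothetical.

```python
import random


def lagged_residual_variances(cause, effect):
    """Return (restricted, unrestricted) residual variances for lag-1 forecasts of `effect`.

    restricted:   effect[t] ~ a * effect[t-1]
    unrestricted: effect[t] ~ a * effect[t-1] + b * cause[t-1]
    Assumes equally long, zero-mean, non-degenerate series.
    """
    target = effect[1:]
    own_past = effect[:-1]
    other_past = cause[:-1]
    n = len(target)

    # Sums needed for the least-squares normal equations.
    s11 = sum(v * v for v in own_past)
    s22 = sum(w * w for w in other_past)
    s12 = sum(v * w for v, w in zip(own_past, other_past))
    t1 = sum(v * y for v, y in zip(own_past, target))
    t2 = sum(w * y for w, y in zip(other_past, target))

    # Restricted model: one coefficient.
    a = t1 / s11
    restricted = sum((y - a * v) ** 2 for v, y in zip(own_past, target)) / n

    # Unrestricted model: solve the 2x2 normal equations directly.
    det = s11 * s22 - s12 * s12
    a2 = (t1 * s22 - t2 * s12) / det
    b2 = (t2 * s11 - t1 * s12) / det
    unrestricted = sum(
        (y - a2 * v - b2 * w) ** 2
        for v, w, y in zip(own_past, other_past, target)
    ) / n
    return restricted, unrestricted


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: y follows x with a one-step delay plus noise.
    x = [rng.gauss(0, 1) for _ in range(500)]
    y = [0.0] + [0.8 * v + rng.gauss(0, 0.1) for v in x[:-1]]
    print("x -> y (restricted, unrestricted):", lagged_residual_variances(x, y))
    print("y -> x (restricted, unrestricted):", lagged_residual_variances(y, x))
```

In the synthetic example the variance drops sharply only in the direction from x to y, which is the asymmetry the definition is meant to detect.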
|
||||
|
||||
|
||||
== Verification by "truth" ==
|
||||
Peter Spirtes, Clark Glymour, and Richard Scheines introduced the idea of explicitly not providing a definition of causality. Spirtes and Glymour introduced the PC algorithm for causal discovery in 1990. Many recent causal discovery algorithms follow the Spirtes-Glymour approach to verification.
|
||||
|
||||
|
||||
== Exploratory ==
|
||||
|
||||
Exploratory causal analysis (ECA), also known as "data causality" or "causal discovery", is the use of statistical algorithms to infer associations in observed data sets that are potentially causal under strict assumptions. ECA is a type of causal inference distinct from causal modeling and treatment effects in randomized controlled trials. It is exploratory research usually preceding more formal causal research in the same way exploratory data analysis often precedes statistical hypothesis testing in data analysis.
|
||||
|
||||
|
||||
== See also ==
|
||||
Causal inference
|
||||
Causal model
|
||||
Causality
|
||||
Causal reasoning
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
=== Bibliography ===
|
||||
Beebee, Helen; Hitchcock, Christopher; Menzies, Peter (2009). The Oxford Handbook of Causation. Oxford University Press. ISBN 978-0-19-162946-4.
|
||||
|
||||
|
||||
== External links ==
|
||||
Causality Workbench team tools and data
|
||||
University of Pittsburgh CCD team tools
|
||||
@ -4,7 +4,7 @@ chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Center_for_Open_Science"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:31:44.226264+00:00"
|
||||
date_saved: "2026-05-05T09:52:41.470257+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
@ -0,0 +1,69 @@
|
||||
---
|
||||
title: "Chen Zhenyuan fake peer review scandal"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:51.221154+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The Chen Zhenyuan fake peer review scandal was an academic misconduct case that came to light in July 2014. The UK-based publisher SAGE Publications discovered that Chen Zhenyuan, then an associate professor in the Department of Computer Science at National Pingtung University of Education (now National Pingtung University, Minsheng Campus), was suspected of fabricating 130 fake reviewer accounts in order to manipulate peer review and citation networks to secure the publication of his papers. As a result, SAGE retracted 60 papers authored or co-authored by Chen Zhenyuan from the Journal of Vibration and Control (JVC).
|
||||
The mass retraction of 60 papers at once was extremely rare in the global academic community. The editor-in-chief of the Journal of Vibration and Control subsequently resigned and applied for retirement from his university. Among the retracted papers, five were authored by Taiwan’s then Minister of Education, Chiang Wei-ling, with Chen Zhenyuan listed as a co-author. Amid intense public pressure, Chiang Wei-ling resigned as Minister of Education.
|
||||
|
||||
== Background and timeline ==
|
||||
|
||||
=== 2009 ===
|
||||
Chen Zhenyuan joined National Pingtung University of Education as a faculty member and began publishing a large number of papers in the Journal of Vibration and Control (JVC).
|
||||
|
||||
=== February 2011 ===
|
||||
Chen Zhenyuan was promoted to associate professor.
|
||||
|
||||
=== 2013 ===
|
||||
Ali H. Nayfeh, editor-in-chief of JVC, reviewed Chen’s papers and discovered suspicious reviewer accounts. Upon investigation, 130 problematic accounts were identified, all suspected to have been fabricated by Chen Zhenyuan.
|
||||
In September, SAGE Publications sent a formal letter to National Pingtung University of Education requesting assistance in the investigation.
|
||||
In October, the Department of Computer Science at the university established a special investigative committee.
|
||||
In November, Chen Zhenyuan resigned from his associate professor position and officially left the university after the start of the semester in February 2014.
|
||||
The university concluded its internal investigation without publicly releasing the results or notifying the Ministry of Education. The university stated that Chen’s resignation had hindered the investigation, and that many investigative actions could not be carried out during the summer break. It emphasized that a final report would be completed after the new semester began and that the matter would “not be privately settled.”
|
||||
|
||||
== Public exposure and aftermath ==
|
||||
|
||||
=== May 2014 ===
|
||||
After completing his investigation and submitting a report to JVC, editor-in-chief Ali H. Nayfeh resigned from his position and applied for retirement from his university, taking responsibility for the incident.
|
||||
|
||||
=== July 8, 2014 ===
|
||||
The blog Retraction Watch first disclosed the incident to the public.
|
||||
|
||||
=== July 9, 2014 ===
|
||||
The Journal of Vibration and Control issued a statement accusing Chen Zhenyuan of exploiting the journal’s online peer review system by registering 130 fake reviewer accounts. SAGE Publications suspected that Chen used this method to pass peer review and consequently retracted 60 papers associated with him.
|
||||
Major international media outlets, including The New York Times and The Washington Post, reported on the case. Taiwanese media followed shortly thereafter.
|
||||
Academic Lee Chia-tung criticized the incident as being caused by flaws in JVC’s peer review system.
|
||||
|
||||
=== July 11, 2014 ===
|
||||
Because five of the retracted papers were authored by Chiang Wei-ling, with Chen Zhenyuan listed as a co-author, Chiang held a press conference to clarify the matter. Chiang stated that Chen Zhenyuan’s younger brother, Chen Zhenwu, was his supervised student, and that Chen Zhenyuan’s co-authorship was handled by Chen Zhenwu without Chiang’s knowledge.
|
||||
Public opinion called for Chiang Wei-ling to be suspended pending investigation.
|
||||
|
||||
=== July 13, 2014 ===
|
||||
Chen Zhenyuan held a press conference, asserting that Chiang Wei-ling was not involved. In a post-conference statement, Chen claimed that there was no plagiarism or data fabrication in the papers themselves. Regarding the fake accounts, he argued that they were created unintentionally due to keyboard input errors during manuscript submission and revision, which caused the information system to automatically generate multiple co-author accounts, and that there was no malicious intent.
|
||||
|
||||
=== July 14, 2014 ===
|
||||
The Executive Yuan of the Republic of China announced that Chiang Wei-ling had resigned as Minister of Education.
|
||||
|
||||
== Government investigation and sanctions ==
|
||||
|
||||
=== September 25, 2014 ===
|
||||
The Academic Ethics Committee of Taiwan’s Ministry of Science and Technology (MOST) issued preliminary findings confirming peer review fraud. Proposed sanctions included:
|
||||
|
||||
A 10-year ban on Chen Zhenyuan from applying for MOST funding
|
||||
A 5-year ban on Chen Zhenwu
|
||||
Recovery of more than NT$2 million in research grants obtained during the period in question
|
||||
The case was submitted to Minister Chang San-cheng for final determination.
|
||||
During the MOST investigation, it was further found that, in addition to the 60 retracted JVC papers, Chen Zhenyuan had published a paper in Natural Hazards whose introduction was copied from Wikipedia. Approximately one-third of the paper consisted of citations from other works, and about 90% of those citations referenced papers authored by Chen Zhenyuan and Chen Zhenwu, raising suspicions of deliberate inflation of citation metrics.
|
||||
Additionally, Chiang Wei-ling was reported for alleged self-plagiarism in one of the retracted Journal of Vibration and Control papers in which he was listed as a co-author. The paper was suspected of substantially duplicating content from another JVC paper published in 2010 that had not been retracted. Comparative analysis found multiple paragraphs to be nearly identical word-for-word, without proper citation.
|
||||
On October 16, MOST’s Academic Ethics Committee initially ruled to impose a three-year funding ban on Chiang Wei-ling, pending final approval.
|
||||
|
||||
=== October 26, 2014 ===
|
||||
MOST formally decided to impose a 10-year ban on Chen Zhenyuan from applying for research grants. Sanctions against Chen Zhenwu and Chiang Wei-ling remained undecided at that time.
|
||||
|
||||
=== November 26, 2014 ===
|
||||
The British journal Nature published a feature article titled “The Peer-Review Scam”, reviewing the incident and concluding that the Chen Zhenyuan case, together with the Wen Hyeong-yoon case, exposed security vulnerabilities in automated manuscript submission and review systems such as ScholarOne.
|
||||
@ -0,0 +1,29 @@
|
||||
---
|
||||
title: "Chen Zhenyuan fake peer review scandal"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:51.221154+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== February 6, 2015 ===
|
||||
MOST decided to impose a 10-year ban on Chen Zhenwu and to recover principal investigator fees from three research projects. Chiang Wei-ling received a one-year ban.
|
||||
|
||||
== Subsequent developments ==
|
||||
|
||||
=== March 6, 2016 ===
|
||||
The international journal Human Factors and Ergonomics in Manufacturing & Service Industries announced the retraction of three papers published by Chen Zhenwu between 2012 and 2013. These papers listed only Chen Zhenwu and other scholars as authors, without Chen Zhenyuan. Retraction Watch concluded that Chen Zhenwu was not merely an innocent bystander in the case, prompting MOST to reopen its investigation.
|
||||
The Ministry of Education stated that after the incident, National Kaohsiung University of Science and Technology verified that several papers submitted by Chen Zhenwu for promotion contained problems. The university decided to revoke his professorship, demote him to associate professor, and recover excess salary payments. Chen Zhenwu filed an administrative appeal.
|
||||
After court proceedings in which the Ministry of Education initially lost and then appealed, the Supreme Administrative Court ultimately vacated the original judgment, revoked the appeal decision and the original administrative sanction, and ordered the case to be reinvestigated and a new sanction determined. As a result, Chen Zhenwu currently holds the rank of associate professor.
|
||||
|
||||
== Similar cases ==
|
||||
|
||||
=== 2012 ===
|
||||
At Dong-A University in South Korea, agricultural scientist Wen Hyeong-yoon was found to have created 24 fake email accounts to impersonate peer reviewers for his own submissions to the Journal of Medicinal Biology. As a result, 35 papers were retracted.
|
||||
|
||||
=== March 2015 ===
|
||||
BioMed Central retracted 43 papers from its journals, mostly authored by Chinese researchers, after discovering fabricated peer review activities. The Washington Post reported on the incident and explicitly compared it to the Chen Zhenyuan case.
|
||||
|
||||
== References ==
|
||||
24
data/en.wikipedia.org/wiki/Clinical_peer_review-0.md
Normal file
@ -0,0 +1,24 @@
|
||||
---
|
||||
title: "Clinical peer review"
|
||||
chunk: 1/5
|
||||
source: "https://en.wikipedia.org/wiki/Clinical_peer_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:52.389008+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Clinical peer review, also known as medical peer review, is the process by which health care professionals, including those in nursing and pharmacy, evaluate each other's clinical performance. A discipline-specific process may be referenced accordingly (e.g., physician peer review, nursing peer review).
|
||||
Today, clinical peer review is most commonly done in hospitals, but may also occur in other practice settings including surgical centers and large group practices. The primary purpose of peer review is to improve the quality and safety of care. Secondarily, it serves to reduce the organization's vicarious malpractice liability and meet regulatory requirements. In the US, these include accreditation, licensure and Medicare participation. Peer review also supports the other processes that healthcare organizations have in place to assure that physicians are competent and practice within the boundaries of professionally accepted norms.
|
||||
|
||||
== Overview ==
|
||||
Clinical peer review should be distinguished from the peer review that medical journals use to evaluate the merits of a scientific manuscript, from the peer review process used to evaluate health care research grant applications, and from the process by which clinical teaching might be evaluated. All these forms of peer review are conflated in the term "medical peer review". Moreover, the American Medical Association (AMA) has used "medical peer review" to refer not only to the process of improving quality and safety in health care organizations, but also to the process by which adverse actions involving clinical privileges or professional society membership may be pursued. In addition, peer review methods are frequently used by state medical boards with respect to licensing decisions and complaint investigation. They are also used by insurance companies with respect to credentialing and utilization management processes.
|
||||
|
||||
=== Medicine ===
|
||||
In US hospital settings, clinical peer review encompasses a wide variety of activities, whose scope varies across institutions. Virtually all programs perform retrospective medical record review (also known as case review) of the quality of care. Most also include Ongoing Professional Practice Evaluation (OPPE) and Focused Professional Practice Evaluation (FPPE), required by the Joint Commission since 2007. Many also include the management of disruptive behavior and physician health programs.
|
||||
Routine clinical peer review activity (performance assessment) is typically organized separately from the credentialing/privileging process (competence assessment), but the results of peer review inform those decisions. From a legal and regulatory perspective, however, the line is blurred. The definition of a peer review body can be broad, including not only individuals but also (for example, in Oregon), "tissue committees, governing bodies or committees including medical staff committees of a [licensed] health care facility...or any other medical group in connection with bona fide medical research, quality assurance, utilization review, credentialing, education, training, supervision or discipline of physicians or other health care providers."
|
||||
Medical staffs generally rely on generic screens for adverse events to identify cases for peer review, even though that might not be the most efficient or effective method. These are generally applied through administrative data analysis, but referrals for peer review are frequently made by risk managers, nurses and medical staff. The median annual review volume is 1–2% of hospital inpatient admissions. Thus, case review may be the dominant form of adverse event analysis in US hospitals.
|
||||
Case reviews are typically conducted by individual reviewers, but in nearly 70% of hospitals, most reviews are presented and discussed in a committee prior to final decision-making. Nurses now participate on physician review committees in the majority of programs. This extends the trend of the past decade for adoption of multi-specialty representation in the direction of multi-disciplinary peer review. Some of these committees now routinely assess nursing care during the case review process and may even directly address all improvement opportunities.
|
||||
|
||||
=== Nursing ===
|
||||
The American Nurses Association published the first definition of nursing peer review in 1988. It includes the following statements:
|
||||
29
data/en.wikipedia.org/wiki/Clinical_peer_review-1.md
Normal file
@ -0,0 +1,29 @@
|
||||
---
|
||||
title: "Clinical peer review"
|
||||
chunk: 2/5
|
||||
source: "https://en.wikipedia.org/wiki/Clinical_peer_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:52.389008+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The American Nurses Association believes nurses bear primary responsibility and accountability for the quality of nursing care their clients receive. Standards of nursing practice provide a means for measuring the quality of nursing care a client receives. Each nurse is responsible for interpreting and implementing the standards of nursing practice. Likewise, each nurse must participate with other nurses in the decision-making process for evaluating nursing care… Peer review implies that the nursing care delivered by a group of nurses or an individual nurse is evaluated by individuals of the same rank or standing according to established standards of practice…. Peer review is an organized effort whereby practicing professionals review the quality and appropriateness of services ordered or performed by their professional peers. Peer review in nursing is the process by which practicing registered nurses systematically assess, monitor, and make judgments about the quality of nursing care provided by peers as measured against professional standards of practice.
|
||||
In nursing, as in other professions, peer review applies professional control to practice, and is used by professionals to hold themselves accountable for their services to the public and the organization. Peer review plays a role in affecting the quality of outcomes, fostering practice development, and maintaining professional autonomy. American Nurses Association guidelines define peer review as the process by which practitioners of the same rank, profession, or setting critically appraise each other's work performance against established standards. The professionals who are best acquainted with the requirements and demands of the role are the givers and receivers of the review.
|
||||
Nursing peer review appears to have gained momentum as a result of growth of hospital participation in the American Nurses Association's Magnet Program. Even so, less than 7% of U.S. hospitals have qualified. Magnet hospitals must have at least 2 years of experience with a peer review evaluation process designed to improve practice and performance for all RNs. The literature on nursing peer review is more limited than that which has been developed for physician peer review, and has focused more on annual performance appraisal than on case review. No aggregate studies of clinical nursing peer review practices have been published. Nevertheless, more sophisticated studies have been reported.
|
||||
Nursing professionals have historically been less likely to participate in or be subject to peer review. This is changing, as is the previously limited extent of the literature on nursing peer review (for example, no aggregate studies of clinical nursing peer review practices had been published as of 2010).
|
||||
Much of what is mistakenly referred to as "peer review" in clinical practice is really a form of the annual performance evaluation. The annual performance review is a managerial process and does not meet the definition of peer review or produce the outcomes it requires. Other organizational practices may violate the peer review guidelines set forth by the ANA in 1988. The most frequent violation is the performance of direct care peer review by managers. One of the reasons for the confusion is that the ANA guidelines for peer review had been out of print prior to being reprinted and updated in 2011.
|
||||
The early ANA Peer Review Guidelines (1988) and Code of Ethics for Nurses (2001) focus on maintaining standards of nursing practice and upgrading nursing care in three contemporary focus areas for peer review. The three dimensions of peer review are: (a) quality and safety, (b) role actualization, and (c) practice advancement. Each area of contemporary peer review has an organizational, unit, and individual focus. The following six peer review practice principles stem from and are grounded in the 1988 ANA Guidelines and may help to assure an evidence-based and consistent approach to peer review:
|
||||
1. A peer is someone of the same rank.
|
||||
2. Peer review is practice focused.
|
||||
3. Feedback is timely, routine and a continuous expectation.
|
||||
4. Peer review fosters a continuous learning culture of patient safety and best practice.
|
||||
5. Feedback is not anonymous.
|
||||
6. Feedback incorporates the developmental stage of the nurse.
|
||||
Written and standardized operating procedures for peer review also need development and adoption by the direct care staff and incorporation into the professional practice model (shared governance) bylaws.
|
||||
Confusion exists about the differences between the Professional Peer Review process, the Annual Performance Review (APR) and the role of peer evaluation. The APR is a managerial human resource function performed with direct reports, and is aimed at defining, aligning and recognizing each employee's contribution to the organization's success. In contrast, professional peer review is conducted within the professional practice model and is not a managerial accountability. Peer evaluation is the process of getting feedback on one's specific role competencies or "at work" behaviors from people one works with, both within the department and in other departments. "Colleague evaluation" is a more appropriate term than "peer evaluation" as this is not a form of professional peer review.
|
||||
|
||||
=== Pharmacy ===
|
||||
There is limited published information about peer review among pharmacists.
|
||||
|
||||
== History ==
|
||||
22
data/en.wikipedia.org/wiki/Clinical_peer_review-2.md
Normal file
@ -0,0 +1,22 @@
|
||||
---
|
||||
title: "Clinical peer review"
|
||||
chunk: 3/5
|
||||
source: "https://en.wikipedia.org/wiki/Clinical_peer_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:52.389008+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The first documented description of a peer review process is found in the Ethics of the Physician written by Ishap bin Ali al-Rahawi (854–931) of al-Raha, Syria. His work, as well as later Arabic medical manuals, states that a visiting physician must always make duplicate notes of a patient's condition on every visit. When the patient was cured or had died, the notes of the physician were examined by a local medical council of other physicians, who would review the practicing physician's notes to decide whether his or her performance met the required standards of medical care. If their reviews were negative, the practicing physician could face a lawsuit from a maltreated patient. Such practices are known to have continued into the 11th century.
|
||||
In the 1900s, peer review methods evolved in relation to the pioneering work of Codman's End Result System and Ponton's concept of Medical Audit. Lembcke, himself a major contributor to audit methodology, in reviewing this history, notes the pre-emptive influence of hospital standardization promoted by the American College of Surgeons (ACS) following WWI. The Joint Commission (on Accreditation of Hospitals) followed the ACS in this role from 1952. Medicare legislation, enacted in 1965, boosted the stature and influence of the Joint Commission because the conditions for hospital participation required a credible medical care review program and the regulations stipulated that Joint Commission accreditation would guarantee payment eligibility. What was once a sporadic process became hardwired in most hospitals following the medical audit model. The widespread creation of new programs was hampered, however, by limitations in the available process models, tools, training and implementation support.
|
||||
Medical audit is a focused study of the process and/or outcomes of care for a specified patient cohort using pre-defined criteria. Audits are typically organized around a diagnosis, procedure or clinical situation. Medical audit remains the predominant mode of peer review in Europe and other countries.
|
||||
In the US, however, the lack of perceived effectiveness of medical audit led to revisions of Joint Commission standards in 1980. Those modified standards dispensed with the audit requirement and called for an organized system of Quality Assurance (QA). About the same time, hospitals and physicians were facing escalating malpractice insurance costs. In response to these combined pressures, they began to adopt "generic screens" for potential substandard care. These screens were originally developed to evaluate the feasibility of a no-fault medical malpractice insurance plan and were never validated as a tool to improve quality of care. Despite warnings from the developers, their use became widespread. In the process, a QA model for peer review evolved with a narrow focus on the question of whether or not the standard of care had been met. It has persisted despite the many criticisms of its methods and effectiveness. Today, its methods are increasingly recognized to be outdated and incongruent with the Quality Improvement (QI) principles that have been increasingly embraced by healthcare organizations.
|
||||
There is good evidence that the contemporary peer review process can be further improved. The American College of Obstetricians and Gynecologists has offered a Voluntary Review of Quality of Care Program for more than 2 decades. Perceived issues with the adequacy of peer review were an explicit reason for requesting this service by 15% of participating hospitals, yet recommendations for improved peer review process were made to 60%. A 2007 study of peer review in US hospitals found wide variation in practice. The more effective programs had more features consistent with quality improvement principles. There were substantial opportunities for program improvement. The implication was that a new QI model for peer review seems to be evolving.
|
||||
While it is premature to judge the potential effectiveness of this model, a 2009 study confirmed these findings in a separate sampling of hospitals. It also showed that important differences among programs predict a meaningful portion of the variation on 32 objective measures of patient care quality and safety. These findings were extended by cohort follow-up studies conducted in 2011 and 2015–16.
|
||||
The 2015–16 study refined the QI model, identifying 20 features that distinguish the most effective programs. These include, among other factors: aiming first and foremost at improving quality, standardizing the review process, maintaining high quality of case review, promoting self-reporting of adverse events, near misses and hazardous conditions, identifying opportunities for improvement in the review process (as opposed to casting blame), providing timely clinical performance feedback, recognizing clinical excellence, and establishing effective program governance. The study identified these features as additional multivariate predictors of the impact of clinical peer review on quality and safety, medical staff perceptions of the program, and clinician engagement in quality and safety initiatives. The online supplement to the report includes a program self-assessment tool, which is also available as a free online utility. Despite a persistently high annual rate of major program change, about two-thirds of programs still have significant opportunity for improvement. It is argued that the outmoded QA model perpetuates a culture of blame that is toxic to efforts to advance quality and high reliability among both physicians and nurses.
|
||||
|
||||
== Legal and regulatory environment ==
|
||||
|
||||
=== United States ===
|
||||
In the US, peer review activity is generally protected under state statutes. The protection may include confidentiality of the review process and protection to reviewers and institutions for good faith efforts to improve quality and safety through review activity. Such statutes may also specify whether or not the physician conducting the review must be in active practice. The nature of that protection varies widely. For example, Texas is generally considered to have fairly robust protections, whereas Florida protections were undermined by a constitutional amendment that exposed peer review data to discovery.
|
||||
43
data/en.wikipedia.org/wiki/Clinical_peer_review-3.md
Normal file
@ -0,0 +1,43 @@
|
||||
---
|
||||
title: "Clinical peer review"
|
||||
chunk: 4/5
|
||||
source: "https://en.wikipedia.org/wiki/Clinical_peer_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:52.389008+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
==== Health Care Quality Improvement Act ====
|
||||
US federal law generally trumps state law. The federal Health Care Quality Improvement Act ("HCQIA"), 42 U.S.C. § 11112, enacted in 1986, sets standards that professional review actions must meet in order to receive protection under the Act. It requires that the action be taken in the reasonable belief that it will advance healthcare quality based on facts obtained through reasonable efforts with due process and fairness to the involved physician. When peer review leads to an action to limit or revoke clinical privileges, the physician is entitled to both a fair hearing and the right of appeal.
|
||||
Congress explicitly stated the rationale for this legislation as follows:
|
||||
|
||||
|
||||
(1) The increasing occurrence of medical malpractice and the need to improve the quality of medical care have become nationwide problems that warrant greater efforts than those that can be undertaken by any individual State.
|
||||
(2) There is a national need to restrict the ability of incompetent physicians to move from State to State without disclosure or discovery of the physician's previous damaging or incompetent performance.
|
||||
(3) This nationwide problem can be remedied through effective professional peer review.
|
||||
(4) The threat of private money damage liability under Federal laws, including treble damage liability under Federal antitrust law, unreasonably discourages physicians from participating in effective professional peer review.
|
||||
(5) There is an overriding national need to provide incentive and protection for physicians engaging in effective professional peer review.
|
||||
|
||||
From the time of the HCQIA, there has been good alignment between regulatory and accrediting bodies with respect to due process requirements for physician disciplinary actions. These formalities apply primarily to questions of competence (credentialing and privileging) rather than performance (routine clinical peer review). It would be most unusual to find a hospital whose medical staff bylaws did not conform.
|
||||
|
||||
==== National Practitioner Data Bank ====
|
||||
HCQIA enabled the creation of a National Practitioner Data Bank (NPDB) and required hospitals, state medical boards and other health care entities that engage in formal peer review activities to report all disciplinary actions that affect clinical privileges for more than 30 days. This includes incidents in which a provider voluntarily resigns privileges while under investigation. An entity that fails to report as required may lose HCQIA protections for three years.
|
||||
The HCQIA (§ 11135) requires hospitals to query the NPDB in their initial credentialing and biennial provider re-credentialing processes. Structurally, this process fulfills the congressional intention of restricting movement of incompetent physicians. Disciplinary actions may be a red flag for issues of global incompetence, but the problem may be focal, not global. Thus, the NPDB has been criticized for the unintended consequence of adverse economic impact on reported providers, regardless of the magnitude of the issue. Even so, gross under-reporting of adverse actions remains an issue.
|
||||
|
||||
==== Patient Safety and Quality Improvement Act ====
|
||||
The Patient Safety and Quality Improvement Act of 2005 ("Patient Safety Act"), Public Law 109–41, 42 U.S.C. 299b-21 to 299b-26, amended title IX of the Public Health Service Act to create a general framework to support and protect voluntary initiatives to improve quality and patient safety in all healthcare settings through reporting to Patient Safety Organizations (PSO). This was intended to include peer review. The final rule promulgated by the Agency for Healthcare Research and Quality in 2008 at 42 CFR Part 3 also includes protections against reprisals for good-faith reporters of adverse events, near misses and hazardous conditions. Several Florida health systems subsequently formed PSOs in expectation of using federal statutory protections to maintain the confidentiality of peer review activity that would have been exposed under Amendment 7. The subsequent legal challenges to this strategy go beyond the scope of this article.
|
||||
|
||||
== External peer review ==
|
||||
In the US, following enactment of the HCQIA, executives from various national medical associations and health care organizations formed the non-profit American Medical Foundation for Peer Review and Education to provide independent assessment of medical care.
|
||||
A 2007 study showed that the vast majority of physician peer review is done "in house": 87% of US hospitals send less than 1% of their peer review cases to external agencies. The external review process is generally reserved for cases requiring special expertise for evaluation or for situations in which the independent opinion of an outside reviewer would be helpful. The process is significantly more costly than in-house review, since the majority of hospital review is done as a voluntary contribution of the medical staff.
|
||||
Mandated external peer review has not played an enduring role in the US, but was tested in the 1970s. A 1972 amendment to the Social Security Act established Professional Standards Review Organizations (PSRO) with a view to controlling escalating Medicare costs through physician-organized review. The PSRO model was not considered to be effective and was replaced in 1982 by a further act of Congress which established Utilization and Quality Control Peer Review Organizations (PROs). This model too was fraught with limitations. Studies of its methods called into question its reliability and validity for peer review. A survey of Iowa state medical society members in the early 1990s regarding perceptions of the PRO program illustrated the potential harm of a poorly designed program. Furthermore, the Institute of Medicine issued a report identifying the system of care as the root cause of many instances of poor quality. As a result, in the mid-1990s, the PROs changed their focus and methods and began to de-emphasize their role as agents of external peer review. The change was completed by 2002, when they were renamed Quality Improvement Organizations.
|
||||
In contrast, external peer review has been used by German hospitals to lower their standardized mortality rate.
|
||||
|
||||
== Abuse ==
|
||||
32
data/en.wikipedia.org/wiki/Clinical_peer_review-4.md
Normal file
@ -0,0 +1,32 @@
|
||||
---
|
||||
title: "Clinical peer review"
|
||||
chunk: 5/5
|
||||
source: "https://en.wikipedia.org/wiki/Clinical_peer_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:52.389008+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Sham peer review is a name given to the abuse of a medical peer review process to attack a doctor for personal or other non-medical reasons. State medical boards have withheld medical records from court to frame innocent physicians as negligent. Another type of review similar to sham peer review is "incompetent peer review," in which the reviewers are unable to accurately assess the quality of care provided by their colleagues.
|
||||
Controversy exists over whether medical peer review has been used as a competitive weapon in turf wars among physicians, hospitals, HMOs, and other entities and whether it is used in retaliation for whistleblowing. Many medical staff laws specify guidelines for the timeliness of peer review, in compliance with JCAHO standards, but state medical boards are not bound by such timeliness requirements and occasionally litigate cases for more than five years. Abuse is also referred to as "malicious peer review" by those who consider it endemic, and they allege that the creation of the National Practitioner Data Bank under the 1986 Health Care Quality Improvement Act (HCQIA) facilitates such abuse, creating a 'third-rail' or a 'first-strike' mentality by granting significant immunity from liability to doctors and others who participate in peer reviews.
|
||||
The American Medical Association conducted an investigation of medical peer review in 2007 and concluded that while it is easy to allege misconduct, proven cases of malicious peer review are rare. Parenthetically, it is difficult to prove wrongdoing on the part of a review committee that can use its clinical and administrative privileges to conceal exculpatory evidence.
|
||||
The California legislature framed its statutes so as to allow that a peer review can be found in court to have been improper due to bad faith or malice, in which case the peer reviewers' immunities from civil liability "fall by the wayside".
|
||||
Dishonesty by healthcare institutions is well-described in the literature and there is no incentive for those that lie to the public about patient care to be honest with a peer review committee.
|
||||
Incidents of alleged sham peer review are numerous and include cases such as Khajavi v. Feather River Anesthesiology Medical Group, Mileikowsky v. Tenet, and Roland Chalifoux.
|
||||
Defenders of the Health Care Quality Improvement Act state that the National Practitioner Data Bank protects patients by helping prevent errant physicians who have lost their privileges in one state from traveling to practice in another state. Physicians who allege they have been affected by sham peer review are also less able to find work when they move to another state, as Roland Chalifoux did. Moreover, neither opponents nor supporters of the NPDB can be completely satisfied, as Chalifoux's case shows that just as physicians who were unjustly accused may be deprived of work in this way, those who have erred might still find work in other states.
|
||||
|
||||
== See also ==
|
||||
Subpoena duces tecum
|
||||
Utilization management
|
||||
|
||||
== References ==
|
||||
|
||||
== Further reading ==
|
||||
NPDB Guidebook: The NPDB Guidebook (last accessed 2/11/19)
|
||||
Steve Twedt (2003-10-26). "The Cost of Courage: How the tables turn on doctors". Post-Gazette. PG Publishing.
|
||||
Magazine Staff (2006-08-01). "All Is NOT Calm on the Hospital-Medical Staff Front". Southern California Physician. LACMA Services Inc. Archived from the original on 2006-09-02.
|
||||
Charles Bond (2005-11-15). "Editorial in Response to "What Is Sham Peer Review?"". Medscape General Medicine. 7 (4): 48.
|
||||
Goldstein H. (2006). "Appraising the performance of performance appraisals". IEEE Spectrum. 38 (11): 61–63. doi:10.1109/6.963247.
|
||||
Philip L. Merkel (Winter 2004). "Physicians Policing Physicians: the Development of Medical Staff Peer Review Law at California Hospitals" (PDF). University of San Francisco Law Review. 38 U.S.F. L. Rev. 301.
|
||||
Bryan G. Hall (Fall 2003). "The Health Care Quality Improvement Act of 1986 and Physician Peer Reviews: Success or Failure?" (PDF). University of South Dakota Elder Law Forum.
|
||||
74
data/en.wikipedia.org/wiki/Code_review-0.md
Normal file
@ -0,0 +1,74 @@
|
||||
---
|
||||
title: "Code review"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Code_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:53.547978+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Code review (sometimes referred to as peer review) is a software quality assurance activity in which one or more people examine the source code of a computer program, either after implementation or during the development process. At least one of the people examining the code must not be its author. The people performing the checking, excluding the author, are called "reviewers".
|
||||
Code review differs from related software quality assurance techniques like static code analysis, self-checks, testing, and pair programming. Static analysis relies primarily on automated tools, self-checks involve only the author, testing requires code execution, and pair programming is performed continuously during development rather than as a separate step.
|
||||
|
||||
|
||||
== Goal ==
|
||||
Although direct discovery of quality problems is often the main goal, code reviews are usually performed to reach a combination of goals:
|
||||
|
||||
Improving code quality – Improve internal code quality and maintainability through better readability, uniformity, and understandability
|
||||
Detecting defects – Improve quality regarding external aspects, especially correctness, but also find issues such as performance problems, security vulnerabilities, and injected malware
|
||||
Learning/Knowledge transfer – Sharing codebase knowledge, solution approaches, and quality expectations, both to the reviewers and the author
|
||||
Increase sense of mutual responsibility – Increase a sense of collective code ownership and solidarity
|
||||
Finding better solutions – Generate ideas for new and better solutions and ideas beyond the specific code at hand
|
||||
Complying with QA guidelines, ISO/IEC standards – Code reviews are mandatory in some contexts, such as air traffic software and safety-critical software
|
||||
|
||||
|
||||
== Review types ==
|
||||
Several variations of code review processes exist, with additional types specified in IEEE 1028.
|
||||
|
||||
Management reviews
|
||||
Technical reviews
|
||||
Inspections
|
||||
Walk-throughs
|
||||
Audits
|
||||
|
||||
|
||||
=== Inspection (formal) ===
|
||||
The first code review process that was studied and described in detail was called "Inspection" by its inventor, Michael Fagan. Fagan inspection is a formal process that involves a careful and detailed execution with multiple participants and phases. In formal code reviews, software developers attend a series of meetings to examine code line by line, often using printed copies. Research has shown formal inspections to be extremely thorough and highly effective at identifying defects.
|
||||
|
||||
|
||||
=== Regular change-based code review (Walk-throughs) ===
|
||||
Software development teams typically adopt a more lightweight review process in which the scope of each review relates to changes to the codebase corresponding to a ticket, user story, commit, or some other unit of work. Furthermore, there are rules or conventions that integrate the review task into the development workflow through conventions like mandatory review of all tickets, commonly as part of a pull request, instead of explicitly planning each review. Such a process is called "regular, change-based code review". There are many variations of this basic process.
|
||||
A 2017 survey of 240 development teams found that 90% of teams using code review followed a change-based process, with 60% specifically using regular change-based review.
|
||||
Major software corporations known to use change-based code review include Microsoft, Google, and Facebook.
|
||||
|
||||
|
||||
== Efficiency and effectiveness ==
|
||||
Ongoing research by Capers Jones analyzing over 12,000 software development projects found formal inspections had a latent defect discovery rate of 60-65%, while informal inspections detected fewer than 50% of defects. The latent defect discovery rate for most forms of testing is about 30%. A code review case study published in the book Best Kept Secrets of Peer Code Review contradicted the Capers Jones study, finding that lightweight reviews can uncover as many bugs as formal reviews while being faster and less costly.
|
||||
Studies indicate that up to 75% of code review comments affect software evolvability and maintainability rather than functionality, suggesting that code reviews are an excellent tool for software companies with long product or system life cycles. Correspondingly, fewer than 15% of the issues discussed in code reviews relate directly to bugs.
|
||||
|
||||
|
||||
=== Guidelines ===
|
||||
Research indicates review effectiveness correlates with review speed. Optimal code review rates range from 200 to 400 lines of code per hour. Inspecting and reviewing more than a few hundred lines of code per hour for critical software (such as safety critical embedded software) may be too fast to find errors.
|
||||
|
||||
|
||||
=== Supporting tools ===
|
||||
Static code analysis tools assist reviewers by automatically checking source code for known vulnerabilities and defect patterns, particularly for large chunks of code. A 2012 study by VDC Research reports that 17.6% of the embedded software engineers surveyed currently use automated tools to support peer code review and 23.7% plan to use them within two years.
|
||||
|
||||
|
||||
== See also ==
|
||||
Committer
|
||||
Software review
|
||||
Software quality
|
||||
Best coding practices
|
||||
List of software development philosophies
|
||||
|
||||
|
||||
== External links ==
|
||||
Five Code Review Antipatterns
|
||||
Java Magazine, Best of 2020
|
||||
|
||||
|
||||
== References ==
|
||||
159
data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-0.md
Normal file
@ -0,0 +1,159 @@
|
||||
---
|
||||
title: "Collocation (remote sensing)"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Collocation_(remote_sensing)"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:37.734675+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Collocation is a procedure used in remote sensing
|
||||
to match measurements from two or more different instruments.
|
||||
This is done for two main reasons:
|
||||
for validation purposes when comparing measurements of the same variable,
|
||||
and to relate measurements of two different variables
|
||||
either for performing retrievals or for prediction.
|
||||
In the second case the data is later fed into some type of statistical
|
||||
inverse method
|
||||
such as an artificial neural network, statistical classification algorithm,
|
||||
kernel estimator or linear least squares.
|
||||
In principle, most collocation problems can be solved by a nearest neighbor search,
|
||||
but in practice there are many other considerations involved and the best method is
|
||||
highly specific to the particular matching of instruments.
|
||||
Here we deal with some of the most important considerations along with specific examples.
|
||||
There are at least two main considerations when performing collocations.
|
||||
The first is the sampling pattern of the instrument.
|
||||
Measurements may be dense and regular, such as those from a cross-track
|
||||
scanning satellite instrument. In this case, some form of interpolation
|
||||
may be appropriate. On the other hand, the measurements may be
|
||||
sparse, such as a one-off field campaign designed for some
|
||||
particular validation exercise.
|
||||
The second consideration is the instrument footprint, which
|
||||
can range from something approaching a point measurement
|
||||
such as that of a radiosonde, to several kilometers in diameter,
such as that of a satellite-mounted microwave radiometer. In the latter case, it is appropriate
|
||||
to take into account the instrument antenna pattern when
|
||||
making comparisons with another instrument having both a smaller
|
||||
footprint and a denser sampling, that is, several measurements
|
||||
from the one instrument will fit into the footprint of the other.
|
||||
Just as the instrument has a spatial footprint, it will also have
|
||||
a temporal footprint, often called the integration time.
|
||||
While the integration time is usually less than a second,
|
||||
which for meteorological applications is essentially instantaneous,
|
||||
there are many instances where some form of time averaging can considerably
|
||||
ease the collocation process.
|
||||
The collocations will need to be screened based on both the time
|
||||
and length scales of the phenomenon of interest.
|
||||
This will further facilitate the collocation process since
|
||||
remote sensing and other measurement data is almost always
|
||||
binned in some way.
|
||||
Certain atmospheric phenomena such as clouds or convection are quite transient
|
||||
so that we need not consider collocations with a time error of more than an hour or so.
|
||||
Sea ice, on the other hand, moves and evolves quite slowly, so that
|
||||
measurements separated by as much as a day or more might still be useful.
|
||||
|
||||
== Satellites ==
|
||||
|
||||
The satellites that most concern us are those with a low-Earth, polar orbit since geostationary satellites view the same point throughout their lifetime.
|
||||
The diagram shows measurements from AMSU-B
|
||||
instruments mounted on three satellites over a period of 12 hours.
|
||||
This illustrates both the orbit path and the scan pattern which runs crosswise.
|
||||
Since the orbit of a satellite is deterministic,
|
||||
barring orbit maneuvers, we can predict the location of the
|
||||
satellite at a given time and, by extension, the location of
|
||||
the measurement pixels.
|
||||
In theory, collocations can be performed by inverting the
|
||||
determining equations starting from the desired time period.
|
||||
In practice, partially processed data (usually referred to as
|
||||
level 1b, 1c or level 2) contain the coordinates of each of
|
||||
the measurement pixels and
|
||||
it is common to simply feed these coordinates to a nearest neighbor search.
|
||||
As mentioned previously, the satellite data is always binned
|
||||
in some manner.
|
||||
At minimum, the data will be arranged in
|
||||
swaths extending from pole to pole.
|
||||
The swaths will be labelled by time period and the
|
||||
approximate location known.
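The nearest neighbor approach described above can be sketched as follows. This is a minimal, brute-force illustration, not the method of any particular study: the function names, the dictionary layout (`lat`, `lon` in degrees, `t` in hours) and the default thresholds of 50 km and 1 hour are assumptions chosen for the example. For large data sets a spatial index such as a k-d tree would normally replace the linear scan.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def collocate(target, pixels, max_km=50.0, max_hours=1.0):
    """Return the nearest pixel that passes both the time and the
    distance screen, or None if no pixel qualifies."""
    best, best_dist = None, max_km
    for px in pixels:
        # Screen on time first: it is the cheaper test.
        if abs(px["t"] - target["t"]) > max_hours:
            continue
        d = great_circle_km(target["lat"], target["lon"], px["lat"], px["lon"])
        if d <= best_dist:
            best, best_dist = px, d
    return best


# Hypothetical data: one reference measurement and three satellite pixels.
target = {"lat": 70.0, "lon": 20.0, "t": 12.0}
pixels = [
    {"lat": 70.1, "lon": 20.0, "t": 12.5},  # close in space and time
    {"lat": 70.0, "lon": 20.1, "t": 15.0},  # closest in space, too late
    {"lat": 75.0, "lon": 20.0, "t": 12.0},  # coincident in time, too far
]
match = collocate(target, pixels)
```

Both thresholds should be chosen from the time and length scales of the phenomenon of interest, as discussed earlier.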
|
||||
|
||||
== Radiosondes ==
|
||||
|
||||
Radiosondes are particularly important for collocation studies
|
||||
because they measure atmospheric variables more accurately and more
|
||||
directly than satellite or other remote-sensing instruments.
|
||||
In addition, radiosonde samples are effectively instantaneous point measurements.
|
||||
One issue with radiosondes carried aloft by weather balloons is
|
||||
balloon drift. In one study,
|
||||
this is handled by averaging all the satellite pixels within a 50 km radius
|
||||
of the balloon launch.
|
||||
|
||||
If high-resolution sonde data, which normally has a constant
|
||||
sampling rate or includes the measurement time, is used,
|
||||
then the lateral motion can be traced from the wind data.
|
||||
Even with low-resolution data, the motion can still
|
||||
be approximated by assuming a constant ascent rate.
|
||||
Except for a short stretch towards the end,
|
||||
the linear ascent can be clearly seen in the figure above.
|
||||
We can show that the ascent rate of a balloon is given
|
||||
by the following equation
|
||||
|
||||
|
||||
|
||||
|
||||
$$v = \sqrt{\frac{g k h \left(1 - R_a / R_s\right)}{c_D}}$$
|
||||
|
||||
|
||||
where g is gravitational acceleration,
|
||||
k relates the height, h, and surface area, A,
|
||||
of the balloon to its volume: V = khA;
|
||||
Rs is the equivalent "gas constant" of the balloon,
|
||||
Ra is the gas constant of the air
|
||||
and cD is the drag coefficient of the balloon.
|
||||
Substituting some sensible values for each of the constants,
|
||||
k = 1 (the balloon is a perfect cylinder), h = 2 m, cD = 1,
and taking Rs to be the gas constant of helium,
|
||||
returns an ascent rate of 4.1 m/s. Compare this with the
|
||||
values shown in the histogram which compiles all of the
|
||||
radiosonde launches from the Polarstern research vessel
|
||||
over a period of eleven years between 1992 and 2003.
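The quoted ascent rate can be checked numerically. The sketch below assumes standard values that are not given in the text: g = 9.81 m/s², a specific gas constant of about 287 J/(kg·K) for dry air (Ra) and about 2077 J/(kg·K) for helium as the lifting gas (Rs). The function and variable names are illustrative.

```python
import math


def balloon_ascent_rate(g, k, h, r_air, r_gas, c_d):
    """Ascent rate v = sqrt(g*k*h*(1 - R_a/R_s) / c_D) in m/s."""
    return math.sqrt(g * k * h * (1.0 - r_air / r_gas) / c_d)


G = 9.81           # gravitational acceleration, m/s^2 (assumed)
R_AIR = 287.05     # specific gas constant of dry air, J/(kg K) (assumed)
R_HELIUM = 2077.1  # specific gas constant of helium, J/(kg K) (assumed)

# Values from the text: k = 1, h = 2 m, c_D = 1.
v = balloon_ascent_rate(g=G, k=1.0, h=2.0, r_air=R_AIR, r_gas=R_HELIUM, c_d=1.0)
print(f"ascent rate: {v:.1f} m/s")  # prints "ascent rate: 4.1 m/s"
```

With these inputs the term 1 − Ra/Rs is about 0.86, so the result is dominated by the product gkh and is only weakly sensitive to the exact gas constants.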
|
||||
85
data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-1.md
Normal file
@ -0,0 +1,85 @@
|
||||
---
|
||||
title: "Collocation (remote sensing)"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Collocation_(remote_sensing)"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:37.734675+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
== Interpolation ==
|
||||
For gridded data such as assimilation or reanalysis data,
|
||||
interpolation is likely the most appropriate method for performing any type of comparison.
|
||||
A specific point in both physical position and time is easy to locate
|
||||
within the grid and interpolation performed between the nearest neighbors.
|
||||
Linear interpolation (bilinear, trilinear etc.) is the most common,
|
||||
while cubic interpolation is also used but is probably not worth the extra computational overhead.
|
||||
If the variable of interest has a relatively smooth rate of change (temperature is a good example of this because it
|
||||
has a diffusion mechanism, radiative transfer, not available to other atmospheric variables),
|
||||
then interpolation can eliminate much of the error associated with collocation.
|
||||
Interpolation may also be appropriate for many types of satellite instruments,
|
||||
for instance a cross-track scanning instrument like Landsat.
|
||||
In one study, data derived from the Advanced Microwave Sounding Unit (AMSU) are
|
||||
interpolated (although not for the purposes of collocation) using a slight variation
|
||||
of trilinear interpolation.
|
||||
Since measurements within a single scan track are laid out in an approximately rectangular
|
||||
grid, bilinear interpolation can be performed.
|
||||
By searching for the nearest overlapping scan track both forwards and backwards in time,
|
||||
the spatial interpolates can then be interpolated in time.
|
||||
This technique works better with derived quantities rather than raw brightness temperatures since
|
||||
the scan angle will already have been accounted for.
|
||||
For instruments with a more irregular sampling pattern, such as the Advanced Microwave
|
||||
Scanning Radiometer-EOS (AMSR-E) instrument which has a circular scanning pattern,
|
||||
we need a more general form of interpolation such as kernel estimation.
|
||||
A method commonly used for this particular instrument, as well as SSM/I,
|
||||
is a simple daily average within regularly gridded spatial bins.
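The two-stage scheme described in this section (bilinear interpolation within a scan, followed by linear interpolation in time between two scans) might be sketched as below. This is an illustration under simplifying assumptions, not the procedure of any cited study: the grid is treated as exactly rectangular with unit spacing, positions are given as fractional grid indices inside the grid, and all names are hypothetical.

```python
def bilinear(grid, x, y):
    """Bilinear interpolation on a regular grid.

    grid[j][i] holds the value at integer coordinates (i, j);
    x and y are fractional indices within the grid bounds.
    """
    i0, j0 = int(x), int(y)
    i1 = min(i0 + 1, len(grid[0]) - 1)  # clamp at the grid edge
    j1 = min(j0 + 1, len(grid) - 1)
    fx, fy = x - i0, y - j0
    row0 = (1 - fx) * grid[j0][i0] + fx * grid[j0][i1]
    row1 = (1 - fx) * grid[j1][i0] + fx * grid[j1][i1]
    return (1 - fy) * row0 + fy * row1


def interpolate_in_time(v_before, t_before, v_after, t_after, t):
    """Linear interpolation between two spatial interpolates."""
    w = (t - t_before) / (t_after - t_before)
    return (1 - w) * v_before + w * v_after


# Hypothetical 2x2 patches from the scans before and after the target time.
scan_before = [[0.0, 1.0], [2.0, 3.0]]
scan_after = [[2.0, 3.0], [4.0, 5.0]]
v0 = bilinear(scan_before, 0.5, 0.5)
v1 = bilinear(scan_after, 0.5, 0.5)
value = interpolate_in_time(v0, 0.0, v1, 2.0, 0.5)
```

Real swath data are only approximately rectangular, so a practical implementation would first map geographic coordinates to fractional scan-line and pixel indices.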
|
||||
|
||||
== Trajectories ==
|
||||
To collocate measurements of a medium- to long-lived atmospheric tracer with a second instrument, running trajectories can considerably improve the accuracy. It also simplifies the analysis somewhat: a trajectory is run both forwards and backwards from the measurement location across the desired time window. Note that the acceptable time window has now become longer because the error from transport-induced changes in the tracer is removed: the tracer lifetime would be a good window to use. Since the trajectories provide a location for every point in time within the time window, there is no need to check multiple measurements from the second instrument. Every time within the trajectory is checked for the distance criterion but within a very narrow window. Alternatively, the exact times of the measurements for the second instrument are interpolated within the trajectory. Only the smallest distance error below the threshold is used and the distance criterion can be made smaller as a consequence.
|
||||
|
||||
== Example: Pol-Ice Campaign ==
|
||||
|
||||
Collocations of sea ice thickness and brightness temperatures taken during the
|
||||
Pol-Ice Campaign are an excellent example since they illustrate many of the most important principles as well as demonstrating the necessity of taking into account the individual case. The Pol-Ice campaign was conducted in the northern Baltic in March 2007 as part of the SMOS-Ice project in preparation for the launch of the Soil Moisture and Ocean Salinity satellite.
|
||||
Because of the low frequency of the SMOS instrument, it is hoped that it will render
|
||||
information on sea ice thickness; the campaign therefore comprised measurements
|
||||
of both sea ice thickness and emitted brightness temperature.
|
||||
Brightness temperatures were measured with the EMIRAD L-band microwave radiometer carried on board an airplane. Ice thickness was measured with the E-M Bird ice thickness meter, which was carried by a helicopter. The E-M Bird measures ice thickness with a combination of inductance measurements to determine the location of the ice-water interface and a laser altimeter to measure the height of the ice surface.
|
||||
The map above shows the flight tracks of both instruments which were approximately coincident but obviously subject to pilot error.
|
||||
|
||||
Since the flight paths of both aircraft were approximately linear, the first step in the collocation process was to convert all the coincident flights to Cartesian coordinates, with the x-axis being distance along the flight path and the y-axis the transverse distance. In this way, collocations can be performed in two ways: crudely, by matching only the x distances, and more precisely, by matching both coordinates.
|
||||
More importantly, the footprint size of the radiometer is many times larger
|
||||
than that of the E-M Bird meter. The figure to the left shows
|
||||
the antenna response function for the radiometer.
|
||||
The full width at half maximum is 31 degrees.
|
||||
Since the aircraft was flying at approximately 500 m, this translates
|
||||
to a footprint size of 200 m or more.
|
||||
Meanwhile, the footprint size of the E-M Bird was roughly 40 m
|
||||
with a sample spacing of only 2 to 4 m.
|
||||
Rather than looking to nearest neighbors, which would have produced
|
||||
poor results, a weighted average of the thickness measurements was
|
||||
performed for each radiometer measurement.
|
||||
Weights were calculated based on the radiometer response function which is almost
|
||||
a perfect Gaussian up to about 45 degrees.
|
||||
Points could be excluded based on distance along the flight
|
||||
path.
|
||||
For validation of sea ice emissivity forward model calculations,
|
||||
this was further refined by performing an emissivity calculation
|
||||
for each thickness measurement and averaging over the radiometer
|
||||
footprint.
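A weighted footprint average of this kind can be sketched as follows. The sketch is a one-dimensional simplification that is not taken from the campaign's processing code: it assumes a nadir-looking radiometer, considers only distance along the flight line, models the response as a Gaussian whose full width at half maximum is 31 degrees, and drops samples beyond 45 degrees off nadir. All names are hypothetical.

```python
import math

FWHM_DEG = 31.0
# Convert full width at half maximum to a Gaussian standard deviation.
SIGMA_DEG = FWHM_DEG / (2.0 * math.sqrt(2.0 * math.log(2.0)))
MAX_ANGLE_DEG = 45.0  # the response is treated as Gaussian only up to here


def footprint_average(x_radiometer, altitude, samples):
    """Gaussian-weighted mean thickness seen by one radiometer sample.

    samples is a list of (x, thickness) pairs, with x the along-track
    position in the same units as altitude.
    Returns None if no sample falls inside the footprint.
    """
    weight_sum = 0.0
    total = 0.0
    for x, thickness in samples:
        # Off-nadir angle at which the radiometer sees this sample.
        angle = math.degrees(math.atan2(abs(x - x_radiometer), altitude))
        if angle > MAX_ANGLE_DEG:
            continue
        w = math.exp(-0.5 * (angle / SIGMA_DEG) ** 2)
        weight_sum += w
        total += w * thickness
    return total / weight_sum if weight_sum > 0.0 else None


# Hypothetical thickness samples (position in m, thickness in m).
samples = [(-100.0, 1.0), (100.0, 3.0), (600.0, 100.0)]
mean_thickness = footprint_average(0.0, 500.0, samples)
```

At an altitude of 500 m, the half-maximum points lie at 500 · tan(15.5°), roughly 139 m either side of nadir, which gives a footprint diameter of about 277 m, consistent with the "200 m or more" quoted above.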
|
||||
|
||||
The figure below illustrates relative measurement locations
|
||||
from each of the instruments used in the Pol-Ice campaign.
|
||||
Two overpasses are shown: one from the airplane carrying the
|
||||
EMIRAD radiometer and one from the helicopter carrying the
|
||||
E-M Bird instrument.
|
||||
The x-axis is along the line of the flight path.
|
||||
EMIRAD footprints are drawn with lines, E-M Bird
|
||||
inductance measurements are represented by circles
|
||||
and LIDAR measurements with dots.
|
||||
|
||||
== References ==
|
||||
21
data/en.wikipedia.org/wiki/Combinatorial_data_analysis-0.md
Normal file
@ -0,0 +1,21 @@
|
||||
---
|
||||
title: "Combinatorial data analysis"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Combinatorial_data_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:38.930340+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
In statistics, combinatorial data analysis (CDA) is the study of data sets where the order in which objects are arranged is important. CDA can be used either to determine how well a given combinatorial construct reflects the observed data, or to search for a suitable combinatorial construct that does fit the data.
|
||||
|
||||
|
||||
== See also ==
|
||||
Cluster analysis
|
||||
Geometric data analysis
|
||||
Structured data analysis (statistics)
|
||||
Seriation (statistics)
|
||||
|
||||
|
||||
== References ==
|
||||
14
data/en.wikipedia.org/wiki/Commentary_article-0.md
Normal file
@ -0,0 +1,14 @@
|
||||
---
|
||||
title: "Commentary article"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Commentary_article"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:54.688511+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
A commentary article is a short narrowly focused article that is usually commissioned by an Academic journal, though they are sometimes published from unsolicited submissions.
|
||||
|
||||
|
||||
== References ==
|
||||
32
data/en.wikipedia.org/wiki/Concept_drift-0.md
Normal file
@ -0,0 +1,32 @@
|
||||
---
|
||||
title: "Concept drift"
|
||||
chunk: 1/4
|
||||
source: "https://en.wikipedia.org/wiki/Concept_drift"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:40.159460+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.
|
||||
|
||||
== Predictive model decay ==
|
||||
In machine learning and predictive analytics this drift phenomenon is called concept drift. In machine learning, a common element of a data model is the statistical properties, such as the probability distribution, of the actual data. If they deviate from the statistical properties of the training data set, the learned predictions may become invalid unless the drift is addressed.
|
||||
|
||||
== Data configuration decay ==
|
||||
Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" is a change in the meaning of data while the structure does not change. In many cases this may happen in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes in other areas of the software system.
|
||||
For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., due to changes in business model, system updates, or switching the platform on which the system operates.
|
||||
In the case of cloud computing, infrastructure drift that may affect the applications running in the cloud may be caused by updates of the cloud software.
|
||||
There are several types of detrimental effects of data drift on data fidelity. Data corrosion is passing the drifted data into the system undetected. Data loss happens when valid data are ignored due to non-conformance with the applied schema. Squandering is the phenomenon when new data fields are introduced upstream in the data processing pipeline, but somewhere downstream these data fields are absent.
|
||||
|
||||
== Inconsistent data ==
"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as due to errors during data input.
"Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple way to detect such drift is to run checksums regularly. However, the remedy may not be so easy.
|
||||
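Checksum-based detection of drift between replicas can be illustrated with a short sketch. It assumes each replica's table can be read as a list of tuples that sort consistently; the function names and sample rows are hypothetical, and production systems usually checksum in chunks inside the database rather than in application code.

```python
import hashlib


def table_checksum(rows):
    """Order-independent SHA-256 checksum of a list of row tuples."""
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so that row order does not matter
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def replicas_in_sync(primary_rows, replica_rows):
    """True if both replicas hold exactly the same rows."""
    return table_checksum(primary_rows) == table_checksum(replica_rows)


primary = [(1, "Ada", "Paris"), (2, "Bob", "Oslo")]
same_data = [(2, "Bob", "Oslo"), (1, "Ada", "Paris")]   # reordered only
drifted = [(1, "Ada", "Paris"), (2, "Bob", "Bergen")]   # one field differs

print(replicas_in_sync(primary, same_data))  # True
print(replicas_in_sync(primary, drifted))    # False
```

A mismatch only shows that the replicas differ; locating the differing rows requires a finer-grained comparison, for example per-chunk checksums.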
|
||||
== Examples ==
|
||||
The behavior of the customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted and a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target set less accurately: confounding variables may have emerged that cannot be accounted for, which causes the model accuracy to decrease progressively with time. Generally, it is advised to perform health checks as part of the post-production analysis and to re-train the model with new assumptions upon signs of concept drift.
|
||||
|
||||
== Possible remedies ==
|
||||
To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test or control charts from statistical process control, to explicitly detect concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy. A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples, and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.
|
||||
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated by adding information about the season to the model. By providing information about the time of the year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
|
||||
Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.
|
||||
27
data/en.wikipedia.org/wiki/Concept_drift-1.md
Normal file
@ -0,0 +1,27 @@
|
||||
---
|
||||
title: "Concept drift"
|
||||
chunk: 2/4
|
||||
source: "https://en.wikipedia.org/wiki/Concept_drift"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:40.159460+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Remedy methods ===
|
||||
DDM (Drift Detection Method): detects drift by monitoring the model's error rate over time. When the error rate passes a set threshold, it enters a warning phase, and if it passes another threshold, it enters a drift phase.
|
||||
EDDM (Early Drift Detection Method): improves DDM's detection rate by tracking the average distance between two errors instead of only the error rate.
|
||||
ADWIN (Adaptive Windowing): dynamically stores a window of recent data and warns the user if it detects a significant change between the statistics of the window's earlier data compared to more recent data.
|
||||
KSWIN (Kolmogorov–Smirnov Windowing): detects drift based on the Kolmogorov–Smirnov statistical test.
|
||||
DDM and EDDM: Concept Drift Detection
|
||||
|
||||
Online supervised methods that rely on sequential error monitoring to estimate the evolving error rate.
|
||||
ADWIN and KSWIN: Windowing
|
||||
|
||||
Methods that maintain a "window", a subset of the most recent data from the data stream, and check for statistical differences across that window.
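The error-monitoring idea behind DDM can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the class name `SimpleDDM`, the 30-sample warm-up, and the warning and drift levels of 2 and 3 standard deviations above the best observed error rate are assumptions chosen for the example (they follow the commonly described form of the method).

```python
import math


class SimpleDDM:
    """Illustrative DDM-style monitor for a stream of 0/1 prediction errors."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_samples=30):
        # Assumed defaults: thresholds expressed in standard deviations.
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        """Forget all statistics, e.g. after the model has been retrained."""
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed one outcome (1 = wrong, 0 = correct) and return the state."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n                 # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)    # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) error level seen so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Hypothetical stream: 10% errors for 200 steps, then 50% errors.
stream = [1 if i % 10 == 9 else 0 for i in range(200)]
stream += [1 if i % 2 == 0 else 0 for i in range(100)]
monitor = SimpleDDM()
states = [monitor.update(e) for e in stream]
```

In this sketch the monitor stays in the stable state while the error rate is steady, passes through the warning state shortly after the error rate rises, and then reports drift, at which point a real system would typically retrain or replace the model.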
|
||||
|
||||
== Applications in security ==
|
||||
Concept drift is a recurring issue in security analytics, especially in malware and intrusion detection. In these systems, models are often trained on past logs, binaries or network traces, but the behaviour of attackers changes over time as new malware families, obfuscation techniques and campaigns appear. When the data no longer resemble the training set, the decision boundaries learned by classifiers or anomaly detectors can become misaligned with the current threat landscape and detection performance can drop unless the models are updated or replaced.
|
||||
Several studies of Windows malware treat detection as an evolving data stream problem and track how performance changes over time. They show that classifiers trained on a fixed time window can perform well on nearby data but deteriorate quickly when evaluated on samples collected months or years later, even when large amounts of training data are available. To keep up with this, security systems often use sliding or adaptive windows, which restrict training to the most recent portion of the data so that older, less relevant examples are gradually discarded. They also employ drift detectors such as ADWIN and KSWIN, which monitor error rates or changes in the distribution of recent observations and signal when the statistics of the incoming stream differ significantly from the past, prompting retraining or model replacement.
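A windowed distribution check of this kind can be sketched as follows. The code is a simplified illustration in the spirit of Kolmogorov–Smirnov windowing, not the published KSWIN algorithm: it compares the older half of a sliding window with the more recent half, and the class name `WindowedKSDetector`, the window size, the significance level `alpha`, and the large-sample critical value are all assumptions made for the example.

```python
import math
from collections import deque


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


class WindowedKSDetector:
    """Flags drift when the two halves of a sliding window differ."""

    def __init__(self, window_size=100, alpha=0.005):
        self.window = deque(maxlen=window_size)
        self.half = window_size // 2
        # Assumed large-sample critical value for two samples of equal size.
        c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
        self.threshold = c_alpha * math.sqrt(2.0 / self.half)

    def update(self, value):
        """Add one observation; return True if drift is signalled."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False
        data = list(self.window)
        older, recent = data[: self.half], data[-self.half:]
        if ks_statistic(older, recent) > self.threshold:
            self.window.clear()  # start over with post-drift data only
            return True
        return False


# Hypothetical stream whose values shift upward after 200 observations.
stream = [(i % 10) / 10 for i in range(200)]
stream += [5 + (i % 10) / 10 for i in range(100)]
detector = WindowedKSDetector()
detections = [t for t, v in enumerate(stream) if detector.update(v)]
```

Clearing the window after a detection is one possible design choice; it avoids repeated alarms caused by pre-drift data that would otherwise remain in the older half of the window.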
|
||||
Related problems appear in spam filtering, fraud detection and intrusion detection, where adversaries change content, patterns of activity or network behavior to evade models trained on historical data. In these settings drift can be gradual, as new types of spam or fraud emerge, or abrupt, after a sudden shift in attack techniques. Common strategies to remain effective include updating models with recent labelled examples, using ensembles that give more weight to classifiers trained on newer data, and designing features that are less sensitive to superficial changes in how attacks are carried out.
|
||||
Research on machine learning for security has also shown that mishandling concept drift during evaluation can introduce substantial bias. A study of 30 learning-based security systems found that many detectors were tested only over short time periods or only in laboratory settings. Such evaluation ignores temporal correlations, non-stationarity, and the way attacks change in the real world, which can make these systems appear much more effective than they are once deployed. The authors identify several forms of temporal snooping, such as using features from newer malware samples when training models that are supposed to work on older data, or calculating normalization statistics or embeddings on the whole dataset. They also discuss the use of outdated benchmark datasets that do not prepare models for current threats. The researchers recommend time-aware evaluation protocols that preserve causal ordering, prevent future information from leaking into training, and account for both the temporal and spatial characteristics of security data, so that reported performance better reflects the effects of concept drift in real-world situations.
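The two safeguards mentioned above, preserving causal ordering and computing normalization statistics on training data only, can be illustrated with a short sketch. The record layout `(timestamp, features, label)`, the function names, and the sample values are hypothetical and serve only to show the idea.

```python
from datetime import date


def temporal_split(samples, cutoff):
    """Split records so that every training sample precedes the cutoff."""
    train = [s for s in samples if s[0] < cutoff]
    test = [s for s in samples if s[0] >= cutoff]
    return train, test


def fit_standardizer(rows):
    """Per-feature mean and standard deviation from training rows only."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # avoid dividing by zero
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Hypothetical records: (collection date, feature vector, label).
samples = [
    (date(2023, 1, 1), [1.0, 10.0], 0),
    (date(2023, 6, 1), [3.0, 10.0], 1),
    (date(2024, 2, 1), [5.0, 40.0], 1),
]
train, test = temporal_split(samples, cutoff=date(2024, 1, 1))
means, stds = fit_standardizer([s[1] for s in train])
test_scaled = apply_standardizer([s[1] for s in test], means, stds)
```

Fitting the standardizer on the whole dataset instead would let statistics of future samples influence the training features, which is one of the leakage paths described above.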
|
||||
36
data/en.wikipedia.org/wiki/Concept_drift-2.md
Normal file
@ -0,0 +1,36 @@
|
||||
---
|
||||
title: "Concept drift"
|
||||
chunk: 3/4
|
||||
source: "https://en.wikipedia.org/wiki/Concept_drift"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:40.159460+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Concept Drift, Concept Evolution, and Adversarial Manipulation ===
|
||||
Concept drift occurs when the relationship between input features and their corresponding labels changes gradually over time. Consider a simple example: phishing emails. Older attacks would rely on prominent signals, such as the presence of the word "lottery", while newer variants introduce more polished phrasing, such as "lottery and not scam", in an attempt to appear more legitimate and evade detection. This is a form of concept drift because the underlying class (phishing) remains the same, but the feature patterns that define it shift as attackers adapt. It is important, though, to distinguish concept drift from a related but distinct challenge: concept evolution. Concept evolution occurs when an entirely new attack family emerges with feature patterns the model has never encountered, so the model is exposed to features and patterns that were not present in the training data. In the multi-class setting, this means that a completely new label is introduced. In the binary setting (benign vs. malicious), a completely new malicious family is introduced, but the model still tries to make predictions based on outdated feature patterns. For example, a model may be trained on trojan and ransomware features, after which crypto-miner malware emerges: the label is still "malicious", but the features are new to the model. Concept drift and concept evolution both reflect how quickly attackers adapt and evolve their techniques to exploit weaknesses in cyber-defense models. Attackers regularly update their payloads, communication methods, and even the APIs they depend on. As a result, a model trained only on historical data becomes outdated unless it is updated or designed to learn continuously.
|
||||
These behavior changes do not always happen organically; attackers can introduce them intentionally. Adversarial attacks are a primary cause of concept drift, because slight, carefully crafted changes to malware affect the input-label relationship and lead models to misclassify malware as benign. Adversarial attacks generally fall into two settings: white-box and black-box. In the white-box setting, the attacker has full access to the model, including its architecture, parameters, and gradients. For example, an attacker can use gradient-based approaches to build slight perturbations to a sample that cause it to be misclassified. At the other extreme are black-box attacks, in which the adversary knows the inputs and outputs of the model but not its inner workings. Attackers can probe the model's outputs, sometimes sending thousands of slightly varied inputs, until something slips past the detector, or they can train a substitute model and then generate adversarial examples that transfer to the original system. These tactics gradually shift the patterns a model observes, accelerating drift and making previously reliable defenses far less effective.
|
||||
These attacker-driven manipulations not only cause concept drift but also reveal the weaknesses of modern learning techniques such as transfer learning. Transfer learning enables defenders to take an existing model, pre-trained for some other task, and fine-tune it rather than training from scratch. This has considerably increased success across diverse domains such as image classification and natural language processing. For malware detection, for instance, Microsoft and Intel recently demonstrated that converting malware binaries into grayscale images enables malware detection using pre-trained vision models, yielding strong performance (for example, 87% recall with only 0.1% false positives). While such an approach drastically reduces training time and computational cost, it has one important drawback: the base model architecture is often publicly known. An adversary may use this fact to create adversarial examples designed to exploit the known weaknesses of the underlying model, and those manipulations can then "transfer" to the newly trained malware detector. Transfer learning thus offers the significant advantage of reusing prior work while also carrying the risk of inheriting vulnerabilities that attackers can readily exploit.
|
||||
In addition to rapidly evolving attacks, ML-based defenses face the known difficulty of acquiring accurate data and labels. Label aggregators may struggle to label new samples, and the initial labels can be incorrect. As time passes, labels typically stabilize toward their ground truth, a process that can take anywhere from days to years. This phenomenon, called delayed labels, contributes to concept drift because during this time the original labels may be updated, new attacks may be produced, or new classes may appear. Delayed labels are therefore often taken into account when building and evaluating defense solutions. To detect drift in the mapping between samples and labels, drift detectors are commonly used during evaluation.
|
||||
|
||||
== See also ==
|
||||
Data stream mining
|
||||
Data mining
|
||||
Snyk, a company whose portfolio includes drift detection in software applications
|
||||
|
||||
== Further reading ==
|
||||
Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys and overviews are listed here:
|
||||
|
||||
=== Reviews ===
|
||||
|
||||
== External links ==
|
||||
|
||||
=== Software ===
|
||||
Frouros: An open-source Python library for drift detection in machine learning systems.
|
||||
NannyML: An open-source Python library for detecting univariate and multivariate distribution drift and estimating machine learning model performance without ground truth labels.
|
||||
RapidMiner: formerly Yet Another Learning Environment (YALE), free open-source software for knowledge discovery, data mining, and machine learning, which also features data stream mining, learning time-varying concepts, and tracking drifting concepts. It is used in combination with its data stream mining plugin (formerly the concept drift plugin).
|
||||
EDDM (Early Drift Detection Method): free open-source implementation of drift detection methods in Weka.
|
||||
MOA (Massive Online Analysis): free open-source software specifically for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader for real datasets in ARFF format, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with Weka.
|
||||
|
||||
=== Datasets ===
|
||||
70
data/en.wikipedia.org/wiki/Concept_drift-3.md
Normal file
@ -0,0 +1,70 @@
|
||||
---
|
||||
title: "Concept drift"
|
||||
chunk: 4/4
|
||||
source: "https://en.wikipedia.org/wiki/Concept_drift"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:40.159460+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
==== Real ====
|
||||
USP Data Stream Repository, 27 real-world stream datasets with concept drift compiled by Souza et al. (2020). Access
|
||||
Airline, approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E. Ikonomovska. Reference: Data Expo 2009 Competition [1]. Access
|
||||
Chess.com (online games) and Luxembourg (social survey) datasets compiled by I. Zliobaite. Access
|
||||
ECUE spam, 2 datasets each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual. Access from S.J.Delany webpage
|
||||
Elec2, electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of South Wales, 1999. Access from J.Gama webpage. Comment on applicability.
|
||||
PAKDD'09 competition data represents the credit evaluation task. It is collected over a five-year period. Unfortunately, the true labels are released only for the first part of the data. Access
|
||||
Sensor stream and Power supply stream datasets are available from X. Zhu's Stream Data Mining Repository. Access
|
||||
SMEAR is a benchmark data stream with a lot of missing values. Environment observation data over 7 years. Predict cloudiness. Access
|
||||
Text mining, a collection of text mining datasets with concept drift, maintained by I. Katakis. Access
|
||||
Gas Sensor Array Drift Dataset, a collection of 13,910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations. Access
|
||||
|
||||
==== Other ====
|
||||
KDD'99 competition data contains simulated intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift. Access
|
||||
|
||||
==== Synthetic ====
|
||||
Extreme verification latency benchmark Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A. (2015). "Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency". Proceedings of the 2015 SIAM International Conference on Data Mining (SDM). SIAM. pp. 873–881. doi:10.1137/1.9781611974010.98. ISBN 978-1-61197-401-0. S2CID 19198944. Access from Nonstationary Environments – Archive.
|
||||
Sine, Line, Plane, Circle and Boolean Data Sets Minku, L.L.; White, A.P.; Yao, X. (2010). "The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift" (PDF). IEEE Transactions on Knowledge and Data Engineering. 22 (5): 730–742. Bibcode:2010ITKDE..22..730M. doi:10.1109/TKDE.2009.156. S2CID 16592739. Access from L.Minku webpage.
|
||||
SEA concepts Street, N.W.; Kim, Y. (2001). "A streaming ensemble algorithm (SEA) for large-scale classification" (PDF). KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 377–382. doi:10.1145/502512.502568. ISBN 978-1-58113-391-2. S2CID 11868540. Access from J.Gama webpage.
|
||||
STAGGER Schlimmer, J.C.; Granger, R.H. (1986). "Incremental Learning from Noisy Data". Mach. Learn. 1 (3): 317–354. doi:10.1007/BF00116895. S2CID 33776987.
|
||||
Mixed Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. (2004). "Learning with drift detection". Brazilian symposium on artificial intelligence. Springer. pp. 286–295. doi:10.1007/978-3-540-28645-5_29. ISBN 978-3-540-28645-5. S2CID 2606652.
|
||||
|
||||
==== Data generation frameworks ====
|
||||
Minku, White & Yao 2010 Download from L.Minku webpage.
|
||||
Lindstrom, P.; Delany, S.J.; MacNamee, B. (2008). "Autopilot: Simulating Changing Concepts in Real Data" (PDF). Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science. pp. 272–263.
|
||||
Narasimhamurthy, A.; Kuncheva, L.I. (2007). "A framework for generating data to simulate changing environments". AIAP'07: Proceedings of the 25th IASTED International Multi-Conference: artificial intelligence and applications. pp. 384–389. Code
|
||||
|
||||
=== Projects ===
|
||||
INFER: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
|
||||
HaCDAIS: Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands)
|
||||
KDUS: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
|
||||
ADEPT: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
|
||||
ALADDIN: autonomous learning agents for decentralised data and information networks (2005–2010)
|
||||
GAENARI: C++ incremental decision tree algorithm that minimizes the damage caused by concept drift (2022)
|
||||
|
||||
=== Benchmarks ===
|
||||
NAB: The Numenta Anomaly Benchmark, benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. (2014–2018)
|
||||
|
||||
=== Meetings ===
|
||||
2014
|
||||
Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments" @IEEE IJCNN 2014
|
||||
2013
|
||||
RealStream Real-World Challenges for Data Stream Mining Workshop-Discussion at the ECML PKDD 2013, Prague, Czech Republic.
|
||||
LEAPS 2013 The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments
|
||||
2011
|
||||
LEE 2011 Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11
|
||||
HaCDAIS 2011 The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems
|
||||
ICAIS 2011 Track on Incremental Learning
|
||||
IJCNN 2011 Special Session on Concept Drift and Learning Dynamic Environments
|
||||
CIDUE 2011 Symposium on Computational Intelligence in Dynamic and Uncertain Environments
|
||||
2010
|
||||
HaCDAIS 2010 International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions
|
||||
ICMLA10 Special Session on Dynamic learning in non-stationary environments
|
||||
SAC 2010 Data Streams Track at ACM Symposium on Applied Computing
|
||||
SensorKDD 2010 International Workshop on Knowledge Discovery from Sensor Data
|
||||
StreamKDD 2010 Novel Data Stream Pattern Mining Techniques
|
||||
Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence
|
||||
MLMDS'2010 Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Design and Applications, ISDA'10
|
||||
|
||||
== References ==
|
||||
26
data/en.wikipedia.org/wiki/Continuous_analytics-0.md
Normal file
@ -0,0 +1,26 @@
|
||||
---
|
||||
title: "Continuous analytics"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Continuous_analytics"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:41.378945+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Continuous analytics is a data science process that abandons ETLs and complex batch data pipelines in favor of cloud-native and microservices paradigms. Continuous data processing enables real time interactions and immediate insights with fewer resources.
|
||||
|
||||
|
||||
== Defined ==
|
||||
Analytics is the application of mathematics and statistics to big data. Data scientists write analytics programs to look for solutions to business problems, like forecasting demand or setting an optimal price. The continuous approach runs multiple stateless engines which concurrently enrich, aggregate, infer and act on the data. Data scientists, dashboards and client apps all access the same raw or real-time data derivatives with proper identity-based security, data masking and versioning in real-time.
|
||||
Traditionally, data scientists have not been part of IT development teams in the way that regular Java programmers are. Their skills in math, statistics, and data science have set them apart in their own department, one not normally related to IT. As a result, their approach to writing software code does not enjoy the same efficiencies as that of a traditional programming team. In particular, traditional programming has adopted the continuous delivery approach to writing code and the agile methodology, which release software in a continuous cycle of iterations.
|
||||
Continuous analytics, then, is the extension of the continuous delivery software development model to the big data analytics development team. The goal of the continuous analytics practitioner is to find ways to incorporate writing analytics code and installing big data software into the agile development model, which automatically runs unit and functional tests and builds the environment with automated tools.
|
||||
Making this work means getting data scientists to write their code in the same code repository that regular programmers use, so that software can pull it from there and run it through the build process. It also means saving the configuration of the big data cluster (sets of virtual machines) in a repository as well. That makes it possible to send out analytics code and big data software and objects in the same automated way as the continuous integration process.
|
||||
|
||||
|
||||
== External links ==
|
||||
Continuous analytics
|
||||
Development model
|
||||
|
||||
|
||||
== References ==
|
||||
699
data/en.wikipedia.org/wiki/Cosine_similarity-0.md
Normal file
@ -0,0 +1,699 @@
|
||||
---
|
||||
title: "Cosine similarity"
|
||||
chunk: 1/3
|
||||
source: "https://en.wikipedia.org/wiki/Cosine_similarity"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:42.619881+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval $[-1, +1]$.
|
||||
|
||||
For example, two proportional vectors have a cosine similarity of +1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of −1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in $[0, 1]$.
|
||||
For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents.
|
||||
The technique is also used to measure cohesion within clusters in the field of data mining.
|
||||
One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered.
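The point about sparse vectors can be illustrated with a small sketch for term-count vectors. The function name `sparse_cosine` and the whitespace tokenization are assumptions made for the example; only words that actually occur in a document are stored, and the dot product iterates over the smaller of the two dictionaries.

```python
import math
from collections import Counter


def sparse_cosine(doc_a, doc_b):
    """Cosine similarity of two documents represented as sparse word counts."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    if not counts_a or not counts_b:
        raise ValueError("cosine similarity is undefined for empty documents")
    # Iterate over the smaller vocabulary; absent words contribute zero.
    small, large = sorted((counts_a, counts_b), key=len)
    dot = sum(c * large[w] for w, c in small.items() if w in large)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


# A document and the same document repeated twice point in the same direction.
same_topic = sparse_cosine("cat sat", "cat sat cat sat")
# Documents with no words in common are orthogonal.
unrelated = sparse_cosine("cat sat", "stock market")
```

Here `same_topic` is 1 up to floating-point rounding, which shows the independence from document length, and `unrelated` is exactly 0.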
|
||||
Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to binary data.
|
||||
|
||||
== Definition ==
|
||||
The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

$$\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\| \, \|\mathbf{B}\| \cos\theta$$
|
||||
|
||||
|
||||
Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

$$\text{cosine similarity} = S_C(A, B) := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}},$$

where $A_i$ and $B_i$ are the $i$th components of vectors $\mathbf{A}$ and $\mathbf{B}$, respectively.
|
||||
The resulting similarity ranges from −1 meaning exactly opposite, to +1 meaning exactly the same, with 0 indicating orthogonality (no correlation), while in-between values indicate intermediate similarity or dissimilarity.
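The definition translates directly into code. The following is a minimal sketch for dense vectors given as Python sequences; the function name `cosine_similarity` is chosen for illustration and zero vectors are rejected because the measure is defined only for non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


proportional = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])           # right angle
opposite = cosine_similarity([1, 2], [-1, -2])           # opposite direction
```

The three example values correspond to the cases described above: approximately +1 for proportional vectors, 0 for orthogonal vectors, and approximately −1 for opposite vectors.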
|
||||
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°.
|
||||
If the attribute vectors are normalized by subtracting the vector means (e.g., $A - \bar{A}$), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering,

$$\text{if } A = [A_1, A_2]^T, \text{ then } \bar{A} = \left[\frac{(A_1 + A_2)}{2}, \frac{(A_1 + A_2)}{2}\right]^T,$$

$$\text{so } A - \bar{A} = \left[\frac{(A_1 - A_2)}{2}, \frac{(-A_1 + A_2)}{2}\right]^T.$$
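The equivalence between centered cosine similarity and the Pearson correlation coefficient can be checked numerically. In the sketch below the function names and sample vectors are illustrative; the Pearson coefficient is computed from raw sums so that the two results come from different formulas.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centered_cosine(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def pearson(a, b):
    """Pearson correlation from raw sums (no explicit centering)."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_aa = sum(x * x for x in a)
    sum_bb = sum(y * y for y in b)
    numerator = n * sum_ab - sum_a * sum_b
    denominator = math.sqrt((n * sum_aa - sum_a ** 2) * (n * sum_bb - sum_b ** 2))
    return numerator / denominator


a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]
plain = cosine_similarity(a, b)      # uncentered, about 0.933
centered = centered_cosine(a, b)     # centered
correlation = pearson(a, b)          # agrees with the centered value
```

For these vectors the centered cosine similarity and the Pearson coefficient both come out at 0.6 (up to rounding), while the uncentered cosine similarity is noticeably higher because all components are positive.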
|
||||
|
||||
|
||||
=== Cosine distance ===
|
||||
When the distance between two unit-length vectors is defined to be the length of their vector difference, then

$$\operatorname{dist}(\mathbf{A}, \mathbf{B}) = \sqrt{(\mathbf{A} - \mathbf{B}) \cdot (\mathbf{A} - \mathbf{B})} = \sqrt{\mathbf{A} \cdot \mathbf{A} - 2(\mathbf{A} \cdot \mathbf{B}) + \mathbf{B} \cdot \mathbf{B}} = \sqrt{2(1 - S_C(\mathbf{A}, \mathbf{B}))}\,.$$
|
||||
|
||||
|
||||
Nonetheless, the cosine distance is often defined without the square root or factor of 2:

$$\text{cosine distance} = D_C(A, B) := 1 - S_C(A, B)\,.$$
|
||||
|
||||
|
||||
It is important to note that, by virtue of being proportional to squared Euclidean distance, the cosine distance is not a true distance metric; it does not exhibit the triangle inequality property (or, more formally, the Schwarz inequality) and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to Euclidean distance $\sqrt{2(1 - S_C(A, B))}$ or angular distance $\theta = \arccos(S_C(A, B))$. Alternatively, the triangle inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.
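A small numerical example makes the failure of the triangle inequality concrete. The vectors below are chosen for illustration: B lies halfway (45°) between the perpendicular vectors A and C. The helper names are likewise illustrative.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_distance(a, b):
    """1 - S_C(a, b); not a true metric."""
    return 1.0 - cosine_similarity(a, b)


def euclidean_from_cosine(a, b):
    """sqrt(2 (1 - S_C)): Euclidean distance between the normalized vectors."""
    return math.sqrt(2.0 * cosine_distance(a, b))


def angle(a, b):
    """Angular distance theta = arccos(S_C) in radians."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine_similarity(a, b))))


A, B, C = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

# Cosine distance: the direct path A -> C is longer than the detour via B.
direct = cosine_distance(A, C)                           # 1.0
detour = cosine_distance(A, B) + cosine_distance(B, C)   # about 0.586

# The repaired distances respect the triangle inequality.
euclid_ok = euclidean_from_cosine(A, C) <= (
    euclidean_from_cosine(A, B) + euclidean_from_cosine(B, C)
)
angle_ok = angle(A, C) <= angle(A, B) + angle(B, C) + 1e-12
```

With cosine distance, the direct value exceeds the sum along the detour, which a true metric would not allow; the Euclidean and angular versions do not show this problem for the same three vectors.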
|
||||
|
||||
=== Angular distance and similarity ===
|
||||
The normalized angle, referred to as angular distance, between any two vectors $A$ and $B$ is a formal distance metric and can be calculated from the cosine similarity. The complement of the angular distance metric can then be used to define an angular similarity function bounded between 0 and 1, inclusive.
|
||||
When the vector elements may be positive or negative:
|
||||
593
data/en.wikipedia.org/wiki/Cosine_similarity-1.md
Normal file
@ -0,0 +1,593 @@
|
||||
---
|
||||
title: "Cosine similarity"
|
||||
chunk: 2/3
|
||||
source: "https://en.wikipedia.org/wiki/Cosine_similarity"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:42.619881+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle {\text{angular distance}}=D_{\theta }:={\frac {\arccos({\text{cosine similarity}})}{\pi }}={\frac {\theta }{\pi }}}

{\displaystyle {\text{angular similarity}}=S_{\theta }:=1-{\text{angular distance}}=1-{\frac {\theta }{\pi }}}
|
||||
|
||||
|
||||
Or, if the vector elements are always positive:
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle {\text{angular distance}}=D_{\theta }:={\frac {2\cdot \arccos({\text{cosine similarity}})}{\pi }}={\frac {2\theta }{\pi }}}

{\displaystyle {\text{angular similarity}}=S_{\theta }:=1-{\text{angular distance}}=1-{\frac {2\theta }{\pi }}}
|
||||
|
||||
|
||||
Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.
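The two normalizations above can be written as one small function. This is a sketch under stated assumptions: the function names and the `nonnegative` flag are illustrative, and the similarity is clamped to [−1, 1] only to guard against floating-point rounding before calling `math.acos`.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def angular_distance(a, b, nonnegative=False):
    """theta / pi in general, or 2 * theta / pi when all elements are non-negative."""
    s = max(-1.0, min(1.0, cosine_similarity(a, b)))  # clamp rounding error
    theta = math.acos(s)
    scale = 2.0 if nonnegative else 1.0
    return scale * theta / math.pi

def angular_similarity(a, b, nonnegative=False):
    """Complement of the angular distance, bounded between 0 and 1."""
    return 1.0 - angular_distance(a, b, nonnegative)
```

For orthogonal vectors such as [1, 0] and [0, 1], θ = π/2, so the general form gives a distance of 0.5 and the non-negative form gives 1.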
|
||||
|
||||
=== L2-normalized Euclidean distance ===
|
||||
Another effective proxy for cosine distance can be obtained by L2 normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique, each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form of distance is often used in deep learning algorithms.
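A short sketch of this technique follows. The query and candidate vectors are made-up illustrative values, and the helper names are assumptions; the point is only that ranking by Euclidean distance after L2 normalisation matches ranking by cosine distance, with no trigonometric calls.

```python
import math

def l2_normalize(v):
    """Divide each term by the vector magnitude, yielding a unit-length vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Hypothetical data: one query and three candidates.
query = [1.0, 2.0, 3.0]
candidates = [[2.0, 4.0, 6.5], [3.0, 1.0, 0.5], [-1.0, 0.0, 2.0]]

unit_query = l2_normalize(query)
unit_candidates = [l2_normalize(c) for c in candidates]

# Rank candidate indices by each distance.
by_cosine = sorted(range(len(candidates)),
                   key=lambda i: cosine_distance(query, candidates[i]))
by_chord = sorted(range(len(candidates)),
                  key=lambda i: euclidean(unit_query, unit_candidates[i]))
```

Here `by_cosine` and `by_chord` hold the same ordering, and for each candidate the squared Euclidean distance between the unit vectors equals twice the cosine distance, up to rounding.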
|
||||
|
||||
=== Otsuka–Ochiai coefficient ===
|
||||
In biology, there is a similar concept known as the Otsuka–Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka, Japanese: 大塚 弥之助) and Akira Ochiai (Japanese: 落合 明), also known as the Ochiai–Barkman or Ochiai coefficient, which can be represented as:
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle K={\frac {|A\cap B|}{\sqrt {|A|\times |B|}}}}
|
||||
|
||||
|
||||
Here, A and B are sets, and |A| is the number of elements in A. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity. It is identical to the score introduced by Godfrey Thomson.
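The equivalence with cosine similarity on bit vectors can be illustrated as follows. This is a minimal sketch; the two example sets and the helper `to_bits` are hypothetical.

```python
import math

def otsuka_ochiai(a, b):
    """K = |A ∩ B| / sqrt(|A| * |B|) for two non-empty sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def to_bits(s, universe):
    """Membership indicator vector of set s over an ordered universe."""
    return [1 if item in s else 0 for item in universe]

# Hypothetical sets sharing two of their elements.
A = {"a", "b", "c"}
B = {"b", "c", "d", "e"}
universe = sorted(A | B)

k = otsuka_ochiai(A, B)
s = cosine_similarity(to_bits(A, universe), to_bits(B, universe))
```

Both `k` and `s` equal 2/√12: the dot product of the bit vectors counts the shared elements, and each squared norm counts the elements of one set.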
|
||||
In a recent book, the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned) by citing an article by Ikuso Hamai (Japanese: 浜井 生三), who in turn cites the original 1936 article by Yanosuke Otsuka.
|
||||
|
||||
== Properties ==
|
||||
The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant a and vector V, the vectors V and aV are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents.
|
||||
However, more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.
|
||||
|
||||
Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual ‖A − B‖, and observe that
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle \|A-B\|^{2}=(A-B)\cdot (A-B)=\|A\|^{2}+\|B\|^{2}-2(A\cdot B)}
|
||||
|
||||
(polarization identity)
|
||||
by expansion. When A and B are normalized to unit length, ‖A‖² = ‖B‖² = 1, so this expression is equal to
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle 2(1-\cos(A,B)).}
|
||||
|
||||
|
||||
In short, the cosine distance can be expressed in terms of Euclidean distance as
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle D_{C}(A,B)={\frac {\|A-B\|^{2}}{2}}\quad {\text{when}}\quad \|A\|^{2}=\|B\|^{2}=1.}
|
||||
This Euclidean distance is called the chord distance, because it is the length of the chord on the unit circle; it is the Euclidean distance between the vectors after each has been normalized to a unit sum of squared values.
|
||||
Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of 1/n (where n is the number of dimensions), and although the distribution is bounded between −1 and +1, as n grows large the distribution is increasingly well-approximated by the normal distribution. For other types of data, such as bitstreams, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.
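The stated mean and variance can be checked with a small simulation. This sketch assumes that random unit vectors are drawn uniformly from the sphere, which is done here by normalizing vectors of independent standard normal values; the dimension, trial count, and seed are arbitrary illustrative choices.

```python
import math
import random

def random_unit_vector(n, rng):
    """Normalize n independent standard normal draws to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def null_cosine_samples(n, trials, seed=0):
    """Cosine similarities of independent random unit vectors in n dimensions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        a = random_unit_vector(n, rng)
        b = random_unit_vector(n, rng)
        samples.append(sum(x * y for x, y in zip(a, b)))  # unit vectors: dot = cosine
    return samples

n = 50
samples = null_cosine_samples(n=n, trials=2000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

With these settings the sample mean should sit close to zero and the sample variance close to 1/n = 0.02, up to sampling noise.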
|
||||
|
||||
== Triangle inequality for cosine similarity ==
|
||||
The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
||||
|
||||
|
||||
{\displaystyle |~\angle {AC}-\angle {CB}~|\leq ~\angle {AB}~\leq ~\angle {AC}~+~\angle {CB}~.}
|
||||
|
||||
|
||||
Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:
|
||||
409
data/en.wikipedia.org/wiki/Cosine_similarity-2.md
Normal file
@ -0,0 +1,409 @@
|
||||
---
|
||||
title: "Cosine similarity"
|
||||
chunk: 3/3
|
||||
source: "https://en.wikipedia.org/wiki/Cosine_similarity"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:42.619881+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle \cos(\angle {AC}-\angle {CB})\geq \cos(\angle {AB})\geq \cos(\angle {AC}+\angle {CB}).}
|
||||
|
||||
|
||||
Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle \cos(A,C)\cdot \cos(C,B)+{\sqrt {\left(1-\cos(A,C)^{2}\right)\cdot \left(1-\cos(C,B)^{2}\right)}}\geq \cos(A,B),}

{\displaystyle \cos(A,B)\geq \cos(A,C)\cdot \cos(C,B)-{\sqrt {\left(1-\cos(A,C)^{2}\right)\cdot \left(1-\cos(C,B)^{2}\right)}}.}
|
||||
|
||||
|
||||
This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C are already known. This is used, for example, in metric data indexing, but has also been used to accelerate spherical k-means clustering in the same way that the Euclidean triangle inequality has been used to accelerate regular k-means.
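The two inequalities translate directly into a bounds function. The following is a sketch; the function name `cosine_bounds` and the example vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_bounds(sim_ac, sim_cb):
    """Lower and upper bound on cos(A, B) given cos(A, C) and cos(C, B)."""
    product = sim_ac * sim_cb
    # max(0, ...) guards against tiny negative values caused by rounding.
    root = math.sqrt(max(0.0, (1.0 - sim_ac ** 2) * (1.0 - sim_cb ** 2)))
    return product - root, product + root

# Hypothetical vectors; C serves as the reference object.
A = [1.0, 0.0, 0.0]
B = [0.0, 1.0, 0.0]
C = [1.0, 1.0, 0.0]

lower, upper = cosine_bounds(cosine_similarity(A, C), cosine_similarity(C, B))
actual = cosine_similarity(A, B)
```

In an index, a candidate B can be skipped without computing `actual` whenever `upper` is already below the similarity threshold of interest.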
|
||||
|
||||
== Soft cosine measure ==
|
||||
A soft cosine (or "soft" similarity) between two vectors considers similarities between pairs of features. The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in the VSM, which helps generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.
|
||||
For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams can be quite similar, though formally they are considered as different features in the VSM. For example, words "play" and "game" are different words and thus mapped to different points in VSM; yet they are semantically related. In case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).
|
||||
For calculating soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix.
|
||||
Given two N-dimension vectors a and b, the soft cosine similarity is calculated as follows:
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
{\displaystyle \operatorname {soft\_cosine} _{1}(a,b)={\frac {\sum \nolimits _{i,j}^{N}s_{ij}a_{i}b_{j}}{{\sqrt {\sum \nolimits _{i,j}^{N}s_{ij}a_{i}a_{j}}}{\sqrt {\sum \nolimits _{i,j}^{N}s_{ij}b_{i}b_{j}}}}},}
|
||||
|
||||
|
||||
where s_ij = similarity(feature_i, feature_j).
|
||||
If there is no similarity between features (s_ii = 1, s_ij = 0 for i ≠ j), the given equation is equivalent to the conventional cosine similarity formula.
|
||||
The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic. An efficient implementation of such soft cosine similarity is included in the Gensim open source library.
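The formula can be sketched directly as a quadratic-time double sum. This is an illustrative implementation, not the Gensim one; the two-feature vocabulary ("play", "game") and the similarity value 0.5 are hypothetical.

```python
import math

def soft_cosine(a, b, s):
    """Soft cosine of vectors a and b under feature-similarity matrix s."""
    def bilinear(u, v):
        # sum over all i, j of s[i][j] * u[i] * v[j]
        return sum(s[i][j] * u[i] * v[j]
                   for i in range(len(u)) for j in range(len(v)))
    return bilinear(a, b) / (math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b)))

# Hypothetical features: index 0 = "play", index 1 = "game".
doc_a = [1.0, 0.0]  # mentions only "play"
doc_b = [0.0, 1.0]  # mentions only "game"

identity = [[1.0, 0.0], [0.0, 1.0]]  # features treated as unrelated
related = [[1.0, 0.5], [0.5, 1.0]]   # assumed similarity of 0.5 between the two words

hard = soft_cosine(doc_a, doc_b, identity)
soft = soft_cosine(doc_a, doc_b, related)
```

With the identity matrix the result reduces to the ordinary cosine similarity (0 here, since the documents share no words), while the assumed 0.5 feature similarity yields a soft cosine of 0.5.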
|
||||
|
||||
== See also ==
|
||||
Sørensen–Dice coefficient
|
||||
Hamming distance
|
||||
Correlation
|
||||
Jaccard index
|
||||
SimRank
|
||||
Information retrieval
|
||||
|
||||
== References ==
|
||||
|
||||
== External links ==
|
||||
Weighted cosine measure
|
||||
A tutorial on cosine similarity using Python
|
||||
@ -0,0 +1,42 @@
|
||||
---
|
||||
title: "Cross-industry standard process for data mining"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:43.884915+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.
|
||||
In 2015, IBM released a new methodology called Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM), which refines and extends CRISP-DM.
|
||||
|
||||
|
||||
== History ==
|
||||
CRISP-DM was conceived in 1996 and became a European Union project under the ESPRIT funding initiative in 1997. The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation, and OHRA, an insurance company.
|
||||
This core consortium brought different experiences to the project. ISL was later acquired and merged into SPSS. The computer giant NCR Corporation produced the Teradata data warehouse and its own data mining software. Daimler-Benz had a significant data mining team. OHRA was starting to explore the potential use of data mining.
|
||||
The first version of the methodology was presented at the 4th CRISP-DM SIG Workshop in Brussels in March 1999, and published as a step-by-step data mining guide later that year.
|
||||
Between 2006 and 2008, a CRISP-DM 2.0 SIG was formed, and there were discussions about updating the CRISP-DM process model. The current status of these efforts is not known. However, the original crisp-dm.org website cited in the reviews and the CRISP-DM 2.0 SIG website are both no longer active.
|
||||
While many non-IBM data mining practitioners use CRISP-DM, IBM is the primary corporation that currently uses the CRISP-DM process model. It makes some of the old CRISP-DM documents available for download and it has incorporated it into its SPSS Modeler product.
|
||||
Based on current research, CRISP-DM is the most widely used form of data-mining model because of its various advantages, which solved existing problems in the data mining industry. One drawback of this model is that it does not cover project management activities. The success of CRISP-DM is largely attributable to the fact that it is industry, tool, and application neutral.
|
||||
|
||||
|
||||
== Major phases ==
|
||||
|
||||
CRISP-DM breaks the process of data mining into six major phases:
|
||||
|
||||
Business Understanding
|
||||
Data Understanding
|
||||
Data Preparation
|
||||
Modeling
|
||||
Evaluation
|
||||
Deployment
|
||||
The sequence of the phases is not strict and moving back and forth between different phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.
|
||||
|
||||
|
||||
== Polls and Alternative Process Frameworks ==
|
||||
Polls conducted at the same website (KDNuggets) in 2002, 2004, 2007, and 2014 show that it was the leading methodology used by industry data miners who decided to respond to the survey. The only other data mining approach named in these polls was SEMMA. However, SAS Institute clearly states that SEMMA is not a data mining methodology, but rather a "logical organization of the functional toolset of SAS Enterprise Miner." A review and critique of data mining process models in 2009 called CRISP-DM the "de facto standard for developing data mining and knowledge discovery projects." Other reviews of CRISP-DM and data mining process models include Kurgan and Musilek's 2006 review, and Azevedo and Santos' 2008 comparison of CRISP-DM and SEMMA. Efforts to update the methodology started in 2006 but, as of June 2015, had not led to a new version, and the "Special Interest Group" (SIG) responsible, along with its website, has long disappeared (see History of CRISP-DM).
|
||||
In 2024, Harvard Business Review published an updated framework, bizML, that is designed for greater relevance to business personnel and to be specific for machine learning projects in particular, rather than for analytics, data science, or data mining projects in general.
|
||||
|
||||
|
||||
== References ==
|
||||
32
data/en.wikipedia.org/wiki/DOACROSS_parallelism-0.md
Normal file
@ -0,0 +1,32 @@
|
||||
---
|
||||
title: "DOACROSS parallelism"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/DOACROSS_parallelism"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:04.174776+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
DOACROSS parallelism is a parallelization technique used to perform loop-level parallelism by utilizing synchronisation primitives between statements in a loop. This technique is used when a loop cannot be fully parallelized by DOALL parallelism due to data dependencies between loop iterations, typically loop-carried dependencies. The sections of the loop which contain a loop-carried dependence are synchronized, while each section is treated as a parallel task on its own. Therefore, DOACROSS parallelism can be used to complement DOALL parallelism to reduce loop execution times.
|
||||
|
||||
|
||||
== Description ==
|
||||
DOACROSS parallelism is particularly useful when one statement depends on the values generated by another statement. In such a loop, DOALL parallelism cannot be implemented in a straightforward manner. If the first statement blocks the execution of the second statement until the required value has been produced, then the two statements can execute independently of each other; that is, each of the aforementioned statements can be parallelized for simultaneous execution using DOALL parallelism.
|
||||
The following pseudocode illustrates the operation of DOACROSS parallelism in such a situation.
|
||||
|
||||
|
||||
=== Example ===
|
||||
In this example, each iteration of the loop requires the value written into a by the previous iteration. However, the entire statement is not dependent on the previous iteration, but only a portion of it. The statement is split into two blocks to illustrate this. The first statement now has no loop-carried dependence, and its result is stored in the variable temp. The post() command is used to signal that the required result has been produced for utilization by other threads. The wait(i-2) command waits for the value a[i-2] before unblocking. The execution time of DOACROSS parallelism largely depends on what fraction of the program suffers from loop-carried dependence. Larger gains are observed when a sizable portion of the loop is affected by loop-carried dependence.
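The post/wait pattern described here can be sketched with Python threads. Everything concrete in this sketch is an assumption rather than the article's own pseudocode: the loop body `a[i] = a[i-2] + b[i] + 1` is hypothetical, chosen so that the wait matches the wait(i-2) mentioned above, and post/wait are modelled with one `threading.Event` per iteration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def sequential(b, a0, a1):
    """Reference loop with a loop-carried dependence on a[i - 2] (assumed body)."""
    a = [a0, a1] + [0] * (len(b) - 2)
    for i in range(2, len(b)):
        a[i] = a[i - 2] + b[i] + 1
    return a

def doacross(b, a0, a1, workers=4):
    """Same loop, one task per iteration, synchronized with post/wait events."""
    n = len(b)
    a = [a0, a1] + [0] * (n - 2)
    ready = [threading.Event() for _ in range(n)]
    ready[0].set()  # post(0): a[0] is available
    ready[1].set()  # post(1): a[1] is available

    def iteration(i):
        temp = b[i] + 1          # independent part, may run in parallel
        ready[i - 2].wait()      # wait(i-2): block until a[i-2] has been written
        a[i] = a[i - 2] + temp   # dependent part
        ready[i].set()           # post(i): a[i] is now available

    # Tasks are submitted in iteration order, so every task only waits on
    # iterations that were started before it.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(iteration, range(2, n)))
    return a

b = list(range(30))
result = doacross(b, 0, 1)
```

In CPython the global interpreter lock limits the actual speed-up for this arithmetic, so the sketch illustrates the synchronization structure rather than the performance gain.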
|
||||
|
||||
|
||||
== Drawbacks ==
|
||||
DOACROSS parallelism suffers from significant space and granularity overheads due to the synchronization primitives used. Modern day compilers often overlook this method because of this major disadvantage. The overheads may be reduced by reducing the frequency of synchronization across the loop, by applying the primitives for groups of statements at a time.
|
||||
|
||||
|
||||
== See also ==
|
||||
Data parallelism
|
||||
Task parallelism
|
||||
|
||||
|
||||
== References ==
|
||||
20
data/en.wikipedia.org/wiki/DOE_mean_plot-0.md
Normal file
@ -0,0 +1,20 @@
|
||||
---
|
||||
title: "DOE mean plot"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/DOE_mean_plot"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:06.621514+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
In statistics, specifically the analysis of data in the design of experiments, a DOE mean plot is a graphical tool used to analyze data from an experiment. It demonstrates the relative importance of all the factors in an experiment with respect to a chosen location statistic, in this case the mean. Each factor is plotted and all mean levels of the factor (two or more) are connected with straight lines. The plot is meant to complement other analyses, such as the typical analysis of variance.
|
||||
|
||||
|
||||
== Example ==
|
||||
|
||||
Shown below is an example from Box, Hunter, and Hunter (1978). In it, several factors are chosen to determine their effect(s) on bike speed. These factors are: seat height, dynamo engaged (yes/no), angle of the handlebar, chosen gear, raincoat (on/off), breakfast (yes/no), and tire pressure (low/high), each at two levels. The plot shows the mean response level for each variable (vertical) versus the factor (horizontal).
|
||||
To interpret the graph, all that is needed is to determine the ordering of the factors by length of the line connecting the two dots. The longer the line, the more significant the factor. Therefore, this plot clearly demonstrates that factor 4 (chosen gear) is the most important factor in determining bike speed, factor 2 (dynamo engaged or not) is the second most important factor, and so on.
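The quantities behind such a plot can be computed without drawing it. The sketch below uses a made-up two-factor, two-level design with invented response values (not the Box, Hunter, and Hunter data); the function name and data layout are assumptions. For each factor it computes the mean response at each level and the range between those means, which corresponds to the length of the connecting line.

```python
from statistics import mean

def doe_mean_effects(runs, response):
    """Per factor: mean response at each level, and the range of those means."""
    effects = {}
    for factor in runs[0]:
        by_level = {}
        for run, y in zip(runs, response):
            by_level.setdefault(run[factor], []).append(y)
        level_means = {level: mean(ys) for level, ys in by_level.items()}
        spread = max(level_means.values()) - min(level_means.values())
        effects[factor] = (level_means, spread)
    return effects

# Hypothetical 2x2 design; levels coded as -1 (low) and +1 (high).
runs = [
    {"gear": -1, "dynamo": -1},
    {"gear": +1, "dynamo": -1},
    {"gear": -1, "dynamo": +1},
    {"gear": +1, "dynamo": +1},
]
response = [10.0, 20.0, 12.0, 22.0]  # invented response values

effects = doe_mean_effects(runs, response)
ranking = sorted(effects, key=lambda f: effects[f][1], reverse=True)
```

In this invented data set the "gear" means differ by 10 and the "dynamo" means by 2, so "gear" ranks first, mirroring how the longest line in the plot marks the most important factor.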
|
||||
|
||||
|
||||
== References ==
|
||||
37
data/en.wikipedia.org/wiki/Dark_data-0.md
Normal file
@ -0,0 +1,37 @@
|
||||
---
|
||||
title: "Dark data"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Dark_data"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:45.047576+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Dark data is data which is acquired through various computer network operations but not used in any manner to derive insights or for decision making. The ability of an organisation to collect data can exceed the throughput at which it can analyse the data. In some cases the organisation may not even be aware that the data is being collected. IBM estimates that roughly 90 percent of data generated by sensors and analog-to-digital conversions never gets used.
|
||||
In an industrial context, dark data can include information gathered by sensors and telematics.
|
||||
Organizations retain dark data for a multitude of reasons, and it is estimated that most companies are only analyzing 1% of their data. Often it is stored for regulatory compliance and record keeping. Some organizations believe that dark data could be useful to them in the future, once they have acquired better analytic and business intelligence technology to process the information. Because storage is inexpensive, storing data is easy. However, storing and securing the data usually entails greater expenses (or even risk) than the potential return profit.
|
||||
In academic discourse, the term dark data was essentially coined by Bryan P. Heidorn. He uses it to describe research data, especially from the long tail of science (the many, small research projects), which are not or no longer available for research because they disappear in a drawer without adequate data management. Without this, the data become dark, and further reasons for this are e.g. missing metadata annotation, missing data management plans and data curators.
|
||||
|
||||
|
||||
== Analysis ==
|
||||
The term "dark data" very often refers to data that is not amenable to computer processing. For example, a company might have a great deal of data that exists only as scanned page-images. Even the bare text in such documents is not available without something like optical character recognition (OCR), which can vary greatly in accuracy. Even with OCR, the significance of each part of the data is unavailable. An obvious example is whether a capitalized word is a name or not, and if so, whether it represents a person, place, organization, or even a work of art. The same applies to bibliographic and other references, data within tables (that may be labeled quite adequately for humans, but not for processing), and countless assertions represented with the full complexity and ambiguity of human language.
|
||||
A lot of unused data is very valuable and would be used if it could be, but is blocked because it is in formats that are difficult to process, categorise, identify, and analyse. Often the reason that businesses do not use their dark data is the amount of resources it would take and the difficulty of having that data analysed. In other words, the data is "dark" not because it is not used, but because it cannot (feasibly or affordably) be used, given its poor representation.
|
||||
There are many data representations that can make data much more accessible for automation. However, a great deal of information lacks any such identification of information items or relationships; and much more loses it during "downhill" conversion such as saving to page-oriented representations, printing, scanning, or faxing. The journey back "uphill" can be costly.
|
||||
According to Computer Weekly, 60% of organisations believe that their own business intelligence reporting capability is "inadequate" and 65% say that they have "somewhat disorganised content management approaches".
|
||||
|
||||
|
||||
== Relevance ==
|
||||
Useful data may become dark data after it becomes irrelevant, as it is not processed fast enough. This is called "perishable insights" in "live flowing data". For example, if the geolocation of a customer is known to a business, the business can make an offer based on the location; however, if this data is not processed immediately, it may be irrelevant in the future. According to IBM, about 60 percent of data loses its value immediately.
|
||||
|
||||
|
||||
== Storage ==
|
||||
According to the New York Times, 90% of energy used by data centres is wasted. If data was not stored, energy costs could be saved. Furthermore, there are costs associated with the underutilisation of information and thus missed opportunities. According to Datamation, "the storage environments of EMEA organizations consist of 54 percent dark data, 32 percent redundant, obsolete and trivial data and 14 percent business-critical data. By 2020, this can add up to $891 billion in storage and management costs that can otherwise be avoided."
|
||||
The continuous storage of dark data can put an organisation at risk, especially if this data is sensitive. In the case of a breach, this can result in serious repercussions. These can be financial, legal and can seriously hurt an organisation's reputation. For example, a breach of private records of customers could result in the stealing of sensitive information, which could result in identity theft. Another example could be the breach of the company's own sensitive information, for example relating to research and development. These risks can be mitigated by assessing and auditing whether this data is useful to the organisation, employing strong encryption and security and finally, if it is determined to be discarded, then it should be discarded in a way that it becomes unretrievable.
|
||||
|
||||
|
||||
== Future ==
|
||||
It is generally considered that the more advanced the computing systems built for data analysis become, the higher the value of dark data will be. It has been noted that "data and analytics will be the foundation of the modern industrial revolution". This includes data that is currently considered "dark data" because there are not enough resources to process it. All the data being collected can be used in the future to maximise productivity and to help organisations meet consumers' demand. Technology advancements are helping to leverage this dark data affordably. Furthermore, many organisations do not yet realise the value of dark data. For example, healthcare and education organisations deal with large amounts of data that could create a significant "potential to service students and patients in the manner in which the consumer and financial services pursue their target population".
|
||||
|
||||
|
||||
== References ==
|
||||
25
data/en.wikipedia.org/wiki/Data-driven_model-0.md
Normal file
@ -0,0 +1,25 @@
|
||||
---
|
||||
title: "Data-driven model"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data-driven_model"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:55.946685+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data-driven models are a class of computational models that primarily rely on historical data collected throughout a system's or process' lifetime to establish relationships between input, internal, and output variables. Commonly found in numerous articles and publications, data-driven models have evolved from earlier statistical models, overcoming limitations posed by strict assumptions about probability distributions. These models have gained prominence across various fields, particularly in the era of big data, artificial intelligence, and machine learning, where they offer valuable insights and predictions based on the available data.
|
||||
|
||||
|
||||
== Background ==
|
||||
These models have evolved from earlier statistical models, which were based on certain assumptions about probability distributions that often proved to be overly restrictive. The emergence of data-driven models in the 1950s and 1960s coincided with the development of digital computers, advancements in artificial intelligence research, and the introduction of new approaches in non-behavioural modelling, such as pattern recognition and automatic classification.
|
||||
|
||||
|
||||
== Key Concepts ==
|
||||
Data-driven models encompass a wide range of techniques and methodologies that aim to intelligently process and analyse large datasets. Examples include fuzzy logic, fuzzy and rough sets for handling uncertainty, neural networks for approximating functions, global optimization and evolutionary computing, statistical learning theory, and Bayesian methods. These models have found applications in various fields, including economics, customer relations management, financial services, medicine, and the military, among others.
|
||||
Machine learning, a subfield of artificial intelligence, is closely related to data-driven modelling as it also focuses on using historical data to create models that can make predictions and identify patterns. In fact, many data-driven models incorporate machine learning techniques, such as regression, classification, and clustering algorithms, to process and analyse data.
|
||||
In recent years, the concept of data-driven models has gained considerable attention in the field of water resources, with numerous applications, academic courses, and scientific publications using the term as a generalization for models that rely on data rather than physics. This classification has been featured in various publications and has even spurred the development of hybrid models in the past decade. Hybrid models attempt to quantify the degree of physically based information used in hydrological models and determine whether the process of building the model is primarily driven by physics or purely data-based. As a result, data-driven models have become an essential topic of discussion and exploration within water resources management and research.
|
||||
The term "data-driven modelling" (DDM) refers to the overarching paradigm of using historical data in conjunction with advanced computational techniques, including machine learning and artificial intelligence, to create models that can reveal underlying trends and patterns and, in some cases, make predictions. Data-driven models can be built with or without detailed knowledge of the underlying processes governing the system behavior, which makes them particularly useful when such knowledge is missing or fragmented.
|
||||
|
||||
|
||||
== References ==
|
||||
@ -0,0 +1,15 @@
|
||||
---
|
||||
title: "Data-informed decision-making"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data-informed_decision-making"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:57.178692+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data-informed decision-making (DIDM) refers to the collection and analysis of data to guide decisions and improve chances of success. Another form of this process is referred to as data-driven decision-making, "which is defined similarly as making decisions based on hard data as opposed to intuition, observation, or guesswork." DIDM is used in education communities, where data is used with the goal of helping students and improving curricula, among other fields in which data is used to inform decisions. While "data based decision-making" is a more common term, "data-informed decision-making" is the preferred term, since decisions should not be based solely on quantitative data. Data-driven decision-making is commonly used in the context of business growth and entrepreneurship. Many educators have access to some type of a data system for analyzing their students' data. These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system, making key package/display and content decisions) to improve the success of educators' data-informed decision-making. In business, fostering and actively supporting data-driven decision-making in their firm and among their colleagues may be one of the central responsibilities of CIOs (Chief Information Officers) or CDOs (Chief Data Officers).
|
||||
Assessment in higher education is a form of data-driven decision-making aimed at using evidence of what students learn to improve curriculum, student learning, and teaching. Standardized tests, grades, and student work scored by rubrics are forms of student learning outcomes assessment. Formative assessments, specifically, allow educators to use data from student performance more immediately to modify teaching and learning strategies. There are numerous organizations aimed at promoting the assessment of student learning through DIDM, including the National Institute for Learning Outcomes Assessment, the Association for the Assessment of Student Learning in Higher Education, and, to an extent, the Association of American Colleges and Universities.
|
||||
|
||||
|
||||
== References ==
|
||||
33
data/en.wikipedia.org/wiki/Data_access-0.md
Normal file
@ -0,0 +1,33 @@
|
||||
---
|
||||
title: "Data access"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data_access"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:46.267466+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data access is a generic term referring to a process which has both an IT-specific meaning and other connotations involving access rights in a broader legal and/or political sense. In the former it typically refers to software and activities related to storing, retrieving, or acting on data housed in a database or other repository.
|
||||
|
||||
|
||||
== Details ==
|
||||
Two fundamental types of data access exist:
|
||||
|
||||
sequential access (as in magnetic tape, for example)
|
||||
random access (as in indexed media)
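
The difference between the two access types can be illustrated with a small sketch. It assumes a hypothetical store of fixed-size records (one 64-bit integer each) held in an in-memory buffer; the record layout and the function names are illustrative and do not come from any particular system.

```python
import io
import struct

# Hypothetical record layout: one little-endian signed 64-bit integer per record.
RECORD = struct.Struct("<q")


def build_store(values):
    """Pack the values back to back into an in-memory byte buffer."""
    return io.BytesIO(b"".join(RECORD.pack(v) for v in values))


def read_sequential(store, index):
    """Tape-style access: read every record from the start up to the wanted one."""
    store.seek(0)
    value = None
    for _ in range(index + 1):
        (value,) = RECORD.unpack(store.read(RECORD.size))
    return value


def read_random(store, index):
    """Indexed access: jump straight to the record's byte offset."""
    store.seek(index * RECORD.size)
    (value,) = RECORD.unpack(store.read(RECORD.size))
    return value


store = build_store([10, 20, 30, 40])
print(read_sequential(store, 2), read_random(store, 2))
```

Both calls return the same record; the sequential version touches every earlier record first, whereas the random-access version performs a single seek.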
|
||||
Data access crucially involves authorization to access different data repositories. Data access can help distinguish the abilities of administrators and users. For example, administrators may have the ability to remove, edit and add data, while general users may not even have "read" rights if they lack access to particular information.
|
||||
Historically, each repository (including each different database, file system, etc.) might require the use of different methods and languages, and many of these repositories stored their content in different and incompatible formats.
|
||||
Over the years, standardized languages, methods, and formats have developed to serve as interfaces between the often proprietary, and always idiosyncratic, specific languages and methods. Such standards include SQL (1974- ), ODBC (ca 1990- ), JDBC, XQJ, ADO.NET, XML, XQuery, XPath (1999- ), and Web Services.
|
||||
Some of these standards enable translation of data from unstructured (such as HTML or free-text files) to structured (such as XML or SQL).
|
||||
Structures such as connection strings and DBURLs can attempt to standardise methods of connecting to databases.
|
||||
|
||||
|
||||
== See also ==
|
||||
Right of access to personal data
|
||||
Data access object
|
||||
Data access layer
|
||||
|
||||
|
||||
== References ==
|
||||
@ -4,7 +4,7 @@ chunk: 1/5
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:27:21.931767+00:00"
|
||||
date_saved: "2026-05-05T09:53:29.395055+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
@ -4,7 +4,7 @@ chunk: 2/5
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:27:21.931767+00:00"
|
||||
date_saved: "2026-05-05T09:53:29.395055+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
@ -4,7 +4,7 @@ chunk: 3/5
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:27:21.931767+00:00"
|
||||
date_saved: "2026-05-05T09:53:29.395055+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
@ -4,7 +4,7 @@ chunk: 4/5
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:27:21.931767+00:00"
|
||||
date_saved: "2026-05-05T09:53:29.395055+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
@ -4,7 +4,7 @@ chunk: 5/5
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:27:21.931767+00:00"
|
||||
date_saved: "2026-05-05T09:53:29.395055+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
@ -0,0 +1,42 @@
|
||||
---
|
||||
title: "Data analysis for fraud detection"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis_for_fraud_detection"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:47.474137+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Fraud represents a significant problem for governments and businesses, and specialized analysis techniques are required to discover it. These methods include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes.
|
||||
In general, the primary reason to use data analytics techniques to tackle fraud is that many internal control systems have serious weaknesses. For example, the prevailing approach employed by many law enforcement agencies to detect companies involved in potential fraud consists of receiving circumstantial evidence or complaints from whistleblowers. As a result, a large number of fraud cases remain undetected and unprosecuted. To effectively test, detect, validate, correct errors in, and monitor control systems against fraudulent activities, business entities and organizations rely on specialized data analytics techniques such as data mining, data matching, the "sounds like" function, regression analysis, clustering analysis, and gap analysis. Techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence.
|
||||
|
||||
== Statistical techniques ==
|
||||
|
||||
Examples of statistical data analysis techniques are:
|
||||
|
||||
Data preprocessing techniques for detection, validation, error correction, and filling up of missing or incorrect data.
|
||||
Calculation of various statistical parameters such as averages, quantiles, performance metrics, probability distributions, and so on. For example, the averages may include average length of call, average number of calls per month and average delays in bill payment.
|
||||
Models and probability distributions of various business activities either in terms of various parameters or probability distributions.
|
||||
Computing user profiles.
|
||||
Time-series analysis of time-dependent data.
|
||||
Clustering and classification to find patterns and associations among groups of data.
|
||||
Data matching, which compares two sets of collected data. The process can be performed based on algorithms or programmed loops that try to match sets of data against each other or compare complex data types. Data matching is used to remove duplicate records and identify links between two data sets for marketing, security or other uses.
|
||||
The ‘sounds like’ function is used to find values that sound similar. Phonetic similarity is one way to locate possible duplicate values or inconsistent spelling in manually entered data. The function converts the comparison strings to four-character American Soundex codes, which are based on the first letter and the first three consonants after the first letter in each string.
|
||||
Regression analysis allows you to examine the relationship between two or more variables of interest. Regression analysis estimates relationships between independent variables and a dependent variable. This method can be used to help understand and identify relationships among variables and predict actual results.
|
||||
Gap analysis is used to determine whether business requirements are being met and, if not, what steps should be taken to meet them.
|
||||
Matching algorithms to detect anomalies in the behavior of transactions or users as compared to previously known models and profiles. Techniques are also needed to eliminate false alarms, estimate risks, and predict the future behavior of current transactions or users.
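
As an illustration of the ‘sounds like’ technique in the list above, the following is a minimal sketch of American Soundex coding. It follows the commonly published rules (letters with the same code are collapsed, `h` and `w` do not separate equal codes, vowels do), which are slightly more detailed than the one-sentence summary given above; the function name `soundex` is chosen here for illustration.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y carry no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros


# Names that sound alike map to the same code and can be flagged as
# possible duplicates for manual review.
print(soundex("Robert"), soundex("Rupert"))
```

Records whose name fields share a code can then be grouped and inspected by hand, since equal codes indicate only a possible match.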
|
||||
Some forensic accountants specialize in forensic analytics which is the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are data collection, data preparation, data analysis, and reporting. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted or divertible for personal use.
|
||||
|
||||
== Artificial intelligence ==
|
||||
Fraud detection is a knowledge-intensive activity. The main AI techniques used for fraud detection include:
|
||||
|
||||
Data mining to classify, cluster, and segment the data and automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud.
|
||||
Expert systems to encode expertise for detecting fraud in the form of rules.
|
||||
Pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior either automatically (unsupervised) or to match given inputs.
|
||||
Machine learning techniques to automatically identify characteristics of fraud.
|
||||
Neural nets to independently generate classification, clustering, generalization, and forecasting that can then be compared against conclusions raised in internal audits or formal financial documents such as 10-Q.
|
||||
Other techniques such as link analysis, Bayesian networks, decision theory, and sequence matching are also used for fraud detection. A newer technique called the System properties approach has also been employed wherever rank data is available.
|
||||
Statistical analysis of research data is the most comprehensive method for determining if data fraud exists. Data fraud as defined by the Office of Research Integrity (ORI) includes fabrication, falsification and plagiarism.
|
||||
|
||||
== Machine learning and data mining ==
|
||||
@ -0,0 +1,40 @@
|
||||
---
|
||||
title: "Data analysis for fraud detection"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Data_analysis_for_fraud_detection"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:47.474137+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Early data analysis techniques were oriented toward extracting quantitative and statistical data characteristics. These techniques facilitate useful data interpretations and can help to get better insights into the processes behind the data. Although the traditional data analysis techniques can indirectly lead us to knowledge, it is still created by human analysts.
|
||||
To go beyond this, a data analysis system has to be equipped with a substantial amount of background knowledge, and be able to perform reasoning tasks involving that knowledge and the data provided. In an effort to meet this goal, researchers have turned to ideas from the machine learning field. This is a natural source of ideas, since the machine learning task can be described as turning background knowledge and examples (input) into knowledge (output).
|
||||
If data mining results in discovering meaningful patterns, data turns into information. Information or patterns that are novel, valid and potentially useful are not merely information, but knowledge. One speaks of discovering knowledge that was previously hidden in the huge amount of data but is now revealed.
|
||||
Machine learning and artificial intelligence solutions may be classified into two categories: 'supervised' and 'unsupervised' learning. These methods look for accounts, customers, suppliers, etc. that behave 'unusually' in order to output suspicion scores, rules or visual anomalies, depending on the method.
|
||||
Whether supervised or unsupervised methods are used, the output gives only an indication of fraud likelihood. No standalone statistical analysis can guarantee that a particular object is fraudulent, but such analyses can identify fraudulent objects with very high degrees of accuracy. As a result, effective collaboration between machine learning models and human analysts is vital to the success of fraud detection applications.
|
||||
|
||||
=== Supervised learning ===
|
||||
|
||||
In supervised learning, a random sub-sample of all records is taken and manually classified as either 'fraudulent' or 'non-fraudulent' (the task can be decomposed into more classes to meet algorithm requirements). Relatively rare events such as fraud may need to be oversampled to get a big enough sample size. These manually classified records are then used to train a supervised machine learning algorithm. After building a model using this training data, the algorithm should be able to classify new records as either fraudulent or non-fraudulent.
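
The oversampling step can be sketched as follows. This is a minimal illustration that assumes records are dictionaries with a hypothetical `label` field and that simple random duplication of the rare class is acceptable; real systems may use other resampling schemes.

```python
import random


def oversample_minority(records, label_key="label", minority="fraudulent", seed=0):
    """Duplicate randomly chosen minority records until both classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority_rows = [r for r in records if r[label_key] == minority]
    majority_rows = [r for r in records if r[label_key] != minority]
    if not minority_rows or len(minority_rows) >= len(majority_rows):
        return list(records)  # nothing to balance
    shortfall = len(majority_rows) - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(shortfall)]
    return majority_rows + minority_rows + extra


# Hypothetical manually labelled sample: six normal records, one fraudulent.
labelled = [{"amount": a, "label": "non-fraudulent"} for a in (12, 40, 7, 55, 23, 9)]
labelled += [{"amount": 980, "label": "fraudulent"}]

balanced = oversample_minority(labelled)
counts = {}
for row in balanced:
    counts[row["label"]] = counts.get(row["label"], 0) + 1
print(counts)
```

The balanced list would then serve as training data for whichever supervised algorithm is chosen. Oversampling should be applied only to the training portion, so that evaluation still reflects the true class proportions.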
|
||||
Supervised neural networks, fuzzy neural nets, and combinations of neural nets and rules, have been extensively explored and used for detecting fraud in mobile phone networks and financial statement fraud.
|
||||
Bayesian learning neural networks have been implemented for credit card fraud detection, telecommunications fraud, auto claim fraud detection, and medical insurance fraud.
|
||||
Hybrid knowledge/statistical-based systems, where expert knowledge is integrated with statistical power, use a series of data mining techniques for the purpose of detecting cellular clone fraud. Specifically, a rule-learning program to uncover indicators of fraudulent behaviour from a large database of customer transactions is implemented.
|
||||
Cahill et al. (2000) design a fraud signature, based on data of fraudulent calls, to detect telecommunications fraud. To score a call for fraud, its probability under the account signature is compared to its probability under a fraud signature. The fraud signature is updated sequentially, enabling event-driven fraud detection.
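
The comparison of the two probabilities can be sketched as a log-likelihood ratio. The code below is a simplified illustration and not the method of Cahill et al.: it assumes each signature is a plain probability table over a single hypothetical call feature (the destination type), and it uses an exponentially weighted update as one possible way to update a signature sequentially.

```python
import math


def score_call(call, account_sig, fraud_sig, floor=1e-6):
    """Log-likelihood ratio: positive when the call fits the fraud signature better."""
    p_account = account_sig.get(call, floor)  # floor avoids division by zero
    p_fraud = fraud_sig.get(call, floor)
    return math.log(p_fraud / p_account)


def update_signature(signature, call, weight=0.1):
    """Shift probability mass toward the observed call; the total stays at 1."""
    updated = {k: (1 - weight) * p for k, p in signature.items()}
    updated[call] = updated.get(call, 0.0) + weight
    return updated


# Hypothetical signatures over the destination type of a call.
account_sig = {"domestic": 0.9, "international": 0.1}
fraud_sig = {"domestic": 0.3, "international": 0.7}

print(score_call("international", account_sig, fraud_sig))  # positive score
print(score_call("domestic", account_sig, fraud_sig))       # negative score

# After a call is confirmed as fraudulent, fold it into the fraud signature.
fraud_sig = update_signature(fraud_sig, "international")
```

A positive score means the call is more probable under the fraud signature than under the account's own history; where to set the alert threshold is a separate design decision.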
|
||||
Link analysis takes a different approach. It relates known fraudsters to other individuals, using record linkage and social network methods.
|
||||
This type of detection is only able to detect frauds similar to those which have occurred previously and been classified by a human. Detecting a novel type of fraud may require the use of an unsupervised machine learning algorithm.
|
||||
|
||||
=== Unsupervised learning ===
|
||||
|
||||
In contrast, unsupervised methods don't make use of labelled records.
|
||||
Bolton and Hand apply Peer Group Analysis and Break Point Analysis to spending behaviour in credit card accounts. Peer Group Analysis detects individual objects that begin to behave differently from objects to which they had previously been similar. Break Point Analysis, the other tool Bolton and Hand developed for behavioural fraud detection, operates at the account level, unlike Peer Group Analysis. A break point is an observation where anomalous behaviour for a particular account is detected.
|
||||
A combination of unsupervised and supervised methods for credit card fraud detection is presented in Carcillo et al. (2019).
|
||||
|
||||
== Geolocation ==
|
||||
|
||||
== Available datasets ==
|
||||
A major limitation for the validation of existing fraud detection methods is the lack of public datasets. One of the few examples is the Credit Card Fraud Detection dataset made available by the ULB Machine Learning Group.
|
||||
|
||||
== See also ==
|
||||
|
||||
== References ==
|
||||
41
data/en.wikipedia.org/wiki/Data_exploration-0.md
Normal file
@ -0,0 +1,41 @@
|
||||
---
|
||||
title: "Data exploration"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data_exploration"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:48.638463+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. These characteristics can include the size or amount of data, the completeness of the data, the correctness of the data, and possible relationships amongst data elements or files/tables in the data.
|
||||
Data exploration is typically conducted using a combination of automated and manual activities. Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics.
|
||||
This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through the automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data.
|
||||
All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis.
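
A first automated pass of this kind can be sketched as a small profiling function. The example assumes rows are dictionaries, treats `None` and the empty string as missing, and assumes values within a column are mutually comparable; the function name `profile` and the sample rows are hypothetical.

```python
def profile(rows):
    """Return per-column row count, missing count, distinct count, and min/max."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and the empty string as missing values.
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


# Hypothetical sample data with one missing city and one missing age.
rows = [
    {"id": 1, "city": "Oslo", "age": 34},
    {"id": 2, "city": "", "age": 29},
    {"id": 3, "city": "Oslo", "age": None},
]

for column, stats in profile(rows).items():
    print(column, stats)
```

The resulting counts give the analyst a quick view of completeness and value ranges before any manual drill-down.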
|
||||
Once this initial understanding of the data has been gained, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining data quality.
|
||||
Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data; it does not require assumptions to be formulated beforehand.
|
||||
Traditionally, this had been a key area of focus for statisticians, with John Tukey being a key evangelist in the field. Today, data exploration is more widespread and is the focus of data analysts and data scientists; the latter being a relatively new role within enterprises and larger organizations.
|
||||
|
||||
|
||||
== Interactive Data Exploration ==
|
||||
This area of data exploration has become an area of interest in the field of machine learning. This is a relatively new field and is still evolving. At its most basic level, a machine-learning algorithm can be fed a data set and can be used to identify whether a hypothesis is true based on the dataset. Common machine learning algorithms can focus on identifying specific patterns in the data. Common tasks include regression, classification and clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning.
|
||||
By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques.
|
||||
|
||||
|
||||
== Software ==
|
||||
Trifacta – a data preparation and analysis platform
|
||||
Paxata – self-service data preparation software
|
||||
Alteryx – data blending and advanced data analytics software
|
||||
Microsoft Power BI – interactive visualization and data analysis tool
|
||||
OpenRefine – a standalone open source desktop application for data clean-up and data transformation
|
||||
Tableau software – interactive data visualization software
|
||||
|
||||
|
||||
== See also ==
|
||||
Exploratory data analysis
|
||||
Machine learning
|
||||
Data profiling
|
||||
Data visualization
|
||||
|
||||
|
||||
== References ==
|
||||
@ -4,7 +4,7 @@ chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data_farming"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:49:53.929022+00:00"
|
||||
date_saved: "2026-05-05T09:53:49.876847+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
84
data/en.wikipedia.org/wiki/Data_fusion-0.md
Normal file
@ -0,0 +1,84 @@
|
||||
---
|
||||
title: "Data fusion"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Data_fusion"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:51.162435+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.
|
||||
Data fusion processes are often categorized as low, intermediate, or high, depending on the processing stage at which fusion takes place. Low-level data fusion combines several sources of raw data to produce new raw data. The expectation is that fused data is more informative and synthetic than the original inputs.
|
||||
For example, sensor fusion is also known as (multi-sensor) data fusion and is a subset of information fusion.
|
||||
The concept of data fusion has origins in the evolved capacity of humans and animals to incorporate information from multiple senses to improve their ability to survive. For example, a combination of sight, touch, smell, and taste may indicate whether a substance is edible.
|
||||
|
||||
== The JDL/DFIG model ==
|
||||
|
||||
In the mid-1980s, the Joint Directors of Laboratories formed the Data Fusion Subpanel (which later became known as the Data Fusion Group). With the advent of the World Wide Web, data fusion thus included data, sensor, and information fusion. The JDL/DFIG introduced a model of data fusion that divided the various processes. Currently, the levels of the Data Fusion Information Group (DFIG) model are:
|
||||
|
||||
Level 0: Source Preprocessing (or Data Assessment)
|
||||
Level 1: Object Assessment
|
||||
Level 2: Situation Assessment
|
||||
Level 3: Impact Assessment (or Threat Refinement)
|
||||
Level 4: Process Refinement (or Resource Management)
|
||||
Level 5: User Refinement (or Cognitive Refinement)
|
||||
Level 6: Mission Refinement (or Mission Management)
|
||||
Although the JDL Model (Level 1–4) is still in use today, it is often criticized for its implication that the levels necessarily happen in order and also for its lack of adequate representation of the potential for a human-in-the-loop. The DFIG model (Level 0–5) explored the implications of situation awareness, user refinement, and mission management. Despite these shortcomings, the JDL/DFIG models are useful for visualizing the data fusion process, facilitating discussion and common understanding, and important for systems-level information fusion design.
|
||||
|
||||
== Geospatial applications ==
|
||||
In the geospatial (GIS) domain, data fusion is often synonymous with data integration. In these applications, there is often a need to combine diverse data sets into a unified (fused) data set which includes all of the data points and time steps from the input data sets. The fused data set is different from a simple combined superset in that the points in the fused data set contain attributes and metadata which might not have been included for these points in the original data set.
|
||||
A simplified example of this process is one in which data set α is fused with data set β to form the fused data set δ. Data points in set α have spatial coordinates X and Y and attributes A1 and A2. Data points in set β have spatial coordinates X and Y and attributes B1 and B2. The fused data set contains all points and attributes.
|
||||
|
||||
In a simple case where all attributes are uniform across the entire analysis domain, the attributes may be simply assigned: M?, N?, Q?, R? to M, N, Q, R. In a real application, attributes are not uniform and some type of interpolation is usually required to properly assign attributes to the data points in the fused set.
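
The fusion of sets α and β can be sketched in a few lines. This is a minimal illustration that assumes points are dictionaries with `x` and `y` coordinates and that nearest-neighbour assignment is an acceptable stand-in for interpolation; it is not the method of any particular GIS package, and the function name `fuse` is hypothetical.

```python
import math


def fuse(alpha, beta):
    """Fuse two point sets so that every point carries A1, A2, B1 and B2.

    Attributes that a point lacks are copied from the nearest point of the
    other set (nearest-neighbour assignment, assumed here for simplicity).
    """
    def nearest(point, candidates):
        return min(
            candidates,
            key=lambda c: math.hypot(c["x"] - point["x"], c["y"] - point["y"]),
        )

    fused = []
    for own, other in ((alpha, beta), (beta, alpha)):
        for point in own:
            merged = dict(nearest(point, other))  # start from the donor's attributes
            merged.update(point)  # the point's own coordinates and values win
            fused.append(merged)
    return fused


alpha = [{"x": 0, "y": 0, "A1": 1, "A2": 2}]
beta = [{"x": 1, "y": 0, "B1": 3, "B2": 4}, {"x": 5, "y": 5, "B1": 7, "B2": 8}]

for point in fuse(alpha, beta):
    print(point)
```

The fused set keeps every input point, which distinguishes it from a join that retains only matching locations. A production implementation would normally replace the nearest-neighbour step with an interpolation suited to the attribute in question.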
|
||||
|
||||
In a much more complicated application, marine animal researchers use data fusion to combine animal tracking data with bathymetric, meteorological, sea surface temperature (SST) and animal habitat data to examine and understand habitat utilization and animal behavior in reaction to external forces such as weather or water temperature. Each of these data sets exhibits a different spatial grid and sampling rate, so a simple combination would likely create erroneous assumptions and taint the results of the analysis. But through the use of data fusion, all data and attributes are brought together into a single view in which a more complete picture of the environment is created. This enables scientists to identify key locations and times and form new insights into the interactions between the environment and animal behaviors.
|
||||
In the figure at right, rock lobsters are studied off the coast of Tasmania. Hugh Pederson of the University of Tasmania used data fusion software to fuse southern rock lobster tracking data (color-coded in yellow and black for day and night, respectively) with bathymetry and habitat data to create a unique 4D picture of rock lobster behavior.
|
||||
|
||||
== Data integration ==
|
||||
|
||||
In applications outside of the geospatial domain, the terms data integration and data fusion are used differently. In areas such as business intelligence, for example, data integration is used to describe the combining of data, whereas data fusion is integration followed by reduction or replacement. Data integration might be viewed as set combination wherein the larger set is retained, whereas fusion is a set reduction technique with improved confidence.
|
||||
|
||||
=== Application areas ===
|
||||
|
||||
== From multiple traffic sensing modalities ==
|
||||
The data from the different sensing technologies can be combined in intelligent ways to determine the traffic state accurately. A data-fusion-based approach that utilizes roadside-collected acoustic, image and sensor data has been shown to combine the advantages of the different individual methods.
|
||||
|
||||
== Decision fusion ==
|
||||
In many cases, geographically dispersed sensors are severely energy- and bandwidth-limited. Therefore, the raw data concerning a certain phenomenon are often summarized in a few bits from each sensor. When inferring on a binary event (i.e., $\mathcal{H}_0$ or $\mathcal{H}_1$), in the extreme case only binary decisions are sent from sensors to a Decision Fusion Center (DFC) and combined in order to obtain improved classification performance.
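As a rough illustration of how a fusion center might combine one-bit decisions, the sketch below applies a counting (k-out-of-n) rule, of which majority voting is a special case. The function names and the sample decisions are hypothetical, and the text does not say which fusion rule a given DFC uses; this is one simple option among several.

```python
def k_out_of_n(decisions, k):
    """Decide H1 (return 1) if at least k sensors reported 1, else H0 (return 0)."""
    return 1 if sum(decisions) >= k else 0

def majority_vote(decisions):
    """Decide H1 if strictly more than half of the sensors reported 1."""
    return k_out_of_n(decisions, len(decisions) // 2 + 1)

# Hypothetical one-bit decisions received from five sensors
local_decisions = [1, 0, 1, 1, 0]

fused_majority = majority_vote(local_decisions)                   # 3 of 5 vote H1
fused_or = k_out_of_n(local_decisions, 1)                         # OR rule
fused_and = k_out_of_n(local_decisions, len(local_decisions))     # AND rule
```

Lowering `k` makes the fused detector more sensitive at the cost of more false alarms, while raising it does the opposite; choosing `k` well depends on the sensors' individual error rates, which this sketch does not model.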
|
||||
49
data/en.wikipedia.org/wiki/Data_fusion-1.md
Normal file
@ -0,0 +1,49 @@
|
||||
---
|
||||
title: "Data fusion"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Data_fusion"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:51.162435+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
== For enhanced contextual awareness ==
|
||||
With a multitude of built-in sensors, including motion, environmental, and position sensors, a modern mobile device typically gives mobile applications access to a range of sensor data that can be leveraged to enhance contextual awareness. Applying signal processing and data fusion techniques such as feature generation, feasibility study, and principal component analysis (PCA) to such sensor data can greatly improve the rate at which the motion and context-relevant status of the device are correctly classified. Many context-enhanced information techniques are provided by Snidaro, et al.
|
||||
|
||||
== Statistical methods ==
|
||||
|
||||
=== Bayesian auto-regressive Gaussian processes ===
|
||||
|
||||
Gaussian processes are a popular machine learning model. If an auto-regressive relationship between the data is assumed, and each data source is assumed to be a Gaussian process, this constitutes a non-linear Bayesian regression problem.
|
||||
|
||||
=== Semiparametric estimation ===
|
||||
Many data fusion methods assume common conditional distributions across several data sources. Recently, methods have been developed to enable efficient estimation within the resulting semiparametric model.
|
||||
|
||||
== See also ==
|
||||
Data assimilation
|
||||
Data munging
|
||||
Image fusion
|
||||
Information integration
|
||||
Integrative level
|
||||
Meta-analysis
|
||||
Sensor fusion
|
||||
|
||||
== References ==
|
||||
|
||||
=== Sources ===
|
||||
General references
|
||||
Hall, Dave L.; Llinas, James (1997). "Introduction to Multisensor Data Fusion". Proceedings of the IEEE. 85 (1): 6–23. doi:10.1109/5.554205.
|
||||
Blasch, Erik; Kadar, Ivan; Salerno, John; Kokar, Mieczyslaw M.; Das, Subrata; Powell, Gerald M.; Corkill, Daniel D.; Ruspini, Enrique H. (2006). "Issues and Challenges in Situation Assessment (Level 2 Fusion)" (PDF). Journal of Advances in Information Fusion. 1 (2). Archived from the original (PDF) on 2015-05-27.
|
||||
|
||||
== Bibliography ==
|
||||
Hall, David L.; McMullen, Sonya A. H. (2004). Mathematical Techniques in Multisensor Data Fusion, Second Edition. Norwood, MA: Artech House, Inc. ISBN 978-1-5805-3335-5.
|
||||
Mitchell, H. B. (2007). Multi-sensor Data Fusion – An Introduction. Berlin: Springer-Verlag. ISBN 978-3-540-71463-7.
|
||||
Das, S. (2008). High-Level Data Fusion. Norwood, MA: Artech House Publishers. ISBN 978-1-59693-281-4.
|
||||
|
||||
== External links ==
|
||||
|
||||
Discriminant Correlation Analysis (DCA)
|
||||
Sensordata Fusion, An Introduction
|
||||
International Society of Information Fusion
|
||||
Sensor Fusion for Nanopositioning
|
||||
63
data/en.wikipedia.org/wiki/Data_profiling-0.md
Normal file
@ -0,0 +1,63 @@
|
||||
---
|
||||
title: "Data profiling"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data_profiling"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:52.402806+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to:
|
||||
|
||||
Find out whether existing data can be easily used for other purposes
|
||||
Improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category
|
||||
Assess data quality, including whether the data conforms to particular standards or patterns
|
||||
Assess the risk involved in integrating data in new applications, including the challenges of joins
|
||||
Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
|
||||
Assess whether known metadata accurately describes the actual values in the source database
|
||||
Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
|
||||
Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality.
|
||||
|
||||
|
||||
== Introduction ==
|
||||
Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data. Profiling helps to not only understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata. The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design.
|
||||
|
||||
|
||||
== How data profiling is conducted ==
|
||||
Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum, and additional metadata information obtained during data profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.
|
||||
Different analyses are performed for different structural levels. E.g. single columns could be profiled individually to get an understanding of frequency distribution of different values, type, and use of each column. Embedded value dependencies can be exposed in a cross-columns analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis.
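A single-column profile of the kind described above can be sketched with the Python standard library. The function name `profile_column`, the use of `None` for missing values, and the sample `ages` column are assumptions made for illustration; purpose-built profiling tools compute far more than this.

```python
import statistics
from collections import Counter

def profile_column(values):
    """Summarise one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        # A column is a key candidate only if no non-null value repeats.
        "is_unique": len(counts) == len(present),
        "most_common": counts.most_common(1)[0] if counts else None,
    }
    # Descriptive statistics only make sense for numeric columns.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
        )
    return profile

# Hypothetical column with one missing value and one repeated value
ages = [34, 41, None, 29, 41, 57]
age_profile = profile_column(ages)
```

Here the repeated value 41 rules the column out as a key candidate, and the single `None` shows up in the null count; both are the kind of finding that feeds the cross-column and inter-table analyses mentioned above.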
|
||||
Normally, purpose-built tools are used for data profiling to ease the process. The computational complexity increases when going from single column, to single table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.
|
||||
|
||||
|
||||
== When is data profiling conducted? ==
|
||||
According to Kimball, data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken immediately after candidate source systems have been identified and DW/BI business requirements have been satisfied. The purpose of this initial analysis is to clarify at an early stage whether the correct data is available at the appropriate level of detail and whether anomalies can be handled subsequently. If this is not the case, the project may be terminated.
|
||||
Additionally, more in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set.
|
||||
Additionally, data profiling may be conducted in the data warehouse development process after data has been loaded into staging, the data marts, etc. Conducting data profiling at these stages helps ensure that data cleaning and transformations have been done correctly and in compliance with requirements.
|
||||
|
||||
|
||||
=== Benefits and examples ===
|
||||
Data profiling produces concise summaries of datasets that reveal structure and quality issues. Typical outputs include statistics and dependency findings that help analysts understand data, accelerate integration/migration, and improve data quality programs.
|
||||
|
||||
Common profiling statistics and checks
|
||||
Completeness (e.g., percentage of NULL/blank values).
|
||||
Distinctness/uniqueness (number of distinct values; candidate keys).
|
||||
Value distributions (frequencies, histograms, ranges, outliers).
|
||||
Pattern/format frequencies (e.g., regex or mask patterns for codes, emails, addresses).
|
||||
Structural dependencies such as keys, functional dependencies, and inclusion/foreign-key dependencies used for integration and schema matching.
|
||||
How results are used
|
||||
Profiling provides baselines and evidence for data quality dimensions (such as completeness, accuracy, consistency, timeliness) and supplies metrics for remediation; many data governance frameworks treat profiling as an input to ongoing data quality management.
|
||||
Example. For a column ``customer_email``, a profile may report 98.2% non-NULL values, 97.9% values matching an email pattern, top domains by frequency, and no candidate key; for a shipment table, a functional dependency such as ``order_id → customer_id`` may be discovered to support integration.
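The two checks in this example, a pattern-match rate and a functional dependency test, might be sketched as follows. The email regular expression is deliberately loose and is an assumption rather than a standard; the helper names and the sample rows are likewise hypothetical.

```python
import re

# Loose illustrative pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def pattern_match_rate(values, pattern):
    """Fraction of non-null values that match `pattern` (0.0 if none present)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if pattern.match(v)) / len(present)

def holds_functional_dependency(rows, lhs, rhs):
    """True if every value of column `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        # setdefault records the first rhs value seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

customer_email = ["a@example.com", "b@example.org", None, "not-an-email"]
email_rate = pattern_match_rate(customer_email, EMAIL_PATTERN)

shipments = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
]
fd_holds = holds_functional_dependency(shipments, "order_id", "customer_id")
```

Note that this match rate is computed over non-null values only; a profile could equally report it over all rows, so the denominator should be stated alongside the figure.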
|
||||
|
||||
|
||||
== See also ==
|
||||
Data quality
|
||||
Data governance
|
||||
Master data management
|
||||
Database normalization
|
||||
Data visualization
|
||||
Analysis paralysis
|
||||
Data analysis
|
||||
|
||||
|
||||
== References ==
|
||||
37
data/en.wikipedia.org/wiki/Data_science-0.md
Normal file
@ -0,0 +1,37 @@
|
||||
---
|
||||
title: "Data science"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Data_science"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:53.540719+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms, and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data.
|
||||
Data science plays a critical role in modern decision-making by enabling organizations to extract actionable insights from large and complex datasets.
|
||||
Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.
|
||||
Data science is "a concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is distinct from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
|
||||
Data science is often described as a multidisciplinary and rapidly evolving field because it draws on techniques from diverse areas, such as computer science, statistics, information science, and other subject-specific disciplines. Some researchers argue that the combination of the different fields resembles how information science was decades ago (Mayernik, 2023). These similarities help clarify how data science gradually became its own field of study.
|
||||
A data scientist is a professional who creates programming code and combines it with statistical knowledge to summarize data.
|
||||
|
||||
== Foundations ==
|
||||
Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets and applying the knowledge from that data to solve problems in other application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, and summarizing these findings. As such, it incorporates skills from computer science, mathematics, data visualization, graphic design, communication, and business.
|
||||
Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g., from images, text, sensors, transactions, customer information, etc.) and emphasizes prediction and action. Andrew Gelman of Columbia University has described statistics as a non-essential part of data science. Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. He describes data science as an applied field growing out of traditional statistics.
|
||||
|
||||
== Etymology ==
|
||||
|
||||
=== Early usage ===
|
||||
In 1962, John Tukey described a field he called "data analysis", which resembles modern data science. In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing.
|
||||
The term "data science" has been traced back to 1974, when Peter Naur, in his book Concise Survey of Computer Methods, proposed using it rather than "computer science" to reflect the growing emphasis on data-driven methods. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.
|
||||
|
||||
=== Modern usage ===
|
||||
In 2012, technologists Thomas H. Davenport and DJ Patil declared "Data Scientist: The Sexiest Job of the 21st Century", a catchphrase that was picked up even by major-city newspapers like the New York Times and the Boston Globe. A decade later, they reaffirmed it, stating that "the job is more in demand than ever with employers".
|
||||
The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science.
|
||||
Over the last few years, many colleges have begun to create more structured undergraduate programs in data science. According to a report by the National Academies, strong programs typically include training in statistics, computing, ethics, and communication, as well as hands-on work in a specific field (National Academies of Sciences, Engineering, and Medicine, 2018). As schools try to prepare students for jobs that use data, these practices become more common.
|
||||
The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in 2008. Though it was used by the National Science Board in their 2005 report "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", it referred broadly to any key role in managing a digital data collection.
|
||||
|
||||
== Data science and data analysis ==
|
||||
|
||||
In data science, data analysis is the process of inspecting, cleaning, transforming, and modelling data to discover useful information, draw conclusions, and support decision-making. It includes exploratory data analysis (EDA), which uses graphics and descriptive statistics to explore patterns and generate hypotheses, and confirmatory data analysis, which applies statistical inference to test hypotheses and quantify uncertainty.
|
||||
Typical activities comprise:
|
||||
47
data/en.wikipedia.org/wiki/Data_science-1.md
Normal file
@ -0,0 +1,47 @@
|
||||
---
|
||||
title: "Data science"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Data_science"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:53.540719+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
data collection and integration;
|
||||
data cleaning and preparation (handling missing values, outliers, encoding, normalisation);
|
||||
feature engineering and selection;
|
||||
visualisation and descriptive statistics;
|
||||
fitting and evaluating statistical or machine-learning models;
|
||||
communicating results and ensuring reproducibility (e.g., reports, notebooks, and dashboards).
|
||||
Lifecycle frameworks such as CRISP-DM describe these steps from business understanding through deployment and monitoring.
|
||||
Data science involves working with larger datasets that often require advanced computational and statistical methods to analyze. Data scientists often work with unstructured data such as text or images and use machine learning algorithms to build predictive models. Data science often uses statistical analysis, data preprocessing, and supervised learning.
|
||||
Recent studies indicate that AI is moving towards data-centric approaches, focusing on the quality of datasets rather than just improving AI models. This trend focuses on improving system performance by cleaning, refining, and labeling data (Bhatt et al., 2024). As AI systems grow larger, the data-centric view has become increasingly important.
|
||||
|
||||
== Cloud computing for data science ==
|
||||
|
||||
Cloud computing can offer access to large amounts of computational power and storage. In big data, where volumes of information are continually generated and processed, these platforms can be used to handle complex and resource-intensive analytical tasks.
|
||||
Some distributed computing frameworks are designed to handle big data workloads. These frameworks can enable data scientists to process and analyze large datasets in parallel, which can reduce processing times.
|
||||
|
||||
== Ethical consideration in data science ==
|
||||
Data science involves collecting, processing, and analyzing data which often includes personal and sensitive information. Ethical concerns include potential privacy violations, bias perpetuation, and negative societal impacts.
|
||||
Ethics education in data science has grown to encompass both technical principles and more expansive philosophical questions. Research indicates that data science ethics courses are increasingly integrating human-centric topics, including fairness, accountability, and responsible decision-making, thereby connecting them to enduring discussions in moral and political philosophy (Colando & Hardin, 2024). The goal of this method is to help students understand how data-driven technologies affect society.
|
||||
Machine learning models can amplify existing biases present in training data, leading to discriminatory or unfair outcomes. Another area of data science that is growing is the push for better ways to cite data. Citing datasets makes it easier for other researchers to understand what data was used and for studies to be repeated (Lafia et al., 2023). These practices give the people who collect and manage data the credit they deserve, which is becoming more important in modern research.
|
||||
|
||||
== See also ==
|
||||
|
||||
Python (programming language)
|
||||
R (programming language)
|
||||
Data engineering
|
||||
Big data
|
||||
Machine learning
|
||||
Artificial intelligence
|
||||
Bioinformatics
|
||||
Astroinformatics
|
||||
Topological data analysis
|
||||
List of data science journals
|
||||
List of data science software
|
||||
List of open-source data science software
|
||||
Data science notebook software
|
||||
|
||||
== References ==
|
||||
85
data/en.wikipedia.org/wiki/Data_version_control-0.md
Normal file
@ -0,0 +1,85 @@
|
||||
---
|
||||
title: "Data version control"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Data_version_control"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:54.761385+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Data version control is a method of working with data sets. It is similar to the version control systems used in traditional software development, but is optimized to allow better processing of data and collaboration in the context of data analytics, research, and any other form of data analysis. Data version control may also include specific features and configurations designed to facilitate work with large data sets and data lakes.
|
||||
|
||||
|
||||
== History ==
|
||||
|
||||
|
||||
=== Background ===
|
||||
As early as 1985, researchers recognized the need for defining timing attributes in database tables, which would be necessary for tracking changes to databases. This research continued into the 1990s, and the theory was formalized into practical methods for managing data in relational databases, providing some of the foundational concepts for what would later become data version control.
|
||||
In the early 2010s the size of data sets was rapidly expanding, and relational databases were no longer sufficient to manage the amounts of data organizations were accumulating. The Apache Hadoop ecosystem, with HDFS as a storage layer, and later object storage became dominant in big data operations. Research into data management tools and data version control systems increased sharply, along with demand for such tools from both academia and the private and public sectors.
|
||||
|
||||
|
||||
=== Version controlled databases ===
|
||||
The first versioned database was proposed in 2012 for the SciDB database; the proposal demonstrated that it was possible to create chains and trees of different versions of the database while decreasing both the overall storage size and the access times associated with previous methods. In 2014, a proposal was made to generalize these principles into a platform that could be used for any application.
|
||||
In 2016, a prototype for a data version control system was developed during a Kaggle competition. This software was later used internally at an AI firm, and eventually spun off as a startup. Since then, a number of data version control systems, both open and closed source, have been developed and offered commercially, with a subset dedicated specifically to machine learning.
|
||||
|
||||
|
||||
== Use cases ==
|
||||
|
||||
|
||||
=== Reproducibility ===
|
||||
A wide range of scientific disciplines have adopted automated analysis of large quantities of data, including astrophysics, seismology, biology and medicine, social sciences and economics, and many other fields. The principle of reproducibility is an important aspect of formalizing findings in scientific disciplines, and in the context of data science presents a number of challenges. Most datasets are constantly changing, whether due to the addition of more data or changes in the structure and format of the data, and small changes can have significant effects on the outcome of experiments. Data version control allows for recording the exact state of data sets at a particular moment of time, making it easier to reproduce and understand experimental outcomes. If data practitioners can only know the present state of the data, they may run into a number of challenges such as difficulties in problem debugging or complying with data audits.
|
||||
|
||||
|
||||
=== Development and testing ===
|
||||
Data version control is sometimes used in testing and development of applications that interact with large quantities of data. Some data version control tools allow users to create replicas of their production environment for testing purposes. This approach allows them to test data integration processes such as extract, transform and load (ETL) and understand the changes made to data without having a negative impact on the consumers of the production data.
|
||||
|
||||
|
||||
=== Machine learning and artificial intelligence ===
|
||||
In the context of machine learning, data version control can be used for optimizing the performance of models. It can allow automating the process of analyzing outcomes with different versions of a data set to continuously improve performance. It is possible that open source data version control software could eliminate the need for proprietary AI platforms by extending tools like Git and CI/CD for use by machine learning engineers. Many open-source solutions build on Git-like semantics to provide these capabilities, as Git itself was designed for small text files and doesn't support typical machine learning datasets, which are very large.
|
||||
|
||||
|
||||
=== CI/CD for data ===
|
||||
CI/CD methodologies can be applied to datasets using data version control. Version control enables users to integrate with automation servers that allow establishing a CI/CD process for data. By adding testing platforms to the process, they can guarantee high quality of the data product. In this scenario, teams execute Continuous Integration (CI) tests on data and set checks in place to ensure the data is promoted to production only when all the set data quality and data governance criteria are met.
|
||||
|
||||
|
||||
=== Experimentation in isolated environments ===
|
||||
To experiment on a dataset without impacting production data, one can use data version control to create replicas of the production environment where tests can be carried out. Such replicas allow testing and understanding of changes safely applied to data.
|
||||
Data version control tools allow environments to be replicated without time- and resource-consuming maintenance. Instead, such tools allow objects to be shared using metadata.
|
||||
|
||||
|
||||
=== Rollback ===
|
||||
Continuous changes in data sets can sometimes cause functionality issues or lead to undesired outcomes, especially when applications are using the data. Data version control tools allow for the possibility to roll back a data set to an earlier state. This can be used to restore or improve functionality of an application or to correct errors or bad data which has been mistakenly included.
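The snapshot-and-rollback idea can be illustrated with a toy in-memory store. The class `DatasetRepo` and its methods are hypothetical and do not reflect the API of any tool listed in this article; the sketch assumes a data set is a small JSON-serializable list of records and identifies each snapshot by a SHA-256 content hash.

```python
import hashlib
import json

class DatasetRepo:
    """Toy in-memory store that snapshots a data set and can roll it back."""

    def __init__(self):
        self._objects = {}  # content hash -> serialized snapshot
        self._history = []  # committed hashes, oldest first

    def commit(self, records):
        """Record the exact state of `records` and return its content hash."""
        payload = json.dumps(records, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self._objects[digest] = payload  # identical content is stored once
        self._history.append(digest)
        return digest

    def checkout(self, digest):
        """Return the data set exactly as it was at the given commit."""
        return json.loads(self._objects[digest])

    def rollback(self, steps=1):
        """Move the head back by `steps` commits and return that data set."""
        if not 1 <= steps < len(self._history):
            raise ValueError("steps must be >= 1 and leave at least one commit")
        del self._history[-steps:]
        return self.checkout(self._history[-1])

repo = DatasetRepo()
v1 = repo.commit([{"id": 1, "price": 9.5}])
v2 = repo.commit([{"id": 1, "price": -1.0}])  # bad data slips in
restored = repo.rollback()                    # head returns to v1
```

Because snapshots are addressed by content hash, the bad version remains retrievable with `repo.checkout(v2)` for debugging or audits even after the rollback, which is also what makes the exact state of a past experiment reproducible.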
|
||||
|
||||
|
||||
== Examples ==
|
||||
Version controlled data sources:
|
||||
|
||||
Kaggle
|
||||
Quilt
|
||||
Dolt
|
||||
Kamu
|
||||
Data version control for data lakes:
|
||||
|
||||
LakeFS
|
||||
Project Nessie
|
||||
Git-LFS
|
||||
ML-Ops systems that implement data version control:
|
||||
|
||||
DVC
|
||||
Pachyderm
|
||||
Neptune
|
||||
activeloop
|
||||
graviti
|
||||
dagshub
|
||||
alectio
|
||||
Galileo
|
||||
Voxel51
|
||||
dstack
|
||||
dvid
|
||||
|
||||
|
||||
== See also ==
|
||||
|
||||
|
||||
== References ==
|
||||
61
data/en.wikipedia.org/wiki/Dataset_shift-0.md
Normal file
@ -0,0 +1,61 @@
|
||||
---
|
||||
title: "Dataset shift"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Dataset_shift"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:59.453907+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Dataset shift is a phenomenon in machine learning and statistics in which the joint distribution of input variables and target labels is different in the training phase and the deployment or test phase (i.e., $P_{\text{train}}(X,Y)\neq P_{\text{test}}(X,Y)$). This happens when the statistical properties of data used to train a model are no longer representative of the data encountered in real-world use, often resulting in degraded predictive performance and diminished generalization ability.
|
||||
Dataset shift is a generic term for a number of particular types of distributional change. Covariate shift occurs when the distribution of the input features changes, but the conditional relationship between inputs and outputs remains constant. Prior probability shift (or label shift) occurs when the distribution of target labels changes, but the conditional distribution of inputs given labels stays the same. Concept shift (also known as concept drift) is a change in the conditional relationship between inputs and outputs that renders previously learned patterns invalid over time.
|
||||
A key challenge for deploying machine learning systems is dataset shift, in particular in dynamic environments where the data distributions change over time. Detecting and mitigating such shifts is an active area of research, e.g., drift detection, domain adaptation, continual learning.
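One simple way to look for covariate shift in a single numeric feature is to compare its training and deployment samples with the two-sample Kolmogorov–Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is an illustration only: the synthetic Gaussian samples and the function name are assumptions, it checks one feature's marginal distribution rather than the joint distribution, and it is not a complete drift detector.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train = [rng.gauss(0.0, 1.0) for _ in range(500)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(500)]     # same distribution
test_shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]  # mean has moved

stat_same = ks_statistic(train, test_same)
stat_shifted = ks_statistic(train, test_shifted)
```

A statistic near 0 suggests the two samples follow similar distributions, while a value near 1 indicates almost no overlap; where to place an alert threshold depends on sample sizes and tolerance for false alarms and is left open here.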
|
||||
|
||||
|
||||
== See also ==
|
||||
Concept drift
|
||||
Domain adaptation
|
||||
Overfitting
|
||||
Statistical classification
|
||||
|
||||
|
||||
|
||||
405
data/en.wikipedia.org/wiki/Degree-Rips_bifiltration-0.md
Normal file
@ -0,0 +1,405 @@
|
||||
---
|
||||
title: "Degree-Rips bifiltration"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Degree-Rips_bifiltration"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:00.644610+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The degree-Rips bifiltration is a simplicial filtration used in topological data analysis for analyzing the shape of point cloud data. It is a multiparameter extension of the Vietoris–Rips filtration that possesses greater stability to data outliers than single-parameter filtrations, and which is more amenable to practical computation than other multiparameter constructions. Introduced in 2015 by Lesnick and Wright, the degree-Rips bifiltration is a parameter-free and density-sensitive vehicle for performing persistent homology computations on point cloud data.
|
||||
|
||||
|
||||
== Definition ==
|
||||
It is standard practice in topological data analysis (TDA) to associate a sequence of nested simplicial complexes to a finite data set in order to detect the persistence of topological features over a range of scale parameters. One way to do this is by considering the sequence of Vietoris–Rips complexes of a finite set in a metric space indexed over all scale parameters.
|
||||
If $X$ is a finite set in a metric space, then this construction is known as the Vietoris–Rips (or simply "Rips") filtration on $X$, commonly denoted $\text{Rips}(X)$ or $\mathcal{R}(X)$. The Rips filtration can be expressed as a functor $\text{Rips}(X):\mathbb{R}\to \mathbf{Simp}$ from the real numbers (viewed as a poset category) to the category of simplicial complexes and simplicial maps, a subcategory of the category $\mathbf{Top}$ of topological spaces and continuous maps via the geometric realization functor.
|
||||
The Rips filtration is indexed over a single parameter, but we can capture more information (e.g., density) about the underlying data set by considering multiparameter filtrations. A filtration indexed by the product of two totally-ordered sets is known as a bifiltration, first introduced by Gunnar Carlsson and Afra Zomorodian in 2009.
|
||||
The degree-Rips bifiltration filters each simplicial complex in the Rips filtration by the degree of each vertex in the graph isomorphic to the 1-skeleton at each index. More formally, let $(a,b)$ be an element of $\mathbb{R}^{2}$ and define $G_{a,b}$ to be the subgraph of the 1-skeleton of $\text{Rips}(X)_{b}$ containing all vertices whose degree is at least $a$. Subsequently building the maximal simplicial complex possible on this 1-skeleton, we obtain a complex $\text{D-Rips}(X)_{a,b}$. By doing this for all possible vertex degrees, and across all scale parameters in the Rips filtration, we extend the Rips construction to a bifiltration $\{\text{D-Rips}(X)_{a,b}\}_{(a,b)\in \mathbb{R}^{2}}$.
|
||||
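The construction above can be sketched for a single index $(a,b)$ in a few lines of Python. This is an illustrative brute-force version, not how dedicated software computes it: the function names are hypothetical, points are assumed to be coordinate tuples in Euclidean space, and an edge is assumed to be present when the distance between two points is at most `b` (some authors use `2b` instead).

```python
from itertools import combinations
from math import dist


def rips_edges(points, b):
    """Edges (i, j), i < j, of the 1-skeleton of the Rips complex at scale b."""
    return {
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= b
    }


def degree_rips(points, a, b, max_dim=2):
    """Simplices (as sorted index tuples) of the degree-Rips complex at (a, b)."""
    edges = rips_edges(points, b)

    # Degree of every vertex in the 1-skeleton of Rips(X)_b.
    degree = [0] * len(points)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1

    # G_{a,b}: keep only vertices of degree at least a, and the edges between them.
    kept = [v for v in range(len(points)) if degree[v] >= a]
    kept_set = set(kept)
    kept_edges = {e for e in edges if e[0] in kept_set and e[1] in kept_set}

    # Maximal (clique) complex on G_{a,b}, truncated at dimension max_dim.
    simplices = [(v,) for v in kept]
    for size in range(2, max_dim + 2):
        for combo in combinations(kept, size):
            if all(pair in kept_edges for pair in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
for a in (0, 1, 3):
    print(a, len(degree_rips(points, a, b=1.5)))
```

Raising `a` removes low-degree vertices, so the complex shrinks as `a` grows while it grows with `b`.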
Note that since the size of each complex will decrease as $a$ increases, we should identify the indexing set $\mathbb{R}^{2}$ with $\mathbb{R}^{\text{op}}\times \mathbb{R}$, where $\mathbb{R}^{\text{op}}$ is the opposite poset category of $\mathbb{R}$. Therefore the degree-Rips bifiltration can be viewed as a functor $\text{D-Rips}(X):\mathbb{R}^{\operatorname{op}}\times \mathbb{R}\to \mathbf{Simp}$.
|
||||
The idea behind the degree-Rips bifiltration is that vertices of higher degree will correspond to higher density regions of the underlying data set. However, since degree-Rips does not depend on an arbitrary choice of a parameter (such as a pre-selected density parameter, which is a priori difficult to determine), it is a convenient tool for analyzing data.
|
||||
|
||||
|
||||
== Applications to data analysis ==
|
||||
The degree-Rips bifiltration possesses several properties that make it a useful tool in data analysis. For example, each of its skeleta has polynomial size; the k-dimensional skeleton of $\text{D-Rips}(X)$ has $O(|X|^{k+2})$ simplices, where $O$ denotes an asymptotic upper bound. Moreover, it has been shown that the degree-Rips bifiltration possesses reasonably strong stability properties with respect to perturbations of the underlying data set. Further work has also been done examining the stable components and homotopy types of degree-Rips complexes.
|
||||
The software RIVET was created in order to visualize several multiparameter invariants (i.e., data structures that attempt to capture underlying geometric information of the data) of 2-parameter persistence modules, including the persistent homology modules of the degree-Rips bifiltration. These invariants include the Hilbert function, rank invariant, and fibered barcode.
|
||||
As a follow-up to the introduction of degree-Rips in their original 2015 paper, Lesnick and Wright showed in 2022 that a primary component of persistent homology computations (namely, computing minimal presentations and bigraded Betti numbers) can be achieved efficiently in a way that outperforms other persistent homology software. Methods of improving algorithmic efficiency of multiparameter persistent homology have also been explored that suggest the possibility of substantial speed increases for data analysis tools such as RIVET.
|
||||
The degree-Rips bifiltration has been used for analyzing data clusters with respect to variations in density. There has been some preliminary experimental analysis of the performance of degree-Rips with respect to outliers in particular, but this is an ongoing area of research as of February 2023.
|
||||
|
||||
|
||||
|
||||
@ -0,0 +1,157 @@
|
||||
---
|
||||
title: "Directional-change intrinsic time"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Directional-change_intrinsic_time"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:03.000900+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Directional-change intrinsic time is an event-based operator to dissect a data series into a sequence of alternating trends of defined size $\delta$.
|
||||
|
||||
The directional-change intrinsic time operator was developed for the analysis of financial market data series. It is an alternative methodology to the concept of continuous time. The operator dissects a data series into a set of drawups and drawdowns, that is, up and down trends that alternate with each other. An established trend comes to an end as soon as a trend reversal is observed. A price move that extends a trend is called an overshoot and leads to new price extremes.
|
||||
Figure 1 provides an example of a price curve dissected by the directional change intrinsic time operator.
|
||||
The frequency of directional-change intrinsic events maps (1) the volatility of price changes conditional on (2) the selected threshold $\delta$. The stochastic nature of the underlying process is mirrored in the unequal number of intrinsic events observed over equal periods of physical time.
|
||||
The directional-change intrinsic time operator is a noise-filtering technique. It identifies regime shifts, when trend changes of a particular size occur, and hides price fluctuations that are smaller than the threshold $\delta$.
|
||||
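The dissection described above can be sketched as a single pass over a price series. This is a minimal illustration under stated assumptions rather than a reference implementation: the threshold `delta` is assumed to be a relative price change measured from the most recent extreme, the series is assumed to start in an upward trend, and the function name `directional_changes` is hypothetical.

```python
def directional_changes(prices, delta):
    """Return (index, new_direction) for each directional-change event.

    A "down" event fires when the price falls by at least `delta` (relative)
    from the running maximum of the current up trend; an "up" event fires when
    it rises by at least `delta` from the running minimum of the down trend.
    Moves that extend the current trend (overshoots) only update the extreme.
    """
    events = []
    mode = "up"           # assumed initial trend
    extreme = prices[0]   # running maximum (up trend) or minimum (down trend)
    for i, price in enumerate(prices[1:], start=1):
        if mode == "up":
            if price > extreme:
                extreme = price                      # overshoot: new high
            elif price <= extreme * (1 - delta):
                events.append((i, "down"))           # reversal of size delta
                mode, extreme = "down", price
        else:
            if price < extreme:
                extreme = price                      # overshoot: new low
            elif price >= extreme * (1 + delta):
                events.append((i, "up"))
                mode, extreme = "up", price
    return events


prices = [100, 101, 103, 102, 99.9, 98, 99, 101, 100.5]
print(directional_changes(prices, delta=0.03))
```

Fluctuations smaller than `delta` produce no events, which is the noise-filtering behaviour the text describes; a larger threshold on the same series yields fewer events.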
|
||||
|
||||
== Application ==
|
||||
The directional-change intrinsic time operator was used to analyze high-frequency foreign exchange market data and has led to the discovery of a large set of scaling laws that had not been previously observed. The scaling laws identify properties of the underlying data series, such as the size of the expected price overshoot after an intrinsic time event or the number of expected directional changes within a physical time interval or price threshold. For example, one scaling law relates the expected number of directional changes $N(\delta )$ observed over a fixed period to the size of the threshold $\delta$: $N(\delta )=\left({\frac {\delta }{C_{N,DC}}}\right)^{E_{N,DC}}$, where $C_{N,DC}$ and $E_{N,DC}$ are the scaling law coefficients.
|
||||
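Taking logarithms of the scaling law above makes the relationship linear, which is a convenient way to read the two coefficients. A brief worked form, using only the symbols already defined and assuming $\delta > 0$ and $C_{N,DC} > 0$:

$$
\ln N(\delta) = E_{N,DC}\,\bigl(\ln \delta - \ln C_{N,DC}\bigr)
$$

In this form $E_{N,DC}$ is the slope of $\ln N(\delta)$ against $\ln \delta$, and $C_{N,DC}$ is the threshold at which $N(\delta) = 1$.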
Other applications of the directional-change intrinsic time in finance include:
|
||||
|
||||
a trading strategy characterised by an annual Sharpe ratio of 3.04
|
||||
tools designed to monitor liquidity at multiple trend scales.
|
||||
The methodology can also be used for applications beyond economics and finance. It can be applied to other scientific domains and opens a new avenue of research in the area of big data.
|
||||
|
||||
|
||||
== References ==
|
||||
Text in this draft was copied from Petrov, Vladimir; Golub, Anton; Olsen, Richard (2019). "Instantaneous Volatility Seasonality of High-Frequency Markets in Directional-Change Intrinsic Time". Journal of Risk and Financial Management. 12 (2): 54. doi:10.3390/jrfm12020054. hdl:10419/239003., which is available under a Creative Commons Attribution 4.0 International License.
|
||||
147
data/en.wikipedia.org/wiki/Directional_component_analysis-0.md
Normal file
@ -0,0 +1,147 @@
|
||||
---
|
||||
title: "Directional component analysis"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Directional_component_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:01.826255+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Directional component analysis (DCA) is a statistical method used in climate science for identifying representative patterns of variability in space-time data-sets such as historical climate observations, weather prediction ensembles or climate ensembles.
|
||||
The first DCA pattern is a pattern of weather or climate variability that is both likely to occur (measured using likelihood) and has a large impact (for a specified linear impact function, and given certain mathematical conditions: see below).
|
||||
The first DCA pattern contrasts with the first PCA pattern, which is likely to occur, but may not have a large impact, and with a pattern derived from the gradient of the impact function, which has a large impact, but may not be likely to occur.
|
||||
DCA differs from other pattern identification methods used in climate research, such as EOFs, rotated EOFs, and extended EOFs, in that it takes into account an external vector, the gradient of the impact.
|
||||
DCA provides a way to reduce large ensembles from weather forecasts or climate models to just two patterns.
|
||||
The first pattern is the ensemble mean, and the second pattern is the DCA pattern, which represents variability around the ensemble mean in a way that takes impact into account.
|
||||
DCA contrasts with other methods that have been proposed for the reduction of ensembles in that it takes impact into account in addition to the structure of the ensemble.
|
||||
|
||||
== Overview ==
|
||||
|
||||
=== Inputs ===
|
||||
DCA is calculated from two inputs:
|
||||
|
||||
a multivariate dataset of weather or climate data, such as historical climate observations, or a weather or climate ensemble
|
||||
a linear impact function. The linear impact function is a function which defines a level of impact for every spatial pattern in the weather or climate data as a weighted sum of the values at different locations in the spatial pattern. An example is the mean value across the spatial pattern. The linear impact function can be generated as the first term in the multivariate Taylor series of a non-linear impact function.
|
||||
|
||||
=== Formula ===
|
||||
Consider a space-time data set $X$, containing individual spatial pattern vectors $x$, where the individual patterns are each considered as single samples from a multivariate normal distribution with mean zero and covariance matrix $C$.
|
||||
We define a linear impact function of a spatial pattern as $r^{t}x$, where $r$ is a vector of spatial weights.
|
||||
The first DCA pattern is given in terms of the covariance matrix $C$ and the weights $r$ by the proportional expression $x\propto Cr$.
|
||||
|
||||
The pattern can then be normalized to any length as required.
|
||||
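The formula above amounts to one matrix-vector product followed by a rescaling. A minimal sketch in plain Python, assuming the covariance matrix is small enough to hold as nested lists; the function name `first_dca_pattern` and the example numbers are illustrative and do not come from the source.

```python
from math import sqrt


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def first_dca_pattern(cov, weights):
    """Return C r scaled to unit length, i.e. C r / (r^t C C r)^(1/2)."""
    cr = mat_vec(cov, weights)
    norm = sqrt(dot(cr, cr))
    return [value / norm for value in cr]


# Hypothetical two-location example with anticorrelated anomalies.
cov = [[2.0, -0.6], [-0.6, 1.0]]
weights = [1.0, 1.0]  # impact = total anomaly across both locations
pattern = first_dca_pattern(cov, weights)
print(pattern, dot(weights, pattern))
```

Normalising to unit length is only one choice; as the text notes, the pattern can be rescaled to any length.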
|
||||
=== Properties ===
|
||||
If the weather or climate data is elliptically distributed (e.g., is distributed as a multivariate normal distribution or a multivariate t-distribution) then the first DCA pattern (DCA1) is defined as the spatial pattern with the following mathematical properties:
|
||||
|
||||
DCA1 maximises probability density for a given value of impact
|
||||
DCA1 maximises impact for a given value of probability density
|
||||
DCA1 maximises the product of impact and probability density
|
||||
DCA1 is the conditional expectation, conditional on exceeding a certain level of impact
|
||||
DCA1 is the impact-weighted ensemble mean
|
||||
Any modification of DCA1 will lead to a pattern that is either less extreme, or has a lower probability density.
|
||||
|
||||
=== Rainfall Example ===
|
||||
For instance, in a rainfall anomaly dataset, using an impact metric defined as the total rainfall anomaly, the first DCA pattern is the spatial pattern that has the highest probability density for a given total rainfall anomaly. If the given total rainfall anomaly is chosen to have a large value, then this pattern combines being extreme in terms of the metric (i.e., representing large amounts of total rainfall) with being likely in terms of the pattern, and so is well suited as a representative extreme pattern.
|
||||
|
||||
=== Comparison with PCA ===
|
||||
The main differences between Principal component analysis (PCA) and DCA are
|
||||
|
||||
PCA is a function of just the covariance matrix, and the first PCA pattern is defined so as to maximise explained variance
|
||||
DCA is a function of the covariance matrix and a vector direction (the gradient of the impact function), and the first DCA pattern is defined so as to maximise probability density for a given value of the impact metric
|
||||
As a result, for unit vector spatial patterns:
|
||||
|
||||
The first PCA spatial pattern always corresponds to a higher explained variance, but has a lower value of the impact metric (e.g., the total rainfall anomaly), except in degenerate cases
|
||||
The first DCA spatial pattern always corresponds to a higher value of the impact metric, but has a lower value of the explained variance, except in degenerate cases
|
||||
The degenerate cases occur when the PCA and DCA patterns are equal.
|
||||
Also, given the first PCA pattern, the DCA pattern can be scaled so that:
|
||||
|
||||
The scaled DCA pattern has the same probability density as the first PCA pattern, but higher impact, or
|
||||
The scaled DCA pattern has the same impact as the first PCA pattern, but higher probability density.
|
||||
|
||||
== Two Dimensional Example ==
|
||||
Source:
|
||||
|
||||
Figure 1 gives an example, which can be understood as follows:
|
||||
276
data/en.wikipedia.org/wiki/Directional_component_analysis-1.md
Normal file
@ -0,0 +1,276 @@
|
||||
---
|
||||
title: "Directional component analysis"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Directional_component_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:01.826255+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The two axes represent anomalies of annual mean rainfall at two locations, with the highest total rainfall anomaly values towards the top right corner of the diagram
|
||||
The joint variability of the rainfall anomalies at the two locations is assumed to follow a bivariate normal distribution
|
||||
The ellipse shows a single contour of probability density from this bivariate normal, with higher values inside the ellipse
|
||||
The red dot at the centre of the ellipse shows zero rainfall anomalies at both locations
|
||||
The blue parallel-line arrow shows the principal axis of the ellipse, which is also the first PCA spatial pattern vector
|
||||
In this case, the PCA pattern is scaled so that it touches the ellipse
|
||||
The diagonal straight line shows a line of constant positive total rainfall anomaly, assumed to be at some fairly extreme level
|
||||
The red dotted-line arrow shows the first DCA pattern, which points towards the point at which the diagonal line is tangent to the ellipse
|
||||
In this case, the DCA pattern is scaled so that it touches the ellipse
|
||||
From this diagram, the DCA pattern can be seen to possess the following properties:
|
||||
|
||||
Of all the points on the diagonal line, it is the one with the highest probability density
|
||||
Of all the points on the ellipse, it is the one with the highest total rainfall anomaly
|
||||
It has the same probability density as the PCA pattern, but represents higher total rainfall (i.e., points further towards the top right hand corner of the diagram)
|
||||
Any change of the DCA pattern will reduce either the probability density (if it moves out of the ellipse) or reduce the total rainfall anomaly (if it moves along or into the ellipse)
|
||||
In this case the total rainfall anomaly of the PCA pattern is quite small, because of anticorrelations between the rainfall anomalies at the two locations. As a result, the first PCA pattern is not a good representative example of a pattern with large total rainfall anomaly, while the first DCA pattern is.
|
||||
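The first property listed above (highest probability density among all points on the line of constant impact) can be checked numerically. The sketch below assumes a hypothetical 2×2 covariance matrix and uses $-x^{t}C^{-1}x$ as a stand-in for the log density up to constants; all function names are illustrative.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def log_density_kernel(x, cov_inv):
    """-x^t C^{-1} x: the zero-mean normal log density up to a positive scale and an additive constant."""
    return -dot(x, mat_vec(cov_inv, x))


def dca_point_on_line(cov, weights, impact):
    """Point proportional to C r that lies on the line r^t x = impact."""
    cr = mat_vec(cov, weights)
    scale = impact / dot(weights, cr)
    return [scale * value for value in cr]


cov = [[2.0, -0.6], [-0.6, 1.0]]   # assumed anticorrelated anomalies
weights = [1.0, 1.0]               # impact = total rainfall anomaly
best = dca_point_on_line(cov, weights, impact=1.8)
cov_inv = inverse_2x2(cov)

# Slide along the constant-impact line and compare densities.
for t in (-0.5, -0.1, 0.1, 0.5):
    other = [best[0] + t, best[1] - t]
    print(t, log_density_kernel(other, cov_inv) < log_density_kernel(best, cov_inv))
```

Every perturbed point has the same total anomaly but a lower density, which is the tangency between the diagonal line and the ellipse described in the figure.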
In $n$ dimensions the ellipse becomes an ellipsoid, the diagonal line becomes an $n-1$ dimensional plane, and the PCA and DCA patterns are vectors in $n$ dimensions.
|
||||
|
||||
== Applications ==
|
||||
|
||||
=== Application to Climate Variability ===
|
||||
DCA has been applied to the CRU data-set of historical rainfall variability in order to understand the most likely patterns of rainfall extremes in the US and China.
|
||||
|
||||
=== Application to Ensemble Weather Forecasts ===
|
||||
DCA has been applied to ECMWF medium-range weather forecast ensembles in order to identify the most likely patterns of extreme temperatures in the ensemble forecast.
|
||||
|
||||
=== Application to Ensemble Climate Model Projections ===
|
||||
DCA has been applied to ensemble climate model projections in order to identify the most likely patterns of extreme future rainfall.
|
||||
|
||||
== Derivation of the First DCA Pattern ==
|
||||
Source:
|
||||
Consider a space-time data-set $X$, containing individual spatial pattern vectors $x$, where the individual patterns are each considered as single samples from a multivariate normal distribution with mean zero and covariance matrix $C$.
|
||||
As a function of $x$, the log probability density is proportional to $-x^{t}C^{-1}x$.
|
||||
We define a linear impact function of a spatial pattern as $r^{t}x$, where $r$ is a vector of spatial weights.
|
||||
We then seek to find the spatial pattern that maximises the probability density for a given value of the linear impact function. This is equivalent to finding the spatial pattern that maximises the log probability density for a given value of the linear impact function, which is slightly easier to solve.
|
||||
This is a constrained maximisation problem, and can be solved using the method of Lagrange multipliers.
|
||||
The Lagrangian function is given by $L(x,\lambda )=-x^{t}C^{-1}x-\lambda (r^{t}x-1)$
|
||||
|
||||
|
||||
Differentiating by $x$ and setting to zero gives the solution $x\propto Cr$
|
||||
|
||||
|
||||
Normalising so that $x$ is a unit vector gives $x=Cr/(r^{t}CCr)^{1/2}$
|
||||
|
||||
|
||||
This is the first DCA pattern.
|
||||
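The differentiation and normalisation steps above can be written out explicitly. The following is a sketch that relies only on the symmetry of the covariance matrix $C$ (and hence of $C^{-1}$):

$$
\frac{\partial L}{\partial x} = -2C^{-1}x - \lambda r = 0
\quad\Longrightarrow\quad
x = -\frac{\lambda}{2}\,Cr \;\propto\; Cr
$$

Substituting into the constraint $r^{t}x = 1$ gives $\lambda = -2/(r^{t}Cr)$. For the unit-vector form, the squared length of $Cr$ is

$$
\lVert Cr \rVert^{2} = (Cr)^{t}(Cr) = r^{t}C^{t}Cr = r^{t}CCr ,
$$

which is the quantity under the square root in the normalised expression.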
Subsequent patterns can be derived which are orthogonal to the first, to form an orthonormal set and a method for matrix factorisation.
|
||||
|
||||
|
||||
57
data/en.wikipedia.org/wiki/Document-oriented_database-0.md
Normal file
@ -0,0 +1,57 @@
|
||||
---
|
||||
title: "Document-oriented database"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Document-oriented_database"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:05.454091+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.
|
||||
Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown alongside the adoption of NoSQL itself. XML databases are a subclass of document-oriented databases optimized for XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.
|
||||
Document-oriented databases are conceptually an extension of the key–value store, another type of NoSQL database. In key-value stores, data is treated as opaque by the database, whereas document-oriented systems exploit the internal structure of documents to extract metadata and optimize storage and queries. Although in practice the distinction can be minimal due to modern tooling, document stores are designed to provide a richer programming experience with modern programming techniques.
|
||||
Document databases differ significantly from traditional relational databases (RDBs). Relational databases store data in predefined tables, often requiring an object to be split across multiple tables. In contrast, document databases store all information for a given object in a single document, with each document potentially having a unique structure. This design eliminates the need for object-relational mapping when loading data into the database.
|
||||
|
||||
== Documents ==
|
||||
The central concept of a document-oriented database is the notion of a document. Although implementations vary in their specific definitions, document-oriented databases generally treat documents as self-contained units that encapsulate and encode data in a standardized format. Common encoding formats include XML, YAML, JSON, as well as binary representations such as BSON.
|
||||
Documents in a document store are equivalent to the programming concept of an object. They are not required to adhere to a fixed schema, and documents within the same collection may contain different fields or structures. Fields may be optional, and documents of the same logical type may differ in composition. For example, the following illustrates a document encoded in JSON:
|
||||
|
||||
A second document might be encoded in XML as:
|
||||
|
||||
The two example documents share some structural elements but also contain unique fields. The structure, text, and other data within each document are collectively referred to as the document's content and can be accessed or modified using retrieval or editing operations. Unlike relational databases, in which each record contains the same fields and unused fields are left empty, document-oriented databases do not require uniform fields across documents. This design allows new information to be added to some documents without affecting the structure of others.
|
||||
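The idea that documents in one collection may share some fields while differing in others can be illustrated with two small records. The documents below are hypothetical stand-ins invented for this sketch (they are not the examples from the original article), and `shared_and_unique_fields` is an illustrative helper.

```python
import json

# Two hypothetical documents in the same collection, with different fields.
doc_a = {
    "first_name": "Ada",
    "city": "London",
    "languages": ["English"],
}
doc_b = {
    "first_name": "Grace",
    "phone": "555-0100",
    "children": [{"name": "Sam", "age": 7}],  # nested structure is allowed
}
collection = [doc_a, doc_b]


def shared_and_unique_fields(a, b):
    """Return (fields present in both documents, fields present in only one)."""
    return set(a) & set(b), set(a) ^ set(b)


shared, unique = shared_and_unique_fields(doc_a, doc_b)
print(sorted(shared), sorted(unique))
print(json.dumps(doc_b, indent=2))  # the same document in its JSON encoding
```

Adding a new field to `doc_b` requires no change to `doc_a`, which is the schema flexibility the paragraph describes.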
Document databases often support the storage of additional metadata alongside the document content. Such metadata may relate to organizational features, security, indexing, or other implementation-specific features.
|
||||
|
||||
=== CRUD operations ===
|
||||
The core operations supported by a document-oriented database for manipulating documents are similar to those in other databases. Although terminology is not perfectly standardized, these operations are generally recognized as Create, Read, Update, and Delete (CRUD).
|
||||
|
||||
Creation (C): Adds a new document to the database.
|
||||
Retrieval (R): Retrieves documents or fields based on queries.
|
||||
Update (U): Modifies the contents of existing documents.
|
||||
Deletion (D): Removes documents from the database.
|
||||
|
||||
=== Keys ===
|
||||
Documents in a document-oriented database are addressed via a unique identifier. This identifier, often a string, URI, or path, can be used to retrieve the document from the database. Most document stores maintain an index on the key to optimize retrieval, and in some implementations the key is required when creating or inserting a new document.
|
||||
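The CRUD operations and key-based addressing described above can be sketched with an in-memory store. This is a toy model under stated assumptions, not the API of any real database: the class `DocumentStore`, its method names, and the use of UUID strings as default keys are all illustrative.

```python
import uuid
from copy import deepcopy


class DocumentStore:
    """A toy document store: documents are dicts addressed by a unique key."""

    def __init__(self):
        self._docs = {}  # key -> document; the dict acts as the key index

    def create(self, document, key=None):
        """Add a new document and return its key (generated if not supplied)."""
        key = key or str(uuid.uuid4())
        if key in self._docs:
            raise KeyError(f"key already exists: {key}")
        self._docs[key] = deepcopy(document)
        return key

    def read(self, key):
        """Retrieve a copy of the document stored under `key`."""
        return deepcopy(self._docs[key])

    def update(self, key, changes):
        """Modify individual fields of an existing document."""
        self._docs[key].update(deepcopy(changes))

    def delete(self, key):
        """Remove the document stored under `key`."""
        del self._docs[key]


store = DocumentStore()
key = store.create({"first_name": "Ada"})
store.update(key, {"city": "London"})
print(store.read(key))
store.delete(key)
```

Copying documents on the way in and out keeps callers from mutating stored data by accident; a real system would persist documents and index them rather than hold them in a dictionary.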
|
||||
=== Retrieval ===
|
||||
In addition to key-based access, document-oriented databases typically provide an API or query language that enables retrieval based on document content or associated metadata. For example, a query may return all documents with a specific field matching a given value. The available query features, indexing options, and performance characteristics vary across implementations.
|
||||
Document stores differ from key-value stores in that they exploit the internal structure and metadata of stored documents. In many key-value stores, values are treated as opaque or "black-box" data, meaning the database system does not interpret their internal structure. By contrast, document-oriented databases can classify and interpret document content. This enables queries that distinguish between types of data, for example retrieving all phone numbers containing "555" without also matching a postal code such as "55555".
|
||||
|
||||
=== Editing ===
|
||||
Document databases typically provide mechanisms for updating or editing the content or metadata of a document. Updates may involve replacing the entire document or modifying individual elements or fields within the document.
|
||||
|
||||
=== Organization ===
|
||||
Document database implementations support a variety of methods for organizing documents, including:
|
||||
|
||||
Collections: Groups of documents. Depending on the implementation, a document may be required to belong to a single collection or may be allowed in multiple collections.
|
||||
Tags and non-visible metadata: Additional data stored outside the main document content.
|
||||
Directory hierarchies: Documents organized in a tree-like structure, often based on path or URI.
|
||||
These organizational structures may differ between logical and physical representations (e.g. on disk or in memory).
|
||||
|
||||
== Relationship to other databases ==
|
||||
|
||||
=== Relationship to key-value stores ===
|
||||
A document-oriented database can be viewed as a specialized form of key-value store, which is itself a category of NoSQL database. In a basic key-value store, the stored value is typically treated as opaque by the database system. By contrast, a document-oriented database provides APIs or a query and update language that allows queries and modifications based on the internal structure of the document. For users who do not require advanced query, retrieval, or update capabilities, the distinction between document-oriented databases and key-value stores may be minimal.
|
||||
|
||||
=== Relationship to search engines ===
|
||||
Some search engine and information retrieval systems, such as Apache Solr and Elasticsearch, provide document storage and support core document operations. As a result, they may meet certain functional definitions of a document-oriented database, although their primary design goals differ.
|
||||
48
data/en.wikipedia.org/wiki/Document-oriented_database-1.md
Normal file
@ -0,0 +1,48 @@
|
||||
---
|
||||
title: "Document-oriented database"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Document-oriented_database"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:05.454091+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Relationship to relational databases ===
|
||||
In a relational database, data is organized into predefined types represented as tables. Each table contains rows (records) with a fixed set of columns (fields), so all records in a table share the same structure. Administrators typically define indexes on selected fields to improve query performance. A central principle of relational database design is database normalization, in which data that might otherwise be repeated is stored in separate tables and linked using keys. When records in different tables are related, a foreign key is used to associate them.
|
||||
For example, an address book application may store a contact's name, image, phone numbers, mailing addresses, and email addresses. In a normalized relational design, separate tables might be created for contacts, phone numbers, and email addresses. The phone number table would include a foreign key referencing the associated contact. To reconstruct a complete contact record, the database retrieves related information from each table using the foreign keys and combines it into a single record.
|
||||
In contrast, a document-oriented database stores all data related to an object within a single document, which is stored in the database as a single entry. In the address book example, the contact's name, image, and contact information may be stored together in one document. The document is retrieved using a unique key, and all related information is returned together, without needing to look up multiple tables.
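A minimal sketch of the address book comparison, using Python's built-in `sqlite3` module for the relational side and a plain dictionary as a stand-in for a document store. The table names, column names, and the `contact:1` key are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone (contact_id INTEGER REFERENCES contact(id), number TEXT);
    INSERT INTO contact VALUES (1, 'Ada');
    INSERT INTO phone VALUES (1, '555-0100'), (1, '555-0101');
""")

# Relational: reassemble the contact by following the foreign key.
rows = conn.execute(
    "SELECT c.name, p.number FROM contact c "
    "JOIN phone p ON p.contact_id = c.id WHERE c.id = ? ORDER BY p.number",
    (1,),
).fetchall()
rebuilt = {"name": rows[0][0], "phones": [number for _, number in rows]}

# Document-oriented: the same data kept as one entry under one key.
documents = {"contact:1": {"name": "Ada", "phones": ["555-0100", "555-0101"]}}
fetched = documents["contact:1"]
```

Both paths yield the same record; the difference is whether it is assembled at query time or stored already assembled.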
|
||||
A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in a database, and documents can change in type and form over time. For example, a new field such as COUNTRY_FLAG can be added to new documents as they are inserted without affecting existing documents. To aid retrieval, document-oriented systems generally allow the administrator to provide hints to the database for locating certain types of information. These hints work in a similar fashion to indexes in relational databases. Many systems also allow additional metadata outside the content of the document itself, such as tagging entries as part of an address book, which enables retrieval of related information, for instance, all the address book entries. This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables).
|
||||
In the traditional normalized relational model, objects in the database are represented as separate rows of data, with no inherent structure beyond what is defined in the tables. This can create difficulties when translating programming objects to and from their corresponding database rows, a challenge known as object-relational impedance mismatch. In contrast, document stores often map programming objects directly into the database, preserving much of their internal structure. Databases using this approach are frequently described as NoSQL systems.
|
||||
|
||||
== Implementations ==
|
||||
|
||||
=== XML database implementations ===
|
||||
|
||||
Most XML databases are document-oriented databases.
|
||||
|
||||
== See also ==
|
||||
Database theory
|
||||
Data hierarchy
|
||||
Data analysis
|
||||
Full-text search
|
||||
In-memory database
|
||||
Internet Message Access Protocol (IMAP)
|
||||
Machine-readable document
|
||||
Multi-model database
|
||||
NoSQL
|
||||
Object database
|
||||
Online database
|
||||
Real-time database
|
||||
Relational database
|
||||
Content management system
|
||||
|
||||
== Notes ==
|
||||
|
||||
== References ==
|
||||
|
||||
== Further reading ==
|
||||
Arkin, Assaf. (September 20, 2007). "Read Consistency: Dumb Databases, Smart Services." Labnotes: Don't Let the Bubble Go to Your Head! Archived March 27, 2008 at the Wayback Machine.
|
||||
|
||||
== External links ==
|
||||
DB-Engines Ranking of Document Stores by popularity, updated monthly
|
||||
162
data/en.wikipedia.org/wiki/Double_mass_analysis-0.md
Normal file
@ -0,0 +1,162 @@
|
||||
---
|
||||
title: "Double mass analysis"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Double_mass_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:07.787564+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Double mass analysis is a simple graphical method to evaluate the consistency of hydrological data. The double mass (DM) approach plots the cumulative data of one variable against the cumulative data of a second variable. A break in the slope of a linear function fit to the data is thought to represent a change in the relation between the variables. This provides a robust way to detect a change in the behavior of precipitation and recharge. It is a commonly used data analysis approach for investigating the behaviour of records made of hydrological or meteorological data at a number of locations. It is used to determine whether the data need to be corrected to account for changes in data collection procedures or other local conditions. Such changes may result from a variety of things, including changes in instrumentation, changes in observation procedures, or changes in gauge location or surrounding conditions. Checking a hydrological or meteorological record for consistency with double mass analysis is considered an essential step before the record is used for further analysis. This method is based on the hypothesis that each item of the recorded data of a population is consistent.
|
||||
An example of a double mass analysis is a "double mass plot", or "double mass curve". For this, points and/or a joining line are plotted where the x- and y-coordinates are determined by the running totals of the values observed at two stations. If both stations are affected to the same extent by the same trends, then a double mass curve should follow a straight line. A break in the slope of the curve would indicate that conditions have changed at one location but not at another. Breaks in the double-mass curve of such variables are caused by changes in the relation between the variables. These changes may be due to changes in the method of data collection or to physical changes that affect the relation. This technique is based on the principle that when all recorded data come from the same parent population, they are consistent.
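As a rough sketch, the coordinates of a double mass curve can be computed from running totals. The station values below are invented for illustration, and the per-segment slope is just one simple way to make a break visible.

```python
from itertools import accumulate

# Hypothetical annual totals (e.g. precipitation in mm) at two stations.
reference = [100, 110, 90, 105, 95, 100]
test_station = [50, 55, 45, 84, 76, 80]   # relation changes after year 3

# Running totals give the x- and y-coordinates of the double mass curve.
x = list(accumulate(reference))
y = list(accumulate(test_station))

# Slope of each segment between consecutive points on the curve.
slopes = [(y[i] - y[i - 1]) / (x[i] - x[i - 1]) for i in range(1, len(x))]
```

In this made-up record the segment slopes are about 0.5 up to the third year and about 0.8 afterwards, which would show up as a break in the plotted curve.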
|
||||
|
||||
|
||||
== Procedure for Data Correction ==
|
||||
Let $(x_i, y_i)$ be the data points; then the procedure for double mass analysis is as follows:
|
||||
|
||||
Divide the data into $n_i$ distinct categories of equal slope ($S_i$).
|
||||
Obtain the correction factor for category $n_{i+1}$ as $c_i = \frac{S_i}{S_{i+1}}$.
|
||||
|
||||
|
||||
Multiply the data in category $n_{i+1}$ by $c_i$ to get the corrected data.
|
||||
After correction, repeat this process until all data points have the same slope.
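The procedure above can be sketched as follows, under several assumptions that are not part of the source: the break positions are supplied by the analyst (the `breaks` argument is hypothetical), the slope of a category is taken as the ratio of accumulated totals over it, and the first category is treated as the reliable one.

```python
def segment_slope(ref, obs):
    # Slope of a double mass segment, taken here as the ratio of the
    # accumulated observed total to the accumulated reference total.
    return sum(obs) / sum(ref)


def correct_record(ref, obs, breaks):
    """Scale each later category so its slope matches the preceding one.

    `breaks` lists the indices at which a new slope category starts.
    """
    corrected = list(obs)
    bounds = [0, *breaks, len(obs)]
    for k in range(1, len(bounds) - 1):
        prev = slice(bounds[k - 1], bounds[k])
        cur = slice(bounds[k], bounds[k + 1])
        s_prev = segment_slope(ref[prev], corrected[prev])   # S_i
        s_cur = segment_slope(ref[cur], corrected[cur])      # S_{i+1}
        c = s_prev / s_cur                                   # c_i = S_i / S_{i+1}
        corrected[cur] = [value * c for value in corrected[cur]]
    return corrected


reference = [100, 110, 90, 105, 95, 100]
observed = [50, 55, 45, 84, 76, 80]
corrected = correct_record(reference, observed, breaks=[3])
```

Because each category is corrected against the already corrected category before it, a single pass brings every category to the slope of the first one in this sketch.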
|
||||
|
||||
|
||||
== See also ==
|
||||
Statistics
|
||||
|
||||
|
||||
== Notes ==
|
||||
|
||||
|
||||
== Further reading ==
|
||||
Dubreuil P. (1974) Initiation à l'analyse hydrologique. Masson & Cie et ORSTOM, Paris.
|
||||
46
data/en.wikipedia.org/wiki/Error_level_analysis-0.md
Normal file
@ -0,0 +1,46 @@
|
||||
---
|
||||
title: "Error level analysis"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Error_level_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:08.944694+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Error level analysis (ELA) is the analysis of compression artifacts in digital data with lossy compression such as JPEG.
|
||||
|
||||
|
||||
== Principles ==
|
||||
When used, lossy compression is normally applied uniformly to a set of data, such as an image, resulting in a uniform level of compression artifacts.
|
||||
Alternatively, the data may consist of parts with different levels of compression artifacts. This difference may arise from the different parts having been repeatedly subjected to the same lossy compression a different number of times, or the different parts having been subjected to different kinds of lossy compression. A difference in the level of compression artifacts in different parts of the data may therefore indicate that the data has been edited.
|
||||
In the case of JPEG, even a composite with parts subjected to matching compressions will have a difference in the compression artifacts.
|
||||
In order to make the typically faint compression artifacts more readily visible, the data to be analyzed is subjected to an additional round of lossy compression, this time at a known, uniform level, and the result is subtracted from the original data under investigation. The resulting difference image is then inspected manually for any variation in the level of compression artifacts. In 2007, N. Krawetz denoted this method "error level analysis".
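The recompress-and-subtract idea can be illustrated with a toy model. Here lossy compression is replaced by simple quantization of a list of numbers, which is an assumption made to keep the sketch self-contained; it is not JPEG and the sample values are invented.

```python
def lossy(values, step):
    # Stand-in for lossy compression: snap each sample to a multiple of `step`.
    return [step * round(v / step) for v in values]


original = [13, 27, 41, 58, 66, 79, 94, 102]
saved = lossy(original, 10)              # whole signal compressed once
edited = saved[:4] + [63, 77, 91, 108]   # second half replaced by fresh data

# Error level analysis: recompress at a known, uniform level and
# take the difference from the data under investigation.
recompressed = lossy(edited, 10)
error_level = [abs(a - b) for a, b in zip(edited, recompressed)]
```

The untouched first half has already been quantized, so recompressing it changes nothing, while the replaced second half produces a nonzero difference. With real images the contrast is far less clean, which is part of why interpretation is subjective.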
|
||||
Additionally, digital data formats such as JPEG sometimes include metadata describing the specific lossy compression used. If in such data the observed compression artifacts differ from those expected from the given metadata description, then the metadata may not describe the actual compressed data, and thus indicate that the data have been edited.
|
||||
|
||||
|
||||
== Limitations ==
|
||||
By its nature, data without lossy compression, such as a PNG image, cannot be subjected to error level analysis. Moreover, editing could have been performed on data without lossy compression, with lossy compression then applied uniformly to the edited, composite data. The presence of a uniform level of compression artifacts therefore does not rule out editing of the data.
|
||||
Additionally, any non-uniform compression artifacts in a composite may be removed by subjecting the composite to repeated, uniform lossy compression. Also, if the image color space is reduced to 256 colors or fewer, for example by conversion to GIF, then error level analysis will generate useless results.
|
||||
More significantly, the actual interpretation of the level of compression artifacts in a given segment of the data is subjective, and the determination of whether editing has occurred is therefore not robust.
|
||||
|
||||
|
||||
== Controversy ==
|
||||
In May 2013, Dr Neal Krawetz used error level analysis on the 2012 World Press Photo of the Year and concluded on his Hacker Factor blog that it was "a composite" with modifications that "fail to adhere to the acceptable journalism standards used by Reuters, Associated Press, Getty Images, National Press Photographer's Association, and other media outlets". The World Press Photo organizers responded by letting two independent experts analyze the image files of the winning photographer and subsequently confirmed the integrity of the files. One of the experts, Hany Farid, said about error level analysis that "It incorrectly labels altered images as original and incorrectly labels original images as altered with the same likelihood". Krawetz responded by clarifying that "It is up to the user to interpret the results. Any errors in identification rest solely on the viewer".
|
||||
In May 2015, the citizen journalism team Bellingcat wrote that error level analysis revealed that the Russian Ministry of Defense had edited satellite images related to the Malaysia Airlines Flight 17 disaster. In a reaction to this, image forensics expert Jens Kriese said about error level analysis: "The method is subjective and not based entirely on science", and that it is "a method used by hobbyists". On his Hacker Factor blog, Neal Krawetz, the inventor of error level analysis, criticized both Bellingcat's use of error level analysis as "misinterpreting the results" and, on several points, Jens Kriese's "ignorance" regarding error level analysis.
|
||||
|
||||
|
||||
== See also ==
|
||||
Image analysis
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
"Tutorial: Error Level Analysis". fotoforensic.com. Retrieved 2015-07-23.
|
||||
|
||||
|
||||
== External links ==
|
||||
Image Forensics : Error Level Analysis
|
||||
FotoForensics
|
||||
Fake Image Detector using Error Level Analysis
|
||||
31
data/en.wikipedia.org/wiki/Facet_theory-0.md
Normal file
@ -0,0 +1,31 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 1/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Facet theory is a metatheory for the multivariate behavioral sciences that posits that scientific theories and measurements can be advanced by discovering relationships between conceptual classifications of research variables and empirical partitions of data-representation spaces. For this purpose, facet theory proposes procedures for (1) constructing or selecting variables for observation, using the mapping sentence technique (a formal definitional framework for a system of observations), and (2) analyzing multivariate data, using data representation spaces, notably those depicting similarity measures (e.g., correlations), or partially ordered sets, derived from the data.
|
||||
Facet theory is characterized by its direct concern with the entire content-universe under study, containing many, possibly infinitely many, variables. Observed variables are regarded as just a sample of statistical units from the multitude of variables that make up the investigated attribute (the content-universe). Hence, facet theory proposes techniques for sampling variables for observation from the entire content universe, and for making inferences from the sample of observed variables to the entire content universe. The sampling of variables is done with the aid of the mapping sentence technique (see Section 1); and inferences from the sample of observed variables to the entire content universe are made with respect to correspondences between conceptual classifications (of attribute-variables or of population-members) and partitions of empirical geometric representation spaces obtained in data analysis (see Sections 2 & 3).
|
||||
Of the many types of representation spaces that have been proposed, two stand out as especially fruitful: Faceted-SSA (Faceted Smallest Space Analysis) for structuring the investigated attribute (see Section 2); and POSAC (Partial Order Scalogram Analysis by base Coordinates) for multiple scaling measurements of the investigated attribute (see Section 3).
|
||||
Inasmuch as observed variables in a behavioral study form in fact but a sample from the content-universe of interest, facet theory's procedures and principles serve to avoid errors that may ensue from incidental sampling of observed variables, thus meeting the challenge of the replication crisis in psychological research and in behavioral research in general.
|
||||
Facet Theory was initiated by Louis Guttman and has been further developed and applied in a variety of disciplines of the behavioral sciences including psychology, sociology, and business administration.
|
||||
|
||||
== The mapping sentence ==
|
||||
|
||||
=== Definition and properties of the mapping sentence ===
|
||||
Definition (Guttman). A mapping sentence is a verbal statement of the domain and of the range of a mapping including connectives between facets as in ordinary language.
|
||||
In the context of behavioral research, a mapping sentence is essentially a function whose domain consists of the respondents and of the stimuli as arguments, and whose image consists of the cartesian product of the ranges of responses to the stimuli, where each response-range is similarly ordered from high to low with respect to a concept common to all stimuli. When stimuli are classified a priori by one or more content criteria, the mapping sentence facilitates stratified sampling of the content-universe. A classification of the stimuli by their content is called a content facet; and the pre-specified set of responses to a stimulus (classifying respondents by their response to that stimulus) is called a range facet.
|
||||
The mapping sentence defines the system of observations to be performed. As such, the mapping sentence provides also the essential concepts in terms of which research hypotheses may be formulated.
|
||||
|
||||
=== An example from intelligence research ===
|
||||
Suppose members $p_i$ of a population $P$ are observed with respect to their success in a written verbal intelligence test. Such observations may be described as a mapping from the observed population to the set of possible scores, say, $R = \{1, \dots, 10\}$: $P \xrightarrow{q_1} R$, where $q_1$ is the sense in which a specific score is assigned to every individual in the observed population $P$, i.e., $q_1$ is "verbal intelligence" in this example. Now, one may be interested in observing also the mathematical or, more specifically, the numerical intelligence of the investigated population; and possibly also their spatial intelligence. Each of these kinds of intelligence is a "sense" in which population members $p_i$ may be mapped into a range of scores $R = \{1, \dots, 10\}$. Thus, 'intelligence' is now differentiated into three types of materials: verbal ($q_1$), numerical ($q_2$) and spatial ($q_3$). Together, $P$, the population, and $Q = \{q_1, q_2, q_3\}$, the set of types of intelligence, form a cartesian product which constitutes the mapping domain. The mapping is from the set of pairs $(p_i, q_j)$ to the common range of test-scores $R = \{1, \dots, 10\}$: $P \times Q \to R$.
|
||||
A facet is a set that serves as a component-set of a cartesian product. Thus, P is called the population facet, Q is called a content facet, and the set of scores obtainable for each test is a range facet. The range facets of the various items (variables) need not be identical in size: they may have any finite number of scores, or categories, greater than or equal to 2.
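The mapping from the domain P × Q to the range R can be sketched with a cartesian product. The respondent labels and the scores below are invented placeholders; only the structure of the mapping is taken from the text.

```python
from itertools import product

P = ["p1", "p2", "p3"]                   # population facet (hypothetical respondents)
Q = ["verbal", "numerical", "spatial"]   # content facet: q1, q2, q3
R = range(1, 11)                         # range facet: scores 1..10

# The mapping domain is the cartesian product P x Q.
domain = list(product(P, Q))

# Placeholder observations: one score in R for every (p_i, q_j) pair.
scores = {pair: 1 + (i * 3) % 10 for i, pair in enumerate(domain)}
```

Every pair in the domain receives exactly one score from the common range, which is what makes the observations a mapping in the sense used above.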
|
||||
|
||||
=== The Common Meaning Range (CMR) ===
|
||||
The ranges of the items pertaining to an investigated content-universe – intelligence in this example – should all have a Common Meaning Range (CMR); that is, they must be ordered from high to low with respect to a common meaning. Following Guttman, the common meaning proposed for the ranges of intelligence-items is "correctness with respect to an objective rule".
|
||||
The concept of CMR is central in facet theory: It serves to define the content-universe being studied by specifying the universe of items pertaining to that content-universe. Thus, the mapping-definition of intelligence, advanced by facet theory is:
|
||||
"An item belongs to the universe of intelligence items if and only if its domain requires performance of a cognitive task concerning an objective rule and its range is ordered from high correctness to low correctness with respect to that rule."
|
||||
34
data/en.wikipedia.org/wiki/Facet_theory-1.md
Normal file
@ -0,0 +1,34 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 2/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
An initial framework for observing intelligence could be Mapping Sentence 1.
|
||||
The mapping sentence serves as a unified semantic device for specifying the system of intelligence test items, according to the present conceptualization. Its content facet, the material facet, may now serve as a classification of intelligence test items to be considered. Thus, in designing observations, a stratified sampling of items is afforded by ensuring an appropriate selection of items from each of the material facet elements; that is, from each class of items: the verbal, the numerical and the spatial.
|
||||
|
||||
=== Enriching the mapping sentence ===
|
||||
The research design can be enriched by introducing to the mapping sentence an additional, independent classification of the observations in the form of an additional content-facet, thereby facilitating systematic differentiations of the observations. For example, intelligence items may be classified also according to the cognitive operation required in order to respond correctly to an item: whether rule-recall (memory), rule-application, or rule-inference. Instead of the three sub-content-universes of intelligence defined by the material facet alone, we now have nine sub-content-universes defined by the cartesian multiplication of the material and the mental-operation facets. See mapping sentence 2.
|
||||
|
||||
Another way of enriching a mapping sentence (and the scope of the research) is by adding an element (a class) to an existing content facet; for example, by adding Interpersonal material as a new element to the extant material facet. See Mapping Sentence 3.
|
||||
|
||||
=== Content profiles ===
|
||||
A selection of one element from each of the two content facets defines a content profile which represents a sub-content-universe of intelligence. For example, the content profile $(c_2, q_2)$ represents the application of rules for performing mathematical computations, such as performing long division. The 3×4=12 sub-content-universes constitute twelve classes of intelligence items. In designing observations, the researcher would strive to include a number of varied items from each of these 12 classes so that the sample of observed items would be representative of the entire intelligence universe. Of course, this stratified sampling of items depends on the researchers' conception of the studied domain, reflected in their choice of content-facets. But, in the larger cycle of the scientific investigation (which includes Faceted SSA of empirical data, see next section), this conception may undergo adjustments and remolding, converging to improved choices of content-facets and observations, and ultimately to robust theories in the research domain. In general, mapping sentences may attain high levels of complexity, size and abstraction through various logical operations such as recursion, twist, decomposition and completion.
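The twelve content profiles can be enumerated as the product of the two content facets. In this sketch the labels `c1` and `c3` and the order of the material elements are assumptions; the text only fixes $(c_2, q_2)$ as rule-application on numerical material. The `items_per_class` quota is likewise hypothetical.

```python
from itertools import product

operations = {"c1": "rule-recall", "c2": "rule-application", "c3": "rule-inference"}
materials = {"q1": "verbal", "q2": "numerical", "q3": "spatial", "q4": "interpersonal"}

# Each content profile picks one element from each content facet.
profiles = list(product(operations, materials))

# A simple stratified sampling plan: the same quota of items per class.
items_per_class = 2
plan = {profile: items_per_class for profile in profiles}
```

A researcher would then write or select that many varied items for each of the 12 classes.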
|
||||
|
||||
=== Cartesian decomposition and completion: an example ===
|
||||
In drafting a mapping sentence, an effort is made to include the most salient content-facets, according to the researcher's existing conception of the investigated domain. For each content facet, an attempt is made to specify its elements (classes) so that they are exhaustive (complete) and mutually exclusive (non-overlapping). Thus, the element 'interpersonal' has been added to the incumbent 3-element material facet of intelligence by a two-step facet-analytic procedure. Step 1, cartesian decomposition of the 3-element material facet into two binary elementary facets: the Environment Facet, whose elements are 'physical environment' and 'human environment'; and the Symbolization Facet, whose elements are 'symbolic' (or high symbolization) and 'concrete' (or low symbolization). Step 2, cartesian completion of the material facet is then sought by attempting to infer the missing material classifiable as 'human environment' and 'concrete'.
|
||||
|
||||
In facet theory, this 2×2 classification of intelligence-testing material may now be formulated as a hypothesis to be tested empirically, using Faceted Smallest Space Analysis (SSA).
|
||||
|
||||
=== Complementary topics concerning the mapping sentence ===
|
||||
Despite its seemingly rigid appearance, the mapping sentence format can accommodate complex semantic structures such as twists and recursions, while retaining its essential cartesian structure.
|
||||
In addition to guiding the collection of data, mapping sentences have been used to content-analyze varieties of conceptualizations and texts—such as organizational quality, legal documents and even dream stories.
|
||||
|
||||
== Concepts as spaces: faceted SSA ==
|
||||
|
||||
=== Description of Faceted Smallest Space Analysis (Faceted SSA) ===
|
||||
Facet theory conceives of a multivariate attribute as a content-universe defined by the set of all its items, as specified by the attribute mapping-definition, illustrated above. In facet-theoretical data analysis, the attribute (e.g., intelligence) is likened to a geometric space of suitable dimensionality, whose points represent all possible items. Observed items are processed by Faceted SSA, a version of Multidimensional Scaling (MDS) which involves the following steps:
|
||||
38
data/en.wikipedia.org/wiki/Facet_theory-2.md
Normal file
@ -0,0 +1,38 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 3/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Receiving as input (or computing from input data) a matrix of similarity coefficients, specifying, for each pair of items how similar they are. A common example is the computation of a correlation-coefficient matrix from input data, where the size of a correlation coefficient between two variables reflects the degree of similarity between them.
|
||||
Mapping the items (variables) as points in a geometric space of a given dimensionality while preserving as well as possible the condition: if $r_{ij} > r_{kl}$ then $d_{ij} < d_{kl}$ for all $i, j, k, l$, where $r_{ij}$ is the similarity measure (e.g., correlation coefficient) between variables $i, j$ and $d_{ij}$ is the distance between their points in the space. Most often, the Euclidean distance function (Minkowski distance of order 2) is used, but other distance functions, especially the Manhattan distance function (Minkowski distance of order 1), are sometimes called for. (See Subsection Relating POSAC Measurement Space to the SSA Concept Space below.) The goodness-of-fit of the resulting mapping may be assessed by a loss function: Kruskal's Stress coefficient or Guttman's Coefficient of Alienation.
|
||||
Partitioning the space as well as possible into simple regions (stripes, sectors or concentric rings) whose variables are in 1-1 correspondence with a pre-conceived content-facet. To run this option, content facet(s) must be specified as Faceted SSA input.
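The order condition in the second step (higher similarity should mean smaller distance) can be checked for a candidate configuration as sketched below. This is only a violation counter written for illustration: it is not the SSA algorithm itself, nor Kruskal's Stress or Guttman's Coefficient of Alienation, and the similarity values and coordinates are invented.

```python
from itertools import combinations
from math import dist


def rank_violations(similarity, points):
    """Count pairs of item-pairs that break: r_ij > r_kl implies d_ij < d_kl."""
    pairs = list(combinations(sorted(points), 2))
    d = {(i, j): dist(points[i], points[j]) for i, j in pairs}  # Euclidean
    violations = 0
    for a, b in combinations(pairs, 2):
        if similarity[a] > similarity[b] and not d[a] < d[b]:
            violations += 1
        elif similarity[b] > similarity[a] and not d[b] < d[a]:
            violations += 1
    return violations


similarity = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.3}
good_map = {"A": (0, 0), "B": (1, 0), "C": (4, 3)}
bad_map = {"A": (0, 0), "B": (5, 0), "C": (1, 0)}

good_count = rank_violations(similarity, good_map)
bad_count = rank_violations(similarity, bad_map)
```

In `good_map` the most similar pair (A, B) is also the closest, so no violations are counted; in `bad_map` the distance order is reversed relative to the similarities.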
|
||||
Step 3 of Faceted SSA incorporates the idea that the observed variables included in the Faceted SSA procedure typically constitute a small subset of the countless items that define the attribute content-universe. But their locations in space may serve as clues that guide the partitioning of the space into regions, in effect classifying all points in space, including those pertaining to unobserved items (had they been observed). This procedure, then, tests the regional hypothesis that the sub-content-universes defined by a content facet's elements each exist as a distinct empirical entity. The Shye-Kingsley Separation Index (SI) assesses the goodness-of-fit of the partition to the content-facet.
|
||||
The spatial scientific imagery suggested by Facet Theory has far reaching consequences that set Facet Theory apart from other statistical procedures and research strategies. Specifically, it facilitates inferences concerning the structure of the entire content-universe investigated, including unobserved items.
|
||||
|
||||
=== Example 1. The structure of intelligence ===
|
||||
Intelligence testing has been conceived as described above, with Mapping Sentence 2 as a framework for its observation. In many studies, different samples of variables conforming to Mapping Sentence 2 have been analyzed, confirming two regional hypotheses:
|
||||
|
||||
The Material Content Facet corresponds to a partition of the Faceted SSA map of intelligence into sectors, each containing the items of a single material — verbal, numeric, and figural (spatial).
|
||||
The Cognitive Operation Facet corresponds to a partition of the Faceted SSA map of intelligence into concentric rings, with the innermost ring containing inference items; the middle ring containing the rule-application items; and the outermost ring containing the rule-recall items.
|
||||
The superposition of these two partition patterns results in a scheme known as the Radex Theory of Intelligence, see Figure 1.
|
||||
The radex structure, which originated earlier as "a new approach to factor analysis", has been found also in the study of color perception as well as in other domains of research.
|
||||
Faceted SSA has been applied in a wide variety of research areas, including value research, social work, criminology, and many others.
|
||||
|
||||
=== Example 2. The structure of quality of life ===
|
||||
The Systemic Quality of Life (SQOL) has been defined as the effective functioning of human individuals in four functioning subsystems: the cultural, the social, the physical and the personality subsystems. The axiomatic foundations of SQOL suggest the regional hypothesis that the four subsystems should be empirically validated (i.e., the items of each would occupy a distinct region) and that they should be mutually oriented in space in a specific 2x2 pattern topologically equivalent to the 2x2 classification shown in Figure 2 (i.e., personality opposite cultural, and physical opposite social). The hypothesis has been confirmed by many studies.
|
||||
|
||||
=== Types of partition patterns ===
|
||||
Of the many possible partitions of a 2-d concept space, three stand out as especially useful for theory construction:
|
||||
|
||||
The Axial Partition Pattern: Partitioning of the space into stripes by parallel lines.
|
||||
The Angular (a/k/a polar) Partition Pattern: Partitioning of the space into sectors by radii emanating from a point in space.
|
||||
The Radial (a/k/a modular) Partition Pattern: Partitioning of the space into concentric rings by concentric circles.
|
||||
The advantages of these partition patterns as likely models for behavioral data are that they are describable by a minimal number of parameters, hence avoid overfitting; and that they are generalizable to partition in spaces of higher dimensionalities.
|
||||
In testing regional hypotheses, the fit of a content-facet to any one of these three models is assessed by the Separation Index (SI), a normalized measure of the deviation of variables from the region assigned to them by the model.
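The three partition patterns can be sketched as simple region-assignment functions. This is only an illustration, not the Faceted SSA algorithm: the function names, the choice of vertical cut lines, and the assumption of equal-width sectors around a given origin are all hypothetical simplifications, and the Separation Index itself is not computed here.

```python
import math

def axial_region(x, cuts):
    """Stripe index for a point with coordinate x, given parallel
    (here: vertical) cut lines at the positions listed in `cuts`."""
    return sum(x > c for c in cuts)

def angular_region(x, y, n_sectors, origin=(0.0, 0.0)):
    """Sector index for a point, assuming n_sectors equal-width sectors
    bounded by radii emanating from `origin` (an illustrative assumption)."""
    angle = math.atan2(y - origin[1], x - origin[0]) % (2 * math.pi)
    width = 2 * math.pi / n_sectors
    return min(int(angle / width), n_sectors - 1)

def radial_region(x, y, radii, origin=(0.0, 0.0)):
    """Ring index for a point, given concentric circles around `origin`
    with the radii listed in `radii` (0 is the innermost disc)."""
    r = math.hypot(x - origin[0], y - origin[1])
    return sum(r > c for c in radii)
```

Each pattern needs only a handful of parameters (cut positions, an origin and sector boundaries, or an origin and radii), which is the parsimony the text refers to.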
|
||||
Concept spaces in higher dimensionalities have been found as well.
|
||||
29
data/en.wikipedia.org/wiki/Facet_theory-3.md
Normal file
@ -0,0 +1,29 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 4/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Principles of faceted SSA: A summary ===
|
||||
1. The attribute under study is represented by a geometric space.
|
||||
2. Variables of the attribute are represented as points in that space. Conversely, every point in the geometric space is a variable of the attribute. This is the Continuity Principle.
|
||||
3. The observed variables, located as points in the empirical Faceted SSA map, constitute but a sample drawn from the many (possibly infinitely many) variables constituting the content universe of the attribute investigated.
|
||||
4. The observed variables chosen for SSA must all belong to the same content universe. This is ensured by including in the SSA only variables whose ranges are similarly ordered with respect to a common meaning (CMR).
|
||||
5. The sample of variables marked on the Faceted SSA map is used as a guide for inferring possible partitions of the SSA-attribute-map into distinct regions, each region representing a component, or subdomain, of the attribute.
|
||||
6. In Facet Theory, relationships between attribute components (such as verbal intelligence and numeric intelligence as components of intelligence) are expressed in geometric terms, such as shapes and spatial orientation, rather than in algebraic terms, just as one would describe relationships between neighboring countries in terms of their shapes and geographical orientation, not in terms of distances between them.
|
||||
7. The imagery of an attribute as a continuous space, from which variables are sampled, implies that clustering of variables in the SSA map has no significance: it is just an artifact of the sampling of the variables. Sampled variables that are clustered together may belong to different subdomains, just as two cities that are close together may be located in different countries. Conversely, variables that are far apart may belong to the same subdomain, just as two cities that are far apart may belong to the same country. What matters is the identification of distinct regions with well-defined subdomains. Facet Theory proposes a way of transcending accidental clustering of variables by focusing on a robust and replicable aspect of the data, namely the partitionability of the attribute-space.
|
||||
These principles bring in new concepts, raise new questions, and open new ways of understanding behavior. Thus, Facet Theory represents a paradigm of its own for multivariate behavioral research.
|
||||
|
||||
=== Complementary topics in faceted SSA ===
|
||||
Besides analyzing a data matrix of N individuals by n variables, as discussed above, Faceted SSA is usefully employed in additional modes.
|
||||
Direct measures of (dis)similarity. Given a set of objects and a similarity (or dissimilarity) measure between every pair of objects, Faceted SSA can provide a map whose regions correspond to a specified classification of the objects. For example, in a study of color perception, a sample of spectral colors, with a measure of perceived similarity between every pair of colors, yielded the radex theory of spectral color perception. In a study of community elites, a measure of distance devised between pairs of community leaders yielded a sociometric map whose regions were interpreted from the perspective of sociological theory.
|
||||
Transposed data matrix. Switching the roles of individuals and variables, Faceted SSA may be applied to individuals rather than to the variables. This rarely used procedure may be justified to the extent that the variables evenly cover a research domain. For example, intercorrelations between members of a multidisciplinary team of experts were computed based on their human quality-of-life value assessments. The resulting Faceted SSA map yielded a radex of disciplines, supporting the association between social institutions and human values.
|
||||
|
||||
== Multiple scaling by POSAC ==
|
||||
|
||||
=== Description of Partial Order Scalogram Analysis by Coordinates (POSAC) ===
|
||||
In Facet Theory, the measurement of investigated individuals (and, by extension, of all individuals belonging to the sampled population) with respect to a multivariate attribute, is based on the following assumptions and conditions:
|
||||
77
data/en.wikipedia.org/wiki/Facet_theory-4.md
Normal file
@ -0,0 +1,77 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 5/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Variables processed by Facet Theory measurement operations to be described below, evenly cover the attribute content universe. To ensure such coverage, Facet Theory measurement operations are often performed not on the sample of the observed items themselves, but rather on composite variables that represent facet elements that had been validated by Faceted SSA.
|
||||
The sample of individuals is rich enough to allow existing score-profiles of the processed variables to be observed.
|
||||
In the resulting measurement, order relations among individuals should preserve sufficiently well order relations (including comparability and incomparability; see below) between individuals' profiles of the processed variables.
|
||||
The result of the measurement operation yields the smallest number of scales;
|
||||
The resultant scales represent fundamental variables whose interpretation derives from the contents of the observed items, but does not depend on the particular sample of items observed.
|
||||
Partial order analysis of observed data. Let observed items v1, ..., vn with a common-meaning range (CMR) represent an investigated content universe; let A1, ..., An be their ranges, with each Aj ordered from high to low with respect to the common meaning; and let A = A1 × A2 × ... × An be the Cartesian product of all the range facets, Aj (j = 1, ..., n). A system of observations is a mapping P → A from the observed subjects P to A, that is, each subject pi gets a score from each Aj (j = 1, ..., n), or pi → [ai1, ai2, ..., ain] ≡ a(pi). The point a(pi) in A is also called the profile of pi, and the subset A′ of A (A′ ⊆ A) of observed profiles is called a scalogram. Facet Theory defines relations between profiles as follows: Two different profiles ai = [ai1, ai2, ..., ain] and aj = [aj1, aj2, ..., ajn] are comparable, denoted by aiSaj, with ai greater than aj, ai > aj, if and only if aik ≥ ajk for all k = 1, ..., n, and aik′ > ajk′ for some k′. Two different profiles are incomparable, denoted by ai $ aj, if neither ai > aj nor aj > ai. A, and therefore its subset A′, form a partially ordered set.
|
||||
Facet Theoretical measurement consists in mapping points a(pi) of A' into a coordinate space X of the lowest dimensionality while preserving observed order relations, including incomparability:
|
||||
Definition. The p.o. dimensionality of scalogram A′ is the smallest m (m ≤ n) for which there exist m facets X1, ..., Xm (each Xi is ordered) and there exists a 1-1 mapping Q: X′ → A′ from X′ (X′ ⊆ X = X1 × ⋯ × Xm) to A′ such that a > a′ if and only if x > x′ whenever Q maps points x, x′ in X′ to points a, a′ ∈ A.
|
||||
The coordinate scales, Xi (i = 1, ..., m), represent underlying fundamental variables whose meanings must be inferred in any specific application. The well-known Guttman scale [24] (example: 1111, 1121, 1131, 2131, 2231, 2232) is simply a 1-d scalogram, i.e., one all of whose profiles are comparable.
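The comparability relation between profiles, and the claim that the Guttman scale example is a scalogram in which every pair of profiles is comparable, can be checked with a short sketch. The function names `compare` and `is_guttman_scale` are illustrative choices, not part of any Facet Theory software.

```python
from itertools import combinations

def compare(a, b):
    """Compare two equal-length score profiles.

    Returns '>' if a is greater than b (every score at least as high and
    the profiles differ), '<' for the reverse, '=' if they are identical,
    and 'incomparable' otherwise.
    """
    a_ge_b = all(x >= y for x, y in zip(a, b))
    a_le_b = all(x <= y for x, y in zip(a, b))
    if a_ge_b and a_le_b:
        return "="
    if a_ge_b:
        return ">"
    if a_le_b:
        return "<"
    return "incomparable"

def is_guttman_scale(profiles):
    """True if every pair of profiles is comparable (a 1-d scalogram)."""
    return all(compare(a, b) != "incomparable"
               for a, b in combinations(profiles, 2))

# The Guttman scale example quoted in the text.
guttman_example = [(1, 1, 1, 1), (1, 1, 2, 1), (1, 1, 3, 1),
                   (2, 1, 3, 1), (2, 2, 3, 1), (2, 2, 3, 2)]
```

Calling `is_guttman_scale(guttman_example)` returns `True`, whereas a set containing, say, both `(1, 0)` and `(0, 1)` would not qualify, because those two profiles are incomparable.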
|
||||
The procedure of identifying and interpreting the coordinate scales X1, ..., Xm is called multiple scaling. Multiple scaling is facilitated by partial order scalogram analysis by base coordinates (POSAC), for which algorithms and computer programs have been devised. In practice, a particular dimensionality is attempted and a solution that best accommodates the order-preserving condition is sought. The POSAC/LSA program finds an optimal solution in 2-d coordinate space, then goes on to analyze by Lattice Space Analysis (LSA) the role played by each of the variables in structuring the POSAC 2-space, thereby facilitating interpretation of the derived coordinate scales, X1, X2. Recent developments include algorithms for computerized partitioning of the POSAC space by the range facet of each variable, which induces meaningful intervals on the coordinate scales, X, Y.
|
||||
|
||||
=== Example 3. TV watching patterns: analysis of simplified survey data ===
|
||||
Source:
|
||||
Members of a particular population were asked four questions: whether they watched TV the night before for an hour at 7 PM (hour 1), at 8 PM (hour 2), at 9 PM (hour 3) and at 10 PM (hour 4). A positive answer to a question was recorded as 1, and a negative answer, as 0. Thus, for example, the profile 1010 represents a person who watched TV at 7 PM and at 9 PM but not at 8 PM and at 10 PM. Suppose that out of the 16 combinatorially possible profiles, only the following eleven profiles were observed empirically: 0000, 1000, 0100, 0010, 0001, 1100, 0110, 0011, 1110, 0111, 1111. Figure 3 is an order-preserving mapping of these profiles into a 2-dimensional coordinate space.
|
||||
41
data/en.wikipedia.org/wiki/Facet_theory-5.md
Normal file
@ -0,0 +1,41 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 6/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Given this POSAC solution, an attempt is made to interpret the two coordinates, X1 and X2, as two fundamental scales of the investigated phenomenon of evening TV watching by the investigated population. This is done by, first, interpreting the intervals (equivalence classes) within each coordinate, and then trying to conceptualize the derived meanings of the ordered intervals, in terms of a meaningful notion that may be attributed to the coordinate.
|
||||
In the present simplified example, this is easy: Inspecting the map, we attempt to identify the feature that distinguishes all profiles with a given score in X1. Thus, we find that profiles with X1 = 4, and only they, represent TV watching in the fourth hour. Profiles with X1 = 3 all have 1 in the third watching hour but 0 at the fourth hour, i.e., the third hour is the latest watching hour. X1 = 2 is assigned to, and only to, profiles whose latest watching hour is the second hour. And, finally, X1 = 1 is for the profile 1000, which represents the fact that the first hour is the only, and therefore the latest, watching hour (ignoring the profile 0000 of those who didn't watch TV at the specified hours, which could be assigned (0,0) in this coordinate-space). Hence, it may be concluded that intervals of coordinate X1 represent j, the latest hour (among the four hours observed) in which TV was watched (j = 1, ..., 4). Similarly, it is found that intervals of coordinate X2 represent 5 − k, where k (k = 1, ..., 4) is the earliest hour of TV watching.
|
||||
Indeed, for profiles of the observed set, each of which represents a single sequence of continuous TV watching, specification of the earliest and latest watching hours provides a full description of the watching hours.
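The interpretation above can be verified mechanically. The following sketch assigns each observed profile the coordinates X1 = latest watching hour and X2 = 5 − earliest watching hour (with 0000 placed at (0,0), as the text suggests), then checks that the assignment preserves every order relation, including incomparability. The helper names are illustrative; this is a check of the stated coordinates, not the POSAC algorithm.

```python
from itertools import combinations

# The eleven observed profiles from Example 3.
observed = ["0000", "1000", "0100", "0010", "0001", "1100",
            "0110", "0011", "1110", "0111", "1111"]

def greater(a, b):
    """True if tuple a is greater than b: a >= b everywhere and a != b."""
    return a != b and all(x >= y for x, y in zip(a, b))

def scores(profile):
    """Turn a profile string such as '0110' into a tuple of integers."""
    return tuple(int(ch) for ch in profile)

def coordinates(profile):
    """(X1, X2) = (latest hour watched, 5 - earliest hour watched)."""
    hours = [i + 1 for i, ch in enumerate(profile) if ch == "1"]
    if not hours:
        return (0, 0)  # the non-watchers' profile 0000
    return (max(hours), 5 - min(hours))

def order_preserved(profiles):
    """True if 'greater than' and incomparability among the profiles
    coincide exactly with those among their 2-d coordinates."""
    for p, q in combinations(profiles, 2):
        xp, xq = coordinates(p), coordinates(q)
        if greater(scores(p), scores(q)) != greater(xp, xq):
            return False
        if greater(scores(q), scores(p)) != greater(xq, xp):
            return False
    return True
```

For the eleven observed profiles `order_preserved(observed)` holds, and the coordinates are all distinct. Adding an unobserved profile with interrupted watching, such as 1010, breaks the property, because it receives the same coordinates as 1110; this is why the two scales suffice only for this particular set of observed profiles.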
|
||||
Example 3 illustrates key features of Multiple Scaling by POSAC that render this procedure a theory-based multivariate measurement:
|
||||
|
||||
The two scores assigned by Multiple Scaling to every observed profile—and hence to every person in the observed sample—replace the more numerous scores (four, in the present example) of the observed variables, while retaining all observed order relations, including incomparability. The new scores assess observed persons on the two coordinate-scales, taken to constitute Nature's fundamental variables.
|
||||
The two coordinate-scales have intrinsic meanings that probe into a deeper significance than the observed variables considered severally. In the present example, the earliest and the latest hour indeed exhaust the essential aspects of the pattern of TV watching, given the particular set of observed profiles.
|
||||
The concepts derived for the fundamental, unobserved coordinate-scales retain the CMR, the essential meaning common to all observed variables. In the present example, the CMR is more (vs. less) TV watching: each of the observed variables records high (1) vs. low (0) TV watching in a given hour, and the derived coordinate-scales, too, record high (4) vs. low (1) TV watching, since, ceteris paribus, the later the latest watching hour, the more TV one watches (X1); and the earlier the earliest watching hour, the more TV one watches (X2).
|
||||
These features are present also in applications that are less obvious, to produce scales with novel meanings.
|
||||
|
||||
=== Example 4. Measuring distributive justice attitudes ===
|
||||
In the systemic theory of distributive justice (DJ), alternative allocations of a given amount of an educational resource (100 supplementary teaching hours) between gifted and disadvantaged pupils, may be classified by one of four types, the preference for each reflecting one's DJ attitude:
|
||||
Equality, where the gifted and the disadvantaged pupils get the same amount of the supplementary resource;
|
||||
Fairness, where the disadvantaged pupils get more of the resource than the gifted, in proportion to their weakness relative to the gifted;
|
||||
Utility, where the gifted get more of the resource than the disadvantaged pupils (so as to promote future contribution to the general good);
|
||||
Corrective Action, where the disadvantaged pupils get more of the resource than the gifted over and above the proportion of their weakness relative to the gifted pupils, (so as to compensate them for past accumulated disadvantage);
|
||||
Following the Faceted SSA validation of the four DJ modes of Equality, Fairness, Utility, and Corrective Action, profiles based on eight dichotomized DJ attitude variables observed on a sample of 191 respondents were created. 35 of the 256 combinatorially possible profiles were observed and analyzed by POSAC to obtain the measurement space shown in Figure 4. For each of the variables, an optimal partition-line was computed that separates a high from a low score in that variable. (Logically, partition-lines must look like non-increasing step functions.) Then, for each of the four attitude types, the characteristic partition-line was identified as follows:
|
||||
|
||||
Fairness—a straight vertical line;
|
||||
Utility—a straight horizontal line;
|
||||
Equality—an L-shaped line;
|
||||
Corrective action—an inverted-L-shaped line
|
||||
The content significance of the intervals induced by these partition-lines on the X coordinate and on the Y coordinate of the POSAC space, are now identified and thereby define the contents of the X and Y Coordinate Scales of DJ attitudes.
|
||||
The X-coordinate Scale, interpreted as Enhanced Fairness Attitude Scale:
|
||||
|
||||
Interval 1. Low Fairness & Low Equality DJ Attitude
|
||||
Interval 2. Low Fairness & High Equality DJ Attitude
|
||||
Interval 3. High Fairness & Low Corrective Action DJ Attitude
|
||||
Interval 4. High Fairness & High Corrective Action DJ Attitude
|
||||
That is, Enhanced Fairness Attitude, even if low (intervals 1 and 2), is somewhat present when Equality is favored (interval 2). And if Enhanced Fairness Attitude is high (intervals 3 and 4), it reaches the extreme level (interval 4) when Corrective Action is favored.
|
||||
The Y-coordinate Scale, interpreted as Enhanced Utility Attitude Scale:
|
||||
21
data/en.wikipedia.org/wiki/Facet_theory-6.md
Normal file
@ -0,0 +1,21 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 7/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Interval 1. Low Utility & Low Equality DJ Attitude
|
||||
Interval 2. Low Utility & High Equality DJ Attitude
|
||||
Interval 3. High Utility & Low Corrective Action DJ Attitude
|
||||
Interval 4. High Utility & High Corrective Action DJ Attitude
|
||||
That is, Enhanced Utility Attitude, even if low (intervals 1 and 2), is somewhat present when Equality is favored (interval 2). If Enhanced Utility Attitude is high (intervals 3 and 4), it reaches the extreme level (interval 4) when Corrective Action is favored. (This may well reflect the sentiment that, in the long run, the advancement of disadvantaged pupils serves the common good.)
|
||||
The meanings of the fundamental variables, X and Y, while relying on the concepts of fairness and of utility, respectively, suggest new notions that modify them. The new notions were christened Enhanced (or Extended) Fairness and Enhanced (or Extended) Utility.
|
||||
|
||||
=== Complementary topics in partial order spaces ===
|
||||
Higher order partition lines. The above simple measurement space illustrates partition-lines that are straight or have one bend. More complex measurement spaces result with items whose partition-lines have two or more bends.
|
||||
While partial order spaces are used mainly for analyzing score profiles (based on range facets), under certain conditions, they may be applied to the analysis of content profiles; i.e., those based on content facets.
|
||||
Relating POSAC Measurement Space to the SSA Concept Space. Based on the same data matrix, POSAC measurement space and Faceted SSA concept space are mathematically related. Proved relationships rely on the introduction of a new kind of coefficient, E*, the coefficient of structural similarity. While E* assesses pairwise similarity between variables, it does depend on variations in the remaining n-2 variables processed. That is, in the spirit of Facet Theory, E* depends on the sampled contents as well as on the sampled population. The LSA1 procedure, within the 2-dimensional POSAC/LSA program, is a special version of SSA with E* as the similarity coefficient and with lattice ("city block") distance as the distance function. Under specified conditions, LSA1 may be readily derived from the boundary scales of the POSAC configuration, thereby highlighting concept/measurement space duality.
|
||||
30
data/en.wikipedia.org/wiki/Facet_theory-7.md
Normal file
@ -0,0 +1,30 @@
|
||||
---
|
||||
title: "Facet theory"
|
||||
chunk: 8/8
|
||||
source: "https://en.wikipedia.org/wiki/Facet_theory"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:10.375224+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
== Facet theory: comparisons and comments ==
|
||||
Concerned with the entire cycle of multivariate research – concept definition, observational design, and data analysis for concept-structure and measurement, Facet Theory constitutes a novel paradigm for the behavioral sciences. Hence, only limited aspects of it can be compared with specific statistical methods.
|
||||
A distinctive feature of Facet Theory is its explicit concern with the entire set of variables included in the investigated content-universe, regarding the subset of observed variables as but a sample from which inferences can be made. Hence, clusters of variables, if observed, are of no significance. They are simply unimportant artifacts of the procedure for sampling of the variables. This is in contrast with cluster analysis or factor analysis where recorded clustering patterns determine research results and interpretations. There have been various attempts to describe technical differences between Factor Analysis and Facet Theory. Briefly, it may be said that while Factor Analysis aims to structure the set of variables selected for observation, Facet Theory aims to structure the entire content universe of all variables, observed as well as unobserved, relying on the continuity principle and using regional hypotheses as an inferential procedure.
|
||||
Guttman's SSA, as well as Multidimensional Scaling (MDS) in general, were often described as a procedure for visualizing similarities (e.g., correlations) between analyzed units (e.g., variables) in which the researcher has specific interest. (See, for example, Wikipedia, October 2020: "Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset"). Modern Facet Theory, however, concerned with theory construction in the behavioral sciences, assigns SSA/MDS space a different role. Regarding the analyzed units as a sample of statistical units representing all units that pertain to the content-universe, their dispersion in the SSA/MDS space is used to infer the structure of the content universe. Namely, to infer space partitionings that define components of the content-universes and their spatial interrelationships. The inferred structure, if replicated, may suggest a theory in the investigated domain and provide a basis for theory-based measurements.
|
||||
Misgivings and responses
|
||||
One reservation that has been voiced concerns the usefulness of a successful SSA map (one whose partition-pattern matches a content-classification of the mapped variables). What are the consequences of an SSA map? Does such a map qualify as a theory?
|
||||
In response, it may be pointed out that (a) consistently replicated empirical partition-patterns in a domain of research constitute a scientific lawfulness which, as such, is of interest to Science; (b) often a partition-pattern leads to insights that explain behavior and may have potential applications (for example, the Radex Theory of Intelligence implies that inferential abilities are less differentiated by kinds of material than memory, or rule-recall; see Example 1 above); and (c) Faceted SSA is a useful preliminary procedure for performing meaningful, non-arbitrary measurements by Multiple Scaling (POSAC); see Example 4.
|
||||
A common doubt about SSA was voiced by a sympathetic but mystified user of SSA: "Smallest Space Analysis seems to come up with provocative pictures that an imaginative observer can usually make some sense of –– in fact, I have often referred to SSA as the sociologist's Rorschach test for imagination". Indeed, missing in Facet Theory are statistical significance tests that would indicate the stability of discovered or hypothesized partition patterns across population samples. For example, it is not clear how to compute the probability of obtaining a hypothesized partition pattern, assuming that in fact the variables are randomly dispersed over the SSA map.
|
||||
In response, facet theorists claim that in Facet Theory the stability of research results is established by replications, as is the common practice in the natural sciences. Thus, if the same partition-pattern is observed across many population samples (and if no unexplained counterexamples are recorded), confidence in the research outcome would increase. Moreover, Facet Theory adds a stringent requirement for establishing scientific lawfulness, namely that the hypothesized partition-pattern would hold also across different selections of variables, sampled from the same mapping sentence.
|
||||
Facet Theory is regarded as a promising metatheory for the behavioral sciences by Clyde Coombs, an eminent psychometrician and pioneer of mathematical psychology, who commented: “It is not uncommon for a behavioral theory to be somewhat ambiguous about its domain. The result is that an experiment usually can be performed which will support it and another experiment will disconfirm it. … The problem of how to define the boundaries of a domain, especially in social and behavioral science, is subtle and complex. Guttman’s facet theory (see Shye, 1978) is, I believe, the only substantial attempt to provide a general theory for characterizing domains; in this sense, it is a metatheory. As behavioral science advances so will the need for such theory.”
|
||||
|
||||
== References ==
|
||||
|
||||
== Further reading ==
|
||||
Guttman, R. & Greenbaum, C. W. (1998). "Facet Theory: Its Development and Current Status." European Psychologist, Vol. 3, No. 1, March 1998, pp. 13–36.
|
||||
Levy, S. (Ed.) (1994). Louis Guttman on Theory and Methodology: Selected Writings. Aldershot: Dartmouth.
|
||||
Canter, D. (Ed.) (1985). Facet Theory: Approaches to Social Research. New York: Springer.
|
||||
Guttman, R. (1994). Radex Theory. In Robert J. Sternberg (Ed.), Encyclopedia of Human Intelligence. New York, NY: Macmillan Publishing, 907–912.
|
||||
Hackett, P.M.W. (2021). Facet Theory and the Mapping Sentence: Evolving Philosophy, Use and Declarative Applications (second, revised and enlarged edition). Basingstoke: Palgrave Macmillan.
|
||||
91
data/en.wikipedia.org/wiki/Feature_engineering-0.md
Normal file
@ -0,0 +1,91 @@
|
||||
---
|
||||
title: "Feature engineering"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Feature_engineering"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:11.527727+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.
|
||||
Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics.
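As a small illustration of a dimensionless number acting as an engineered feature, the sketch below combines four raw measurements into the Reynolds number, Re = ρvL/μ. The function and parameter names are illustrative, and the sample values (roughly water flowing at 1 m/s past a 0.1 m length scale) are assumptions chosen only to show the calculation.

```python
def reynolds_number(density, velocity, length, dynamic_viscosity):
    """Reynolds number Re = rho * v * L / mu (dimensionless).

    density            rho, in kg/m^3
    velocity           v, in m/s
    length             characteristic length L, in m
    dynamic_viscosity  mu, in Pa*s
    """
    if dynamic_viscosity <= 0:
        raise ValueError("dynamic viscosity must be positive")
    return density * velocity * length / dynamic_viscosity

# Four raw columns collapse into a single derived feature (about 1e5 here).
re_water = reynolds_number(density=1000.0, velocity=1.0,
                           length=0.1, dynamic_viscosity=0.001)
```

A model given this single ratio does not have to rediscover how the four raw quantities interact, which is the sense in which such constructions resemble feature engineering.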
|
||||
|
||||
|
||||
== Clustering ==
|
||||
One application of feature engineering is the clustering of feature-objects or sample-objects in a dataset. In particular, feature engineering based on matrix decomposition has been used extensively for data clustering under non-negativity constraints on the feature coefficients. These methods include Non-Negative Matrix Factorization (NMF), Non-Negative Matrix-Tri Factorization (NMTF), Non-Negative Tensor Decomposition/Factorization (NTF/NTD), etc. The non-negativity constraints on the coefficients of the feature vectors mined by these algorithms yield a part-based representation, and different factor matrices exhibit natural clustering properties. Several extensions of these feature engineering methods have been reported in the literature, including orthogonality-constrained factorization for hard clustering, and manifold learning to overcome inherent issues with these algorithms.
|
||||
Other classes of feature engineering algorithms include leveraging a common hidden structure across multiple inter-related datasets to obtain a consensus (common) clustering scheme. An example is Multi-view Classification based on Consensus Matrix Decomposition (MCMD), which mines a common clustering scheme across multiple datasets. MCMD is designed to output two types of class labels (scale-variant and scale-invariant clustering), and:
|
||||
|
||||
is computationally robust to missing information,
|
||||
can obtain shape- and scale-based outliers,
|
||||
and can handle high-dimensional data effectively.
|
||||
Coupled matrix and tensor decompositions are popular in multi-view feature engineering.
|
||||
|
||||
|
||||
== Predictive modelling ==
|
||||
Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key components include feature creation from existing data, transforming and imputing missing or invalid features, reducing data dimensionality through methods like Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA), and selecting the most relevant features for model training based on importance scores and correlation matrices.
|
||||
Features vary in significance. Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).
|
||||
Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization. Common causes include:
|
||||
|
||||
Feature templates - implementing feature templates instead of coding new features
|
||||
Feature combinations - combinations that cannot be represented by a linear system
|
||||
Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.
|
||||
|
||||
|
||||
== Automation ==
|
||||
Automation of feature engineering is a research topic that dates back to the 1990s. Machine learning software that incorporates automated feature engineering has been commercially available since 2016. Related academic literature can be roughly separated into two types:
|
||||
|
||||
Multi-relational Decision Tree Learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
|
||||
Deep Feature Synthesis uses simpler methods.
|
||||
|
||||
|
||||
=== Multi-relational Decision Tree Learning (MRDTL) ===
|
||||
Multi-relational Decision Tree Learning (MRDTL) extends traditional decision tree methods to relational databases, handling complex data relationships across tables. It innovatively uses selection graphs as decision nodes, refined systematically until a specific termination criterion is reached.
|
||||
Most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation.
|
||||
|
||||
|
||||
=== Open-source implementations ===
|
||||
There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:
|
||||
|
||||
featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning.
|
||||
MCMD: An open-source feature engineering algorithm for joint clustering of multiple datasets.
|
||||
OneBM, or One-Button Machine, combines feature transformations on relational data with feature selection techniques. OneBM helps data scientists reduce data exploration time, allowing them to try many ideas by trial and error in a short time. It also enables non-experts who are not familiar with data science to quickly extract value from their data with little effort, time, and cost.
|
||||
getML community is an open source tool for automated feature engineering on time series and relational data. It is implemented in C/C++ with a Python interface. It has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.
|
||||
tsfresh is a Python library for feature extraction on time series data. It evaluates the quality of the features using hypothesis testing.
|
||||
tsflex is an open source Python library for extracting features from time series data. Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.
|
||||
seglearn is an extension for multivariate, sequential time series data to the scikit-learn Python library.
|
||||
tsfel is a Python package for feature extraction on time series data.
|
||||
kats is a Python toolkit for analyzing time series data.
|
||||
|
||||
|
||||
=== Deep feature synthesis ===
|
||||
The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.
|
||||
|
||||
|
||||
== Feature stores ==
|
||||
The feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where one can create or update groups of features derived from multiple data sources, or create and update new datasets from those feature groups, either for training models or for use in applications that do not compute the features themselves but retrieve them when needed to make predictions.
|
||||
A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.
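The three capabilities named above (storing feature code, applying it to raw data, and serving the results) plus versioning can be sketched with a small in-memory class. This is a toy model for illustration only; the class name `FeatureStore`, its methods, and the example feature are all hypothetical and do not correspond to any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

FeatureKey = Tuple[str, int]  # (feature name, version)

@dataclass
class FeatureStore:
    # Stored code: (name, version) -> function computing the feature from a raw record.
    definitions: Dict[FeatureKey, Callable[[dict], float]] = field(default_factory=dict)
    # Stored values: (name, version) -> {entity id: feature value}.
    values: Dict[FeatureKey, Dict[str, float]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[dict], float]) -> int:
        """Store feature code under the next free version number."""
        version = 1 + max((v for n, v in self.definitions if n == name), default=0)
        self.definitions[(name, version)] = fn
        return version

    def materialize(self, name: str, version: int, raw: Dict[str, dict]) -> None:
        """Apply the stored code to raw data and keep the results."""
        fn = self.definitions[(name, version)]
        self.values[(name, version)] = {eid: fn(rec) for eid, rec in raw.items()}

    def serve(self, entity_id: str, features: List[FeatureKey]) -> List[float]:
        """Return precomputed feature values for one entity, in the requested order."""
        return [self.values[key][entity_id] for key in features]

# Hypothetical raw data keyed by customer id.
raw = {"u1": {"orders": 4, "spend": 100.0}, "u2": {"orders": 1, "spend": 30.0}}
store = FeatureStore()
v1 = store.register("avg_order_value", lambda r: r["spend"] / r["orders"])
store.materialize("avg_order_value", v1, raw)
vector = store.serve("u1", [("avg_order_value", v1)])
```

Keeping the version in the lookup key means a model trained on version 1 of a feature keeps receiving version 1 at prediction time, even after a newer definition is registered.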
|
||||
Feature stores can be standalone software tools or built into machine learning platforms.
|
||||
|
||||
|
||||
== Alternatives ==
|
||||
Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error. Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering. However, deep learning algorithms still require careful preprocessing and cleaning of the input data. In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.
|
||||
|
||||
|
||||
== See also ==
|
||||
Covariate
|
||||
Data transformation
|
||||
Feature extraction
|
||||
Feature learning
|
||||
Hashing trick
|
||||
Instrumental variables estimation
|
||||
Kernel method
|
||||
List of datasets for machine learning research
|
||||
Scale co-occurrence matrix
|
||||
Space mapping
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== Further reading ==
|
||||
40
data/en.wikipedia.org/wiki/Funding_by_lottery-0.md
Normal file
@ -0,0 +1,40 @@
|
||||
---
|
||||
title: "Funding by lottery"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Funding_by_lottery"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:55.911338+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Funding by lottery is the use of randomization in the allocation of science funding by research funding organizations. As of early 2025, about a dozen funders worldwide have implemented some form of funding by lottery in their funding calls.
|
||||
There is no scholarly consensus on the benefits and drawbacks of funding by lottery. Proponents argue that it can help mitigate biases in funding allocation and minimize the high costs associated with grant writing and peer review. Critics, however, express concerns that diminishing the role of peer review panels in funding decisions could lead to a decline in the quality of funded projects and undermine public trust in funders, in scientific peer review, and ultimately in science at large.
|
||||
|
||||
== Overview ==
|
||||
Science funding is generally distributed through competitive funding calls organized by research funding agencies (commonly referred to as "funders"). Eligible researchers submit research grant proposals in response to these calls, which are then evaluated by expert peer review panels. Funding is typically awarded to the proposals that receive the most positive evaluations from these panels.
|
||||
Compared to this traditional, peer-reviewed science funding, funding by lottery fully or partly replaces peer review panel evaluations with random selection. In a full lottery, all eligible research grant proposals have an equal chance of winning the grant, with no input from a peer review panel. In a partial lottery, only eligible proposals that are positively evaluated by the peer review panel are entered into the lottery and have a chance to win the grant.
|
||||
While some researchers advocate for the funding of science by means of full lotteries, at least for specific funding calls, all known examples of funding lotteries implemented by funders are partial lotteries.
|
||||
|
||||
== History ==
|
||||
The selection mechanism behind funding by lottery has historical roots in the scrutiny and lot, a sortition system established in 14th-century Florence. Similar to contemporary funding by lottery, candidates in scrutiny and lot were first screened for eligibility, and then eligible individuals were randomly selected for political office. The motivation for using randomization is also comparable: in peer review, randomization is proposed as a way to reduce biases and prevent the concentration of funding in a few prestigious labs or research institutions. In Florentine sortition, randomization was designed to curb the concentration of political power among affluent families.
|
||||
In the context of science funding, funding by lottery was first proposed by Daniel S. Greenberg in 1998. Greenberg argued that winning a grant in a competitive funding call is largely a matter of luck – an intuition that was later corroborated by empirical research, showing that peer review panels often struggle to distinguish among proposals around the funding line. Because the evaluations by peer review panels are so unreliable, Greenberg argued, a lottery would be an equally valid but more cost-effective method to allocate funding.
|
||||
The idea remained theoretical until the 2010s, when a few funders began piloting the implementation of partial lotteries. Early adopters included the Health Research Council of New Zealand, the New Zealand Science for Technological Innovation, and the Volkswagen Foundation. These early experiments with partial lotteries were later followed by other Western funders, particularly for calls aimed at supporting transformative, innovative, blue skies, or high-risk, high-reward research.
|
||||
|
||||
== Types ==
|
||||
Funding by lottery can be implemented in different ways. In practice, all funding lotteries used or in use by funders are partial lotteries, meaning that a peer review panel is in charge of evaluating proposals for merit and suitability, and randomization is introduced to assist in the final funding decision among positively evaluated proposals. Types of partial lotteries currently in use are: tie-breaking lotteries, partial lotteries "with bypass", and fundable-pool partial lotteries.
|
||||
|
||||
=== Tie-breaking partial lotteries ===
|
||||
In tie-breaking partial lotteries, research grant proposals are first ranked by a peer review panel, and funding is awarded to the best-rated proposals. However, when multiple proposals receive the same evaluation at the funding threshold – meaning that the panel cannot distinguish among them – random selection is used to break the tie.
|
||||
Funders using tie-breaking partial lotteries in all or some of their calls include the Swiss National Science Foundation, Natural Environment Research Council (UK), Research Council of Norway, and Science Foundation Ireland.
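The tie-breaking mechanism described in this subsection can be sketched in a few lines. The sketch assumes that each proposal has a single numeric panel score and that a fixed number of grants is available; the function name, the scores, and the seeded random generator are illustrative assumptions, not a description of any funder's actual procedure.

```python
import random

def tie_breaking_lottery(scores, n_grants, rng):
    """scores: dict mapping proposal id -> panel score (higher is better)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if n_grants <= 0:
        return set()
    if n_grants >= len(ranked):
        return set(ranked)
    # Score of the last proposal that would still be funded by rank alone.
    threshold = scores[ranked[n_grants - 1]]
    # Proposals strictly above the threshold are funded on merit.
    funded = {p for p in ranked if scores[p] > threshold}
    # Proposals exactly at the threshold are indistinguishable to the panel,
    # so the remaining grants are drawn at random among them.
    tied = sorted(p for p in ranked if scores[p] == threshold)
    funded.update(rng.sample(tied, n_grants - len(funded)))
    return funded

# Hypothetical call: three grants, with C, D and E tied at the funding line.
scores = {"A": 9, "B": 8, "C": 7, "D": 7, "E": 7, "F": 5}
funded = tie_breaking_lottery(scores, 3, random.Random(42))
```

In this example A and B are always funded, F never is, and chance decides only which one of C, D and E receives the third grant.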
|
||||
|
||||
=== Partial lotteries "with bypass" ===
|
||||
Partial lotteries "with bypass" rely on peer review panels to evaluate research grant proposals and categorize them as not fundable, fundable or outstanding. Not fundable proposals are declined. Fundable proposals enter the lottery pool. Outstanding proposals bypass the lottery and receive direct funding.
|
||||
Funders using this implementation include the Volkswagen Foundation, Austrian Science Fund, and Novo Nordisk Foundation.
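A minimal sketch of the "with bypass" selection follows. It assumes the panel assigns each proposal one of three labels and that the budget covers at least all outstanding proposals (the text does not say what happens otherwise); the function name and example ratings are hypothetical.

```python
import random

def partial_lottery_with_bypass(ratings, n_grants, rng):
    """ratings: dict mapping proposal id -> 'outstanding', 'fundable' or 'not fundable'."""
    # Outstanding proposals bypass the lottery and are funded directly.
    outstanding = sorted(p for p, r in ratings.items() if r == "outstanding")
    # Fundable proposals form the lottery pool; 'not fundable' ones are declined.
    pool = sorted(p for p, r in ratings.items() if r == "fundable")
    # Assumption: grants left after the bypass are drawn uniformly from the pool.
    remaining = max(0, n_grants - len(outstanding))
    drawn = rng.sample(pool, min(remaining, len(pool)))
    return set(outstanding) | set(drawn)

# Hypothetical panel outcome with three grants available.
ratings = {
    "A": "outstanding",
    "B": "fundable",
    "C": "fundable",
    "D": "fundable",
    "E": "not fundable",
}
funded = partial_lottery_with_bypass(ratings, 3, random.Random(7))
```

If no proposal is rated outstanding, the same function reduces to a lottery over a single pool of fundable proposals.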
|
||||
|
||||
=== Fundable-pool partial lotteries ===
|
||||
In fundable-pool partial lotteries, peer review panels evaluate research grant proposals and categorize them as either fundable or not fundable. Proposals deemed not fundable are declined, whereas those classified as fundable enter a lottery pool for random selection.
|
||||
Adopters of fundable-pool partial lotteries include the Health Research Council of New Zealand, New Zealand Science for Technological Innovation, Canadian New Frontiers in Research Fund, and The British Academy.
|
||||
|
||||
== Scholarly debate ==
|
||||
Surveys conducted in New Zealand and Germany suggest that researchers have mixed opinions on funding by lottery. Respondents seem to be more supportive of partial lotteries than full lotteries. In peer-reviewed articles and opinion pieces, scholars have outlined key arguments in favor and against the use of funding by lottery.
|
||||
26
data/en.wikipedia.org/wiki/Funding_by_lottery-1.md
Normal file
@ -0,0 +1,26 @@
|
||||
---
|
||||
title: "Funding by lottery"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Funding_by_lottery"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:55.911338+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Arguments in favor ===
|
||||
The main argument in favor of funding by lottery is that randomization can help reduce the costs of grant writing and peer review while promoting fairer and more diverse funding outcomes. More specifically, funding by lottery may curb or eliminate biases in research evaluation, such as reviewer conservatism – i.e. the tendency to favor conventional ideas over more innovative or riskier ones – as well as bias against interdisciplinary research.
|
||||
Furthermore, funding by lottery is seen as a way to counteract the growing concentration of research funding in the hands of a few renowned labs and research-performing institutions, a phenomenon known as the "Matthew effect" in science funding.
|
||||
Lastly, some have argued that funding by lottery could remove some incentives for research misconduct. Because grant acquisition is considered a marker of academic success, researchers may feel pressured to obtain as much funding as possible. In extreme cases, this can lead to unethical practices, like submitting virtually identical research grant proposals to multiple funding calls – a problem known as "double-dipping". By decoupling grant acquisition from academic prestige, funding by lottery could help mitigate such incentives for misconduct.
|
||||
|
||||
=== Arguments against ===
|
||||
Some scholars highlight the lack of empirical evidence supporting the claimed benefits of funding by lottery, particularly its potential to reduce costs and bias. Others argue that less drastic reforms to peer review could address its shortcomings more effectively.
|
||||
Furthermore, critics of funding by lottery identify additional potential drawbacks. First, eliminating peer review panels, or reducing their role, would deprive applicants of valuable feedback that helps improve and refine their ideas and study designs. Second, randomization could weaken quality-based selection, creating incentives to submit lower-quality proposals, and ultimately leading to a decline in quality standards.
|
||||
In addition, adopting funding by lottery could undermine the legitimacy of funders and their peer review panels, potentially damaging public trust in peer review and the scientific enterprise as a whole.
|
||||
|
||||
== See also ==
|
||||
Funding of science
|
||||
Peer review
|
||||
Science policy
|
||||
|
||||
== References ==
|
||||
72
data/en.wikipedia.org/wiki/Geographic_analytics-0.md
Normal file
@ -0,0 +1,72 @@
|
||||
---
|
||||
title: "Geographic analytics"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Geographic_analytics"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:12.686727+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Geographic analytics is an analytical approach to strategic management and data analytics to make geographic decisions efficiently. Examples of such decisions are choosing the location for a warehouse or planning the regions for a marketing campaign. Data, information and framing conditions are visualized on maps to derive recommendations for action.
|
||||
In comparison to geographic information systems (GIS), which primarily aim at the representation of information on maps (descriptive analytics), geographic analytics additionally focuses on making business decisions based on the data visualization on the map (prescriptive analytics).
|
||||
|
||||
|
||||
== Background ==
|
||||
Purely mathematical approaches used in data analytics to support management decisions often have the disadvantage that they require a large amount of data. This often results in considerable effort for gathering, cleaning and understanding the data. Furthermore, there can be “intangible” framing conditions that are disruptive to any purely data-driven optimization solution.
|
||||
Example from logistics: to find the optimal location for a warehouse in a logistics network, so-called center of gravity models are used. To minimize cost, these models use transport volume data, customer locations, cost data, etc. to determine the optimal location. However, there are often framing conditions (for example, traffic infrastructure, borders, and regulatory or even physical hurdles) that are difficult to describe and model mathematically.
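In its simplest form, a center of gravity model places the warehouse at the demand-weighted average of the customer coordinates, which minimizes the weighted sum of squared straight-line distances. The sketch below assumes planar coordinates and uses transport volume as the only weight; the function name and the customer data are invented for illustration, and real models typically also account for cost rates and road distances.

```python
def center_of_gravity(customers):
    """customers: list of (x, y, weight) tuples, weight e.g. yearly transport volume."""
    total = sum(w for _, _, w in customers)
    x = sum(cx * w for cx, _, w in customers) / total
    y = sum(cy * w for _, cy, w in customers) / total
    return x, y

# Hypothetical customers on a planar grid (units are arbitrary).
customers = [(0.0, 0.0, 10.0), (10.0, 0.0, 30.0), (0.0, 10.0, 10.0)]
location = center_of_gravity(customers)  # (6.0, 2.0)
```

The result is pulled toward the customer at (10, 0) because that customer carries the largest volume. Nothing in the calculation knows whether the resulting point lies on a lake or across a border, which is the kind of framing condition the text refers to.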
|
||||
In practice, such framing conditions are often only recognized at the end of the data-driven analysis. The framing conditions then have to be included in the model, which has to be recalculated. In the worst case, the problem has to be approached again from scratch. This delays the decision-making process and often entails significant additional effort.
|
||||
|
||||
|
||||
== Application and objective ==
|
||||
Geographic analytics starts with the visualization of basic data on a map. By involving experts from the field, the visualization is then used to determine framing conditions and focal points of the business problem.
|
||||
As a result, the solution space, i.e. the number of possible solutions of the data analysis, is reduced. In addition, framing conditions as well as data errors are recognized at this early stage of the analysis. Only then do traditional data analytics methods come into play to find the optimal solution to the problem.
|
||||
With this approach, less data is required for the overall analysis, and the time and effort for the analysis are significantly reduced. Impracticable and flawed solutions are identified and excluded upfront.
|
||||
|
||||
|
||||
== Areas of application ==
|
||||
Geographic analytics is used in connection with data analyses to support management decisions that contain a geographical component, such as location decisions, marketing campaigns, service center placements, etc.
|
||||
|
||||
|
||||
=== Example Industries ===
|
||||
Logistics / Supply chain management
|
||||
Example: Planning the location of a distribution center to minimize logistics costs when distributing products to customers.
|
||||
Retail trade
|
||||
Example: Opening of a new shop at a strategically favorable location for commuters
|
||||
Oil and gas Industry
|
||||
Example: Planning the locations of test wells in order to develop a new oil field at the lowest possible cost.
|
||||
|
||||
|
||||
=== Example business areas ===
|
||||
Marketing
|
||||
Example: Development of a regional marketing campaign for a new product
|
||||
Infrastructure
|
||||
Example: Traffic planning / planning of traffic expansions in a large city to reduce congestion times
|
||||
Services
|
||||
Example: Planning the placement of ATMs in order to achieve the highest possible degree of coverage in a region
|
||||
Agriculture
|
||||
Banking
|
||||
Community Development & Planning
|
||||
Disaster Management
|
||||
Mapping & Navigation
|
||||
Military
|
||||
Public Health
|
||||
Risk Assessment
|
||||
Telecommunications Services
|
||||
|
||||
|
||||
== Etymology ==
|
||||
The term and the methodology of geographic analytics were first described in 2013 by Jozo Acksteiner and Claudia Trautmann in the Supply Chain Management Review.
|
||||
|
||||
|
||||
== Source ==
|
||||
This article incorporates text from this source, which is in the public domain: https://mgiss.co.uk/geographic-information-system-different-applications-of-gis/
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
Claudia Trautmann, Jozo Acksteiner: Geographic Analytics. How HP Visualizes its Supply Chain. In: Supply Chain Management Review, January/February 2013 (Geographic Analytics, How HP Visualizes its Supply Chain)
|
||||
Travis Parker, Jozo Acksteiner: HP Embraces a New Level of Global Analytics. In: Supply Chain Brain, October, 2015 (HP Embraces a New Level of Global Analytics)
|
||||
Dan Gilmore: Supply Chain News. Trip Report. CSCMP Conference Part 2. In: SupplyChainDigest, October, 2015 (Supply Chain News. Trip Report)
|
||||
Jörn Kohlhammer et al. : Solving Problems with Visual Analytics In: Procedia Computer Science, December 2011 (Solving Problems with Visual Analytics)
|
||||
Osthessenzeitung: Logistics Day. Digitalization and Management. In: Osthessenzeitung, April 2018 (Logistics Day Digitalization and Management)
|
||||
34
data/en.wikipedia.org/wiki/Health_care_analytics-0.md
Normal file
@ -0,0 +1,34 @@
|
||||
---
|
||||
title: "Health care analytics"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Health_care_analytics"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:13.909352+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Health care analytics comprises the health care analysis activities that can be undertaken as a result of data collected from four areas within healthcare: (1) claims and cost data, (2) pharmaceutical and research and development (R&D) data, (3) clinical data (such as data collected from electronic health records (EHRs)), and (4) patient behaviors and preferences data (e.g. patient satisfaction or retail purchases, such as data captured in stores selling personal health products). Health care analytics is a growing industry in many countries including the United States, where it is expected to grow to more than $31 billion by 2022. It is also increasingly important to governments and public health agencies to support health policy and meet public expectations for transparency, as accelerated by the COVID-19 pandemic.
|
||||
Health care analytics allows for the examination of patterns in various healthcare data in order to determine how clinical care can be improved for patients and provider teams, while limiting excessive spending and improving the health of populations. Areas of the industry focus on clinical analysis, financial analysis, supply chain analysis, as well as marketing, fraud and HR analysis. There is increasing demand in many countries to incorporate social indicators of patients and providers within health care analytics, to inform improvements for health equity, such as in terms of addressing racism in healthcare or the health of Indigenous peoples.
|
||||
|
||||
== Balancing Interests in Healthcare Analytics: innovation, privacy, and patient safety ==
|
||||
Healthcare analytics requires access to comprehensive data, but its usefulness depends on balancing expansive data collection against the risks to patient rights, erroneous conclusions or statistical predictions, and misuse of results. Appropriate policies could support gains in process improvements, cost reductions, personalized medicine, and population health. Additionally, providing incentives to encourage appropriate use may address some concerns but could also inadvertently incentivize the misuse of data. Lastly, creating standards for IT infrastructure may encourage data sharing and use, but those standards would need to be reevaluated on a regular, ongoing basis as the fast pace of technological innovation causes standards and best practices to become quickly outdated.
|
||||
Several areas to improve healthcare analytics through national, regional and local collaborations and legislation have been identified.
|
||||
|
||||
=== Limiting data collection ===
|
||||
The needs of healthcare providers, government agencies, health plans, and researchers for quality data must be met to ensure adequate medical care and to make improvements to the healthcare system, while still ensuring patients' right to privacy. Data collection should be limited to what is necessary for medical care and, beyond that care, by patient preference. Such limits would protect patient privacy while minimizing infrastructure costs to house data. When possible, patients should be informed about what data is collected prior to engaging in medical services. For instance, in Canada, data collection among Indigenous populations is governed by principles of First Nations ownership, control, access, and possession.
|
||||
|
||||
=== Limiting data use ===
|
||||
Expanding availability of big data increases the risk of statistical errors, erroneous conclusions and predictions, and misuse of results. Evidence supports use of data for process improvements, cost reductions, personalized medicine, and public health. Innovative uses for individual health can harm underserved populations. In the United States, limiting use for denial and exclusion prevents data from being used to determine eligibility for benefits or care, and is harmonized with other federal laws, such as the Fair Credit Reporting Act, and with anti-discrimination laws like the Civil Rights Act and the Genetic Information Nondiscrimination Act.
|
||||
|
||||
=== Providing incentives to encourage appropriate use ===
|
||||
Increasing vertical integration in both public and private sector providers has created massive databases of electronic health records. In the United States, the ACA has provided Medicare and Medicaid incentives to providers to adopt EHRs. Large healthcare institutions also have internal motivation to apply healthcare analytics, largely for reducing costs by providing preventative care. Policy could increase data use by incentivizing insurers and providers to increase population tracking, which improves outcomes.
|
||||
|
||||
=== Creating standards for the IT infrastructure ===
|
||||
Inappropriate IT infrastructure likely limits healthcare analytics findings and their impact on clinical practice. Establishing standards ensures that IT infrastructure is capable of housing big data while addressing accessibility, ownership, and privacy. New possibilities could be explored, such as private clouds and “a virtual sandbox” consisting of filtered data authorized to the researchers accessing the sandbox. Standards make it easier for different medical and research organizations to coordinate and share information, which significantly improves patient care by improving communication between providers and reducing duplication and costs.
|
||||
Minimum standards are necessary to balance privacy and accessibility. Standardization helps improve patient care by facilitating research collaboration and easier communication between medical providers. The research can yield preventive care concepts that can reduce patient caseload and avoid long-term medical costs.
|
||||
|
||||
== Healthcare analytics in UAE ==
|
||||
The Dubai Pharmacy College (DPCG) is a pioneer in healthcare data analytics education in the GCC region. DPC offers a Post-graduate certificate course in "Healthcare business data analytics" for healthcare professionals to motivate the intuition to explore the concept of healthcare data analytics and apply innovations in healthcare computing technologies. The aim of the certification program is to provide a platform for interprofessional researchers to utilize the fundamental technology including software applications for intelligent data acquisition, processing, and analysis of healthcare data.
|
||||
|
||||
== Healthcare analytics in the United States ==
|
||||
50
data/en.wikipedia.org/wiki/Health_care_analytics-1.md
Normal file
@ -0,0 +1,50 @@
|
||||
---
|
||||
title: "Health care analytics"
|
||||
chunk: 2/2
|
||||
source: "https://en.wikipedia.org/wiki/Health_care_analytics"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:13.909352+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
=== Federal government role in health IT ===
|
||||
In the United States, multiple federal entities are heavily involved in health analytics infrastructure. Within the executive branch, the administration itself, Centers for Medicare and Medicaid Services (CMS), and Office of the National Coordinator for Health Information Technology (ONC) each have strategic plans and are involved in determining regulation. Within the legislative branch, multiple committees within the House of Representatives and Senate hold hearings and have opinions on using data and technology to reduce costs and improve outcomes in healthcare.
|
||||
The ONC issued the Federal Health IT Strategic Plan 2015-2020. The plan outlines the steps federal agencies will take to achieve widespread use of health information technology (health IT) and electronic health information to enhance the health IT infrastructure, to advance person-centered and self-managed health, to transform health care delivery and community health, and to foster research, scientific knowledge and innovation. The plan is intended “to provide clarity in federal policies, programs, and actions and includes strategies to align program requirements, harmonize and simplify regulations, and aims to help health IT users to advance the learning health system to achieve better health.”
|
||||
The Strategic Plan includes several key initiatives employing multiple strategies to meet its goals. These include: (1) finalizing and implementing an interoperability roadmap; (2) protecting the privacy and security of health information; (3) identifying, prioritizing and advancing technical standards; (4) increasing user and market confidence in the safety and safe use of health IT; (5) advancing a national communication infrastructure; and (6) collaborating among all stakeholders.
|
||||
|
||||
=== Challenges to address ===
|
||||
|
||||
==== Creating an interoperability roadmap ====
|
||||
Dr. Aryan Chavan identifies three challenges to be addressed: (1) variation in how standards are tested and implemented; (2) variation in how health IT stakeholders interpret and implement policies and legal requirements; and (3) reluctance of health IT stakeholders to share and collaborate in ways that might foster consumer engagement.
|
||||
The ONC is working to develop a policy advisory for health information exchange by 2017 that will define and outline basic expectations for trading partners around health information exchange, interoperability and the exchange of information. Current federal and state law only prohibits certain kinds of information blocking in limited and narrow circumstances, for example, under the Health Insurance Portability and Accountability Act (HIPAA) or the Anti-Kickback statute.
|
||||
|
||||
==== Protecting privacy and security ====
|
||||
In addition to HIPAA, many states have their own privacy laws protecting an individual’s health information. State laws that are contrary to HIPAA are generally preempted by the federal requirements unless a specific exception applies. For example, if the state law relates to identifiable health information and provides greater privacy protections, then it is not preempted by HIPAA. Since privacy laws may vary from state to state, this may create confusion among health IT stakeholders and make it difficult to ensure privacy compliance.
|
||||
|
||||
==== Establishing common technical standards ====
|
||||
Use of common technical standards is necessary to move electronic health information seamlessly and securely. While some clinical record content, such as laboratory results and clinical measurements, is easily standardized, other content, such as provider notes, may be more difficult to standardize. Methods need to be identified that allow for the standardization of provider notes and other traditionally “free form text” data.
|
||||
The ONC HIT Certification Program certifies that a system meets the technological capability, functionality and security requirements adopted by HHS. ONC will assess the program on an ongoing basis “to ensure it can address and reinforce health IT applications and requirements that support federal value-based and alternative payment models.”
|
||||
|
||||
==== Increasing confidence in safety and safe use of health IT ====
|
||||
Health care consumers, providers and organizations need to feel confident that the health IT products, systems or services they are using are not only secure, safe and useful but that they can switch between products, systems or services without loss of valuable information or undue financial burden. Implementation of the Federal Health IT Strategic Plan 2015-2020, along with the 2013 HHS Health IT Patient Safety Action and Surveillance Plan and 2012 Food and Drug Administration Safety and Innovation Act will attempt to address these concerns.
|
||||
|
||||
==== Developing national communications structure ====
|
||||
A national communications infrastructure is necessary to enable the sharing of electronic health information between stakeholders, including providers, individuals and national emergency first responders. It is also necessary for delivering telehealth services or using mobile health applications. “Expanded, secure, and affordable high-speed wireless and broadband services, choice, and spectrum availability will support electronic health information sharing and use, support the communication required for care delivery, and support the continuity of health care and public health services during disasters and public health emergencies.”
|
||||
|
||||
==== Stakeholder collaboration ====
|
||||
The federal government in its role as contributor, beneficiary and collaborator “aims to encourage private-sector innovators and entrepreneurs, as well as researchers, to use government and government-funded data to create useful applications, products, services, and features that help improve health and health care.” HHS receives funds from the Patient-Centered Outcomes Research Trust Fund to build data capacity for patient-centered outcomes research. It is estimated HHS will receive over $140 million for the period between 2011 and 2019. These funds will be used “to enable a comprehensive, interoperable, and sustainable data network infrastructure to collect, link, and analyze data from multiple sources to facilitate patient-centered outcomes research.”
|
||||
|
||||
==== Legislation ====
|
||||
Meaningful Use, the Patient Protection and Affordable Care Act (ACA) and the declining cost of data storage result in health data being stored, shared, and used by multiple providers, insurance companies, and research institutions. Concerns exist about how organizations gather, store, share, and use personal information, including privacy and confidentiality concerns, as well as concerns over the quality and accuracy of the data collected. Expansion of existing regulation can ensure patient privacy and guard patient safety to balance access to data and the ethical impact of exposing that data.
|
||||
|
||||
== See also ==
|
||||
health information management
|
||||
health informatics
|
||||
public health informatics
|
||||
human resources for health information systems (HRHIS)
|
||||
|
||||
== References ==
|
||||
|
||||
== Further reading ==
|
||||
Adam Tanner (2017). Our Bodies, Our Data: How Companies Make Billions Selling Our Medical Records. Beacon Press. ISBN 978-0807033340.
|
||||
70
data/en.wikipedia.org/wiki/Henry_Oldenburg-0.md
Normal file
@ -0,0 +1,70 @@
|
||||
---
|
||||
title: "Henry Oldenburg"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Henry_Oldenburg"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:53:01.743243+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Henry Oldenburg (also Henry Oldenbourg) (c. 1618 as Heinrich Oldenburg – 5 September 1677) was a German theologian, diplomat, and natural philosopher, known as one of the creators of modern scientific peer review. He was one of the foremost intelligencers of 17th-century Europe, with a network of correspondents to rival those of Fabri de Peiresc, Marin Mersenne, and Ismaël Boulliau. At the foundation of the Royal Society in London, he took on the task of foreign correspondence, as the first Secretary.
|
||||
|
||||
|
||||
== Early life ==
|
||||
Born in Bremen, Germany, he was trained in theology and received his degree from the local Gymnasium illustre on 2 November 1639. From early on he had a very firm grasp of German, Latin, and Greek. His movements during the 1640s are unclear, but he is thought to have worked as a tutor in England for much of the decade. In 1648 he left England and spent some time in Leiden and Utrecht in the Dutch Republic, where he became conversant in the Dutch language. After a short stay back in Bremen in the spring, he arrived back in London in July 1653 as a diplomatic envoy of the city of Bremen's senate to the Lord Protector Oliver Cromwell for matters concerning the ongoing First Anglo-Dutch War.
|
||||
Settling then in Interregnum England, he forged a strong relationship with his lifelong patron Robert Boyle, and with John Milton, who wrote of him approvingly that he had "learnt to speak our language more accurately and fluently than any other foreigner I have ever known" (Correspondence, 1.34). Oldenburg eventually became the tutor to Boyle's nephew, the politician Richard Jones, and travelled with him through France from 1657 to 1660. There Oldenburg also added French to his repertoire, the last European language in which he was completely conversant.
|
||||
Oldenburg married his second wife, Dora Katherina Dury (1654–77), the daughter of Dorothy and John Dury, in London on 13 August 1668. Either through Milton, whom he had met earlier in his diplomatic mission, or through Lady Ranelagh, sister to Boyle and the mother of Richard Jones, Oldenburg gained entry to an important intellectual circle, including his fellow German native, Samuel Hartlib, whose extensive web of correspondents Oldenburg was to take over, John Dury, who became his father-in-law, and others such as the economist William Petty. Among Oldenburg's correspondents at this time was Baruch Spinoza, to whom he was introduced on a trip to the Netherlands, and to whom he presented a volume of writings on scientific topics by Boyle.
|
||||
|
||||
|
||||
== Secretary of the Royal Society ==
|
||||
After the Restoration he became an early member (original fellow) of the Royal Society (founded in 1660), and served as its first secretary along with John Wilkins, maintaining an extensive network of scientific contacts through Europe. He also became the founding editor of the Philosophical Transactions of the Royal Society. Oldenburg began the practice of sending submitted manuscripts to experts who could judge their quality before publication. This was the beginning of both the modern scientific journal and the practice of peer review. Philosophical Transactions of the Royal Society continues today and is the longest-running scientific journal in the world.
|
||||
He was briefly imprisoned in the Tower of London as a suspected spy in 1667, during the Second Anglo-Dutch War. Oldenburg's correspondence was linked to support from the politician Sir Joseph Williamson; in part Oldenburg supplied Williamson with intelligence information.
|
||||
Oldenburg enjoyed good health in his lifetime, but he fell seriously ill on 3 September 1677, and he died two days thereafter at his Pall Mall, London home. He was interred on 7 September at St Mary the Virgin, Bexley. His widow died ten days later.
|
||||
|
||||
|
||||
== Foreign correspondents ==
|
||||
|
||||
|
||||
=== Denmark ===
|
||||
Rasmus Bartholin.
|
||||
|
||||
|
||||
=== Flanders ===
|
||||
René François Walter de Sluse.
|
||||
|
||||
|
||||
=== France ===
|
||||
Adrien Auzout, Henri Justel, Pierre Petit, Ismaël Bullialdus.
|
||||
|
||||
|
||||
=== Germany ===
|
||||
Sir William Curtius, Johann Hevelius, Gottfried Leibniz, Philipp Jacob Sachs von Lewenheimb, Johann Daniel Major, Ehrenfried Walther von Tschirnhaus, Martin Vogel.
|
||||
|
||||
|
||||
=== Italy ===
|
||||
Paolo Boccone, Giovanni Domenico Cassini, Marcello Malpighi, Manfredo Settala.
|
||||
|
||||
|
||||
=== Netherlands ===
|
||||
Reinier de Graaf, Christiaan Huygens and his father, Constantijn Huygens, Antoni van Leeuwenhoek, Willem Ten Rhijne, Benedictus/Baruch Spinoza, Peter Serrarius, Jan Swammerdam, Isaac Vossius.
|
||||
|
||||
|
||||
== See also ==
|
||||
Peer review
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== Further reading ==
|
||||
Jean-Pierre Vittu, "Henry Oldenburg 'Grand intermédiaire'", in "Les grands intermédiaires culturels de la République des Lettres", pub. by Christiane Berkvens-Stevelinck, Hans Bots and Jens Häseler, Paris, Honoré Champion, 2005, pp. 184–209
|
||||
McKie, Douglas. "The arrest and imprisonment of Henry Oldenburg". Notes and Records of the Royal Society of London. 6 (1948/49): 28–47.
|
||||
Thomas Elsmann: Im Schatten des Kaufmanns – Bremische Gelehrte 1600–1900, Bremen: Schünemann 2012, S. 80–99
|
||||
|
||||
|
||||
== External links ==
|
||||
Media related to Henry Oldenburg at Wikimedia Commons
|
||||
Works by Henry Oldenburg at Project Gutenberg
|
||||
Works by or about Henry Oldenburg at the Internet Archive
|
||||
The Correspondence of Henry Oldenburg in EMLO
|
||||
29
data/en.wikipedia.org/wiki/Iconography_of_correlations-0.md
Normal file
@ -0,0 +1,29 @@
|
||||
---
|
||||
title: "Iconography of correlations"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Iconography_of_correlations"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:15.084817+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
In exploratory data analysis, the iconography of correlations, or representation of correlations, is a data visualization technique which replaces a numeric correlation matrix by its graphical projection onto a diagram. Correlations between variables are represented by solid lines (positive correlations) or dotted lines (negative correlations) plotted on the diagram, where shorter lengths or thicker lines (or both) represent greater projection components and thus greater degree of correlation.
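The drawing convention described above can be sketched in code. The following is a deliberately simplified illustration, not the full method: it only turns pairwise Pearson correlations into link descriptions (solid for positive, dotted for negative, wider for stronger). The function names, the `threshold` parameter, and the width formula are assumptions made for this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant data assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_links(columns, threshold=0.5):
    """Return one drawable link per variable pair with |r| >= threshold.

    `columns` maps a variable name to its list of observations.
    The threshold and the width formula are illustrative assumptions.
    """
    links = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            links.append({
                "pair": (a, b),
                "r": r,
                "style": "solid" if r > 0 else "dotted",
                "width": 1 + 4 * abs(r),  # thicker line = stronger correlation
            })
    return links


# Hypothetical data set with three variables.
data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # rises with x
    "z": [5, 4, 3, 2, 1],   # falls as x rises
}
links = correlation_links(data)
```

The resulting list could be handed to any graph-drawing tool; the choice of layout, which determines link lengths, is left open here.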
|
||||
|
||||
|
||||
== History ==
|
||||
The idea of using graphical models to represent correlations has a notable application in genomic mapping, though the iconography of correlations is more general in that it does not assume a probability distribution: it relies only on representing the correlation coefficients geometrically.
|
||||
The iconography of correlations dates to 1975; it was applied to marine geochemistry in a 1981 thesis and later in a 1982 data analysis article. Afterwards, the method was applied widely in the aerospace industry, but for about fifteen years manufacturers kept it fairly confidential; generally, they preferred not to broadcast useful techniques to their competitors.
|
||||
In 1997 the first company was incorporated to distribute iconography of correlations software. Since then the topic of iconography of correlations has been incorporated into university courses, and typical topical articles' citation lists have rapidly and greatly expanded, particularly in the fields of medicine and mass spectrometry.
|
||||
|
||||
|
||||
== See also ==
|
||||
Bayesian network – Probabilistic graphical representation of causal relationships
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== External links ==
|
||||
Lesty, Michel. "Une autre façon de présenter l'iconographie des corrélations" [Another way to present the iconography of correlations]. La représentation multidimensionnelle [Multidimensional representation]. coryent.com (commercial site). Tutorials (in French). Versailles, FR: CORYENT Conseil / Logiciel CORICO.
|
||||
Girondot, Marc (17 January 2018). "Multivariable analysis and correlation of iconography". biostatsr.blogspot.com (academic's blog). Saclay, FR. ... about biostatistics ... by Professor Marc Girondot, University Paris Saclay. — Brief introduction to correlation of iconography by a biostatistician.
|
||||
32
data/en.wikipedia.org/wiki/Ingelfinger_rule-0.md
Normal file
@ -0,0 +1,32 @@
|
||||
---
|
||||
title: "Ingelfinger rule"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Ingelfinger_rule"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:52:57.068716+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
The Ingelfinger rule is named after Franz J. Ingelfinger, the New England Journal of Medicine (NEJM) editor-in-chief who enunciated it in 1969. The rule, as originally articulated in the editorial "Definition of 'Sole Contribution'", stated that NEJM would not publish findings that had been published elsewhere. Though originally meant only for NEJM, the guideline was subsequently adopted by several other scientific journals, and it has shaped scientific publishing ever since. Historically it has also helped to ensure that the journal's content is fresh and does not duplicate content previously reported elsewhere, and it seeks to protect the scientific embargo system.
|
||||
A similar policy had been expressed earlier, in 1960, by Samuel Goudsmit, editor of Physical Review Letters, but it did not become as well known.
|
||||
The Ingelfinger rule has been seen as having the aim of preventing authors from performing duplicate publications which would unduly inflate their publication record. On the other hand, it has also been stated that the real reason for the Ingelfinger rule is to protect the journals' revenue stream, and with the increase in popularity of preprint servers such as arXiv, bioRxiv, and HAL many journals have loosened their requirements concerning the Ingelfinger rule. In a defense of the policy, the journal said in an editorial that the practice discouraged scientists from talking to the media before their work was peer reviewed.
|
||||
|
||||
|
||||
== See also ==
|
||||
List of academic journals by preprint policy
|
||||
News embargo
|
||||
Network effect
|
||||
|
||||
|
||||
== References ==
|
||||
|
||||
|
||||
== Further reading ==
|
||||
Relman, AS (1981). "The Ingelfinger Rule". The New England Journal of Medicine. 305 (14): 824–6. doi:10.1056/NEJM198110013051408. PMID 7266634.
|
||||
Spain, A (26 February 2011). "Casting a critical eye on the embargo system: one year of Embargo Watch". Association of British Science Writers. Archived from the original on 2017-03-25. Retrieved 2017-03-24.
|
||||
Altman, LK (1996). "The Ingelfinger rule, embargoes, and journal peer review–Part 1". The Lancet. 347 (9012): 1382–6. doi:10.1016/S0140-6736(96)91016-8. PMID 8637347. S2CID 44524038.
|
||||
Toy, J (2002). "The Ingelfinger Rule: Franz Ingelfinger at the New England Journal of Medicine 1967–77" (PDF). Science Editor. 25 (6): 195–198.
|
||||
Harnad, S (2000). "Ingelfinger Over-Ruled: The Role of the Web in the Future of Refereed Medical Journal Publishing". The Lancet Perspectives. 356: s16. doi:10.1016/S0140-6736(00)92002-6. PMID 11191471.
|
||||
White, E (2014). "Why the Ecology Letters editorial board should reconsider its No vote on preprints". Jabberwocky Ecology.
|
||||
Desjardins-Proulx, P; White, EP; Adamson, JJ; Ram, K; Poisot, T; Gravel, D (2013). "The Case for Open Preprints in Biology". PLOS Biology. 11 (5) e1001563. doi:10.1371/journal.pbio.1001563. PMC 3653830. PMID 23690752.
|
||||
@ -4,7 +4,7 @@ chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Interdisciplinary_peer_review"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T06:27:44.415051+00:00"
|
||||
date_saved: "2026-05-05T09:52:58.269045+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
|
||||
625
data/en.wikipedia.org/wiki/Interleaving_distance-0.md
Normal file
@ -0,0 +1,625 @@
|
||||
---
|
||||
title: "Interleaving distance"
|
||||
chunk: 1/1
|
||||
source: "https://en.wikipedia.org/wiki/Interleaving_distance"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:16.289352+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
In topological data analysis, the interleaving distance is a measure of similarity between persistence modules, a common object of study in topological data analysis and persistent homology. The interleaving distance was first introduced by Frédéric Chazal et al. in 2009. Since then, it and its generalizations have been a central consideration in the study of applied algebraic topology and topological data analysis.
|
||||
|
||||
|
||||
== Definition ==
|
||||
A persistence module $\mathbb{V}$ is a collection $(V_t \mid t \in \mathbb{R})$ of vector spaces indexed over the real line, along with a collection $(v_t^s : V_s \to V_t \mid s \leq t)$ of linear maps such that $v_t^t$ is always an isomorphism, and the relation $v_t^s \circ v_s^r = v_t^r$ is satisfied for every $r \leq s \leq t$. The case of $\mathbb{R}$ indexing is presented here for simplicity, though the interleaving distance can be readily adapted to more general settings, including multi-dimensional persistence modules.
|
||||
Let $\mathbb{U}$ and $\mathbb{V}$ be persistence modules. Then for any $\delta \in \mathbb{R}$, a $\delta$-shift is a collection $(\phi_t : U_t \to V_{t+\delta} \mid t \in \mathbb{R})$ of linear maps between the persistence modules that commute with the internal maps of $\mathbb{U}$ and $\mathbb{V}$.
|
||||
The persistence modules $\mathbb{U}$ and $\mathbb{V}$ are said to be $\delta$-interleaved if there are $\delta$-shifts $\phi_t : U_t \to V_{t+\delta}$ and $\psi_t : V_t \to U_{t+\delta}$ such that the following diagrams commute for all $s \leq t$.
|
||||
|
||||
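The diagrams themselves are not reproduced in this text. As a sketch, assuming the standard definition and writing $u_t^s$ for the internal maps of $\mathbb{U}$ (a label introduced only for this note), their commutativity is usually taken to mean the following identities for all $s \leq t$:

$$
\begin{aligned}
\phi_t \circ u_t^s &= v_{t+\delta}^{s+\delta} \circ \phi_s, &
\psi_t \circ v_t^s &= u_{t+\delta}^{s+\delta} \circ \psi_s, \\
\psi_{t+\delta} \circ \phi_t &= u_{t+2\delta}^{t}, &
\phi_{t+\delta} \circ \psi_t &= v_{t+2\delta}^{t}.
\end{aligned}
$$

The first row says that each shift is compatible with the internal maps; the second row says that going across and back agrees with moving forward by $2\delta$ inside a single module.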
It follows from the definition that if $\mathbb{U}$ and $\mathbb{V}$ are $\delta$-interleaved for some $\delta$, then they are also $(\delta + \varepsilon)$-interleaved for any positive $\varepsilon$. Therefore, in order to find the closest interleaving between the two modules, we must take the infimum across all possible interleavings.
|
||||
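One way to see why a $\delta$-interleaving yields a $(\delta + \varepsilon)$-interleaving, sketched here with $u_t^s$ denoting the internal maps of $\mathbb{U}$ (a label assumed for this note), is to post-compose given $\delta$-shifts $\phi$ and $\psi$ with internal maps:

$$
\phi'_t = v_{t+\delta+\varepsilon}^{t+\delta} \circ \phi_t : U_t \to V_{t+\delta+\varepsilon},
\qquad
\psi'_t = u_{t+\delta+\varepsilon}^{t+\delta} \circ \psi_t : V_t \to U_{t+\delta+\varepsilon}.
$$

Because the original shifts commute with the internal maps, $\phi'$ and $\psi'$ should again form an interleaving, now with parameter $\delta + \varepsilon$.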
The interleaving distance between two persistence modules $\mathbb{U}$ and $\mathbb{V}$ is defined as $d_I(\mathbb{U}, \mathbb{V}) = \inf\{\delta \mid \mathbb{U} \text{ and } \mathbb{V} \text{ are } \delta\text{-interleaved}\}$.
|
||||
|
||||
|
||||
== Properties ==
|
||||
|
||||
|
||||
=== Metric properties ===
|
||||
It can be shown that the interleaving distance satisfies the triangle inequality. Namely, given three persistence modules $\mathbb{U}$, $\mathbb{V}$, and $\mathbb{W}$, the inequality $d_I(\mathbb{U}, \mathbb{W}) \leq d_I(\mathbb{U}, \mathbb{V}) + d_I(\mathbb{V}, \mathbb{W})$ is satisfied.
|
||||
On the other hand, there are examples of persistence modules that are not isomorphic but that have interleaving distance zero. Furthermore, if no suitable $\delta$ exists then two persistence modules are said to have infinite interleaving distance. These two properties make the interleaving distance an extended pseudometric, which means non-identical objects are allowed to have distance zero, and objects are allowed to have infinite distance, but the other properties of a proper metric are satisfied.
|
||||
Further metric properties of the interleaving distance and its variants were investigated by Luis Scoccola in 2020.
|
||||
|
||||
|
||||
=== Computational complexity ===
|
||||
Computing the interleaving distance between two single-parameter persistence modules can be accomplished in polynomial time. On the other hand, it was shown in 2018 that computing the interleaving distance between two multi-dimensional persistence modules is NP-hard.
|
||||
|
||||
|
||||
== References ==
|
||||
162
data/en.wikipedia.org/wiki/Item_tree_analysis-0.md
Normal file
@ -0,0 +1,162 @@
|
||||
---
|
||||
title: "Item tree analysis"
|
||||
chunk: 1/2
|
||||
source: "https://en.wikipedia.org/wiki/Item_tree_analysis"
|
||||
category: "reference"
|
||||
tags: "science, encyclopedia"
|
||||
date_saved: "2026-05-05T09:54:17.467755+00:00"
|
||||
instance: "kb-cron"
|
||||
---
|
||||
|
||||
Item tree analysis (ITA) is a data analytical method which allows constructing a hierarchical structure on the items of a questionnaire or test from observed response patterns. Assume that we have a questionnaire with m items and that subjects can answer positive (1) or negative (0) to each of these items, i.e. the items are dichotomous. If n subjects answer the items this results in a binary data matrix D with m columns and n rows.
Typical examples of this data format are test items which can be solved (1) or failed (0) by subjects. Other typical examples are questionnaires where the items are statements to which subjects can agree (1) or disagree (0).
Depending on the content of the items it is possible that the response of a subject to an item j determines her or his responses to other items. It is, for example, possible that each subject who agrees to item j will also agree to item i. In this case we say that item j implies item i (short $i \rightarrow j$). The goal of an ITA is to uncover such deterministic implications from the data set D.
|
||||
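The notion of a deterministic implication can be illustrated with a small sketch. It is not one of the ITA algorithms, which are built to cope with response errors; it only lists the item pairs for which the data matrix D contains no counterexample, i.e. no subject who answers 1 on item j and 0 on item i. The function names and the sample matrix are assumptions made for this illustration.

```python
def counterexamples(data, j, i):
    """Count subjects who answer 1 on item j but 0 on item i."""
    return sum(1 for row in data if row[j] == 1 and row[i] == 0)


def deterministic_implications(data):
    """Return all pairs (j, i), j != i, for which item j implies item i.

    `data` is the binary matrix D as a list of n rows with m entries each.
    """
    m = len(data[0])
    return [
        (j, i)
        for j in range(m)
        for i in range(m)
        if j != i and counterexamples(data, j, i) == 0
    ]


# Hypothetical responses of 4 subjects to 3 items (columns 0, 1, 2).
D = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
implications = deterministic_implications(D)
```

In this sample every subject who agrees to item 2 also agrees to items 1 and 0, so the pairs found form a chain. With real data, a few erroneous responses would remove pairs from the list, which is why practical methods tolerate some counterexamples.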
|
||||
== Algorithms for ITA ==
|
||||
ITA was originally developed by Van Leeuwe in 1974. The result of his algorithm, which we refer to in the following as Classical ITA, is a logically consistent set of implications $i \rightarrow j$. Logically consistent means that if i implies j and j implies k then i implies k for each triple i, j, k of items. Thus the outcome of an ITA is a reflexive and transitive relation on the item set, i.e. a quasi-order on the items.
|
||||
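The logical-consistency requirement can be made concrete with a short sketch. It treats a set of implications as pairs (a, b) meaning "a implies b", checks reflexivity and transitivity, and computes the smallest quasi-order containing a given set of pairs. This is a generic illustration of the quasi-order property, not the construction used by classical or inductive ITA; all names are assumptions.

```python
from itertools import product


def is_quasi_order(items, relation):
    """Check that `relation` (a set of pairs) is reflexive and transitive on `items`."""
    reflexive = all((a, a) in relation for a in items)
    transitive = all(
        (a, c) in relation
        for (a, b), (b2, c) in product(relation, repeat=2)
        if b == b2
    )
    return reflexive and transitive


def quasi_order_closure(items, relation):
    """Smallest reflexive and transitive relation containing `relation`."""
    closure = set(relation) | {(a, a) for a in items}
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (b2, c) in product(list(closure), repeat=2):
            if b == b2 and (a, c) not in closure:
                closure.add((a, c))
                changed = True
    return closure


items = ["i", "j", "k"]
raw = {("i", "j"), ("j", "k")}  # i implies j, j implies k
closed = quasi_order_closure(items, raw)
```

Here `raw` is not yet a quasi-order because the pair ("i", "k") and the reflexive pairs are missing; `closed` adds exactly those.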
A different algorithm to perform an ITA was suggested in Schrepp (1999). This algorithm is called Inductive ITA.
|
||||
Classical ITA and inductive ITA both construct a quasi-order on the item set by explorative data analysis. But both methods use a different algorithm to construct this quasi-order. For a given data set the resulting quasi-orders from classical and inductive ITA will usually differ.
|
||||
A detailed description of the algorithms used in classical and inductive ITA can be found in Schrepp (2003) or Schrepp (2006)[1]. In a recent paper (Sargin & Ünlü, 2009) some modifications to the algorithm of inductive ITA are proposed, which improve the ability of this method to detect the correct implications from data (especially in the case of higher random response error rates).
|
||||
|
||||
== Relation to other methods ==
|
||||
ITA belongs to a group of data analysis methods called Boolean analysis of questionnaires. Boolean analysis was introduced by Flament in 1976. The goal of a Boolean analysis is to detect deterministic dependencies (formulas from Boolean logic connecting the items, for example $i \rightarrow j$, $i \wedge j \rightarrow k$, and $i \vee j \rightarrow k$) between the items of a questionnaire or test.
Since the basic work of Flament (1976) a number of different methods for Boolean analysis have been developed. See, for example, Van Buggenhaut and Degreef (1987), Duquenne (1987) or Theuns (1994). These methods share the goal of deriving deterministic dependencies between the items of a questionnaire from data, but differ in the algorithms used to reach this goal. A comparison of ITA to other methods of Boolean data analysis can be found in Schrepp (2003).
|
||||
|
||||
== Applications ==
|
||||
Several research papers describe concrete applications of item tree analysis.
Held and Korossy (1998) analyze implications on a set of algebra problems with classical ITA. Item tree analysis is also used in a number of social science studies to get insight into the structure of dichotomous data. In Bart and Krus (1973), for example, a predecessor of ITA is used to establish a hierarchical order on items that describe socially unaccepted behavior. In Janssens (1999) a method of Boolean analysis is used to investigate the integration process of minorities into the value system of the dominant culture. Schrepp describes several applications of inductive ITA in the analysis of dependencies between items of social science questionnaires.
|
||||
|
||||
== Example of an application ==
|
||||
To show the possibilities of an analysis of a data set by ITA we analyse the statements of question 4 of the International Social Science Survey Programme (ISSSP) for the year 1995 by inductive and classical ITA.
|
||||
The ISSSP is a continuing annual program of cross-national collaboration on surveys covering important topics for social science research. The program conducts each year one survey with comparable questions in each of the participating nations. The theme of the 1995 survey was national identity. We analyze the results for question 4 for the data set of Western Germany.
|
||||
The statement for question 4 was:
|
||||
Some people say the following things are important for being truly German. Others say they are not important. How important do you think each of the following is:
|
||||
1. to have been born in Germany
|
||||
2. to have German citizenship
|
||||
3. to have lived in Germany for most of one’s life
|
||||
4. to be able to speak German
|
||||
5. to be a Christian
|
||||
6. to respect Germany’s political institutions
|
||||
7. to feel German
|
||||
The subjects had the response possibilities Very important, Important, Not very important, Not important at all, and Can’t choose to answer the statements.
|
||||
To apply ITA to this data set we changed the answer categories. Very important and Important are coded as 1. Not very important and Not important at all are coded as 0. Can’t choose was handled as missing data.
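The recoding step can be sketched as follows. The mapping mirrors the rule stated above; how missing data were then treated is not specified in the text, so dropping incomplete rows is only one possible choice and is an assumption of this sketch, as are the function names and the sample answers.

```python
# Mapping taken from the coding rule described above; None marks missing data.
CODING = {
    "Very important": 1,
    "Important": 1,
    "Not very important": 0,
    "Not important at all": 0,
    "Can't choose": None,
}


def recode(responses):
    """Turn one subject's seven verbal answers into dichotomous codes."""
    return [CODING[answer] for answer in responses]


def complete_cases(rows):
    """Keep only the rows without missing data (one possible treatment)."""
    return [row for row in rows if None not in row]


# Hypothetical answers of two subjects to the seven statements.
raw = [
    ["Very important", "Important", "Not very important", "Very important",
     "Not important at all", "Important", "Important"],
    ["Important", "Can't choose", "Important", "Very important",
     "Not very important", "Important", "Very important"],
]
coded = [recode(subject) for subject in raw]
usable = complete_cases(coded)
```

The rows in `usable` form a binary data matrix of the kind ITA takes as input.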
|
||||
The following figure shows the resulting quasi-orders $\leq_{IITA}$ from inductive ITA and $\leq_{CITA}$ from classical ITA.
|
||||
|
||||
== Available software ==
|
||||
The program ITA 2.0 implements both classical and inductive ITA. The program is available on [2]. A short documentation of the program is available in [3].
|
||||
|
||||
== See also ==
|
||||
Item response theory
|
||||
Some files were not shown because too many files have changed in this diff.