diff --git a/_index.db b/_index.db index 8c1589c8b..6467fdb78 100644 Binary files a/_index.db and b/_index.db differ diff --git a/data/en.wikipedia.org/wiki/Academic_Torrents-0.md b/data/en.wikipedia.org/wiki/Academic_Torrents-0.md index 5f1b71bdf..7853dde1a 100644 --- a/data/en.wikipedia.org/wiki/Academic_Torrents-0.md +++ b/data/en.wikipedia.org/wiki/Academic_Torrents-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Academic_Torrents" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:31:28.240761+00:00" +date_saved: "2026-05-05T09:52:40.258053+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Academic_journal-0.md b/data/en.wikipedia.org/wiki/Academic_journal-0.md new file mode 100644 index 000000000..ca963e006 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Academic_journal-0.md @@ -0,0 +1,34 @@ +--- +title: "Academic journal" +chunk: 1/5 +source: "https://en.wikipedia.org/wiki/Academic_journal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:46.563930+00:00" +instance: "kb-cron" +--- + +An academic journal or scholarly journal is a periodical publication in which scholarship relating to a particular academic discipline is published. They serve as permanent and transparent forums for the dissemination, scrutiny, and discussion of research. Unlike professional magazines or trade magazines, the articles are mostly written by researchers rather than staff writers employed by the journal. They nearly universally require peer review for research articles or other scrutiny from contemporaries competent and established in their respective fields. Academic journals trace their origins back to the 17th century, with the Philosophical Transactions of the Royal Society being established in 1665 as the first scientific journal. 
+As of 2012, it is estimated that over 28,100 active academic journals are in publication, with scopes ranging from the general sciences, as seen in journals like Science and Nature, to highly specialized fields. These journals publish a variety of articles including original research, review articles, and perspectives. The advent of electronic publishing has made academic journals more accessible. + +== Content == + +Content usually takes the form of articles presenting original research, review articles, or book reviews. The purpose of an academic journal, according to Henry Oldenburg (the first editor of Philosophical Transactions of the Royal Society), is to give researchers a venue to "impart their knowledge to one another, and contribute what they can to the Grand design of improving natural knowledge, and perfecting all Philosophical Arts, and Sciences." +The term academic journal applies to scholarly publications in all fields; this includes journals that cover formal sciences, natural sciences, social sciences, and humanities, which differ somewhat from each other in form and function. Academic journals in the formal and natural sciences are often called scientific journals. Most journals are highly specialized, although some of the oldest journals such as Science and Nature publish articles and scientific papers across a wide range of scientific fields. +Although academic journals are superficially similar to professional magazines (or trade journals), they are quite different. Articles in academic journals are written by active researchers such as students, scientists, and professors. Their intended audience is others in the field, meaning their content is highly technical. Academic articles also deal with research, and are peer reviewed. Meanwhile, trade journals are aimed at people in different fields, focusing on how people in those fields can do their jobs better. 
+Active academic researchers are expected to publish their work in academic journals, and public funding bodies often require research results to be published in academic journals. Academic credentials for promotion into academic ranks are established in large part by the number and impact of scientific articles published. This places pressure on researchers to publish articles frequently – an environment known as publish or perish. + +== History == + +In the 17th century, scientists wrote letters to each other, and included scientific ideas with them. Then, in the mid-17th century, scientists began to hold meetings and share their scientific ideas. Eventually, these meetings led to the founding of organizations, such as the Royal Society (1660) and the French Academy of Sciences (1666). +The idea of a published journal with the purpose of "[letting] people know what is happening in the Republic of Letters" was first conceived by François Eudes de Mézeray in 1663. A publication titled Journal littéraire général was supposed to be published to fulfill that goal, but never was. Humanist scholar Denis de Sallo (under the pseudonym "Sieur de Hédouville") and printer Jean Cusson took Mézeray's idea, and obtained a royal privilege from King Louis XIV on 8 August 1664 to establish the Journal des sçavans. The journal's first issue was published on 5 January 1665. It was aimed at people of letters, and had four main objectives: + +review newly published major European books, +publish the obituaries of famous people, +report on discoveries in arts and science, and +report on the proceedings and censures of both secular and ecclesiastical courts, as well as those of universities both in France and outside. +Soon after, the Royal Society established Philosophical Transactions of the Royal Society in March 1665, and the Académie des Sciences established the Mémoires de l'Académie des Sciences in 1666, which focused on scientific communications. 
By the end of the 18th century, nearly 500 such periodicals had been published, the vast majority coming from Germany (304 periodicals), France (53), and England (34). Several of those publications, in particular the German journals, tended to be short-lived (under five years). A.J. Meadows has estimated the proliferation of journals to reach 10,000 journals in 1950, and 71,000 in 1987. Michael Mabe wrote that the estimates will vary depending on the definition of what exactly counts as a scholarly publication, but that the growth rate has been "remarkably consistent over time", with an average rate of 3.46% per year from 1800 to 2003. +In 1733, Medical Essays and Observations was established by the Medical Society of Edinburgh as the first fully peer-reviewed journal. Peer review was introduced as an attempt to increase the quality and pertinence of submissions. Other important events in the history of academic journals include the establishment of Nature (1869) and Science (1880), the establishment of Postmodern Culture in 1990 as the first online-only journal, the foundation of arXiv in 1991 for the dissemination of preprints to be discussed prior to publication in a journal, and the establishment of PLOS One in 2006 as the first megajournal. +Peer review did not begin until the 1970s, and was seen as a way of enabling researchers who were not as well-known to have their papers published in journals that were more prestigious. Though it was originally done by mailing copies of papers to reviewers, it is now done online. 
+ +== Scholarly articles == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Academic_journal-1.md b/data/en.wikipedia.org/wiki/Academic_journal-1.md new file mode 100644 index 000000000..aca8c09e8 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Academic_journal-1.md @@ -0,0 +1,40 @@ +--- +title: "Academic journal" +chunk: 2/5 +source: "https://en.wikipedia.org/wiki/Academic_journal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:46.563930+00:00" +instance: "kb-cron" +--- + +There are two kinds of article or paper submissions in academia: solicited, where an individual has been invited to submit work either through direct contact or through a general submissions call, and unsolicited, where an individual submits a work for potential publication without directly being asked to do so. Upon receipt of a submitted article, editors at the journal determine whether to reject the submission outright or begin the process of peer review. In the latter case, the submission becomes subject to review by outside scholars of the editor's choosing who typically remain anonymous. The number of these peer reviewers (or "referees") varies according to each journal's editorial practice – typically, no fewer than two, though sometimes three or more, experts in the subject matter of the article produce reports upon the content, style, and other factors, which inform the editors' publication decisions. +Though these reports are generally confidential, some journals and publishers also practice public peer review. The editors either choose to reject the article, ask for a revision and resubmission, or accept the article for publication. Even accepted articles are often subjected to further (sometimes considerable) editing by journal editorial staff before they appear in print. The peer review can take from several weeks to several months. +Many journal articles are broadly structured according to the IMRAD scheme. 
Each article has several sections, often including the following: + +The title; +Information about the author(s); +The abstract, which is a one-paragraph summary of the article; +The introduction, including a background, why the research was done, research on this topic that has been done before, and (possibly) a hypothesis; +The methodology or method, which includes the way the research was done, details concerning the study's sample, measures for assessment, and the procedure; +Findings or results, which summarize what the study found; +Conclusion, comments, or discussion, which both explain how the results answered the questions that were posed, as well as areas that could be researched in the future; +A list of works that the article's author cited. +Reading an article in an academic journal usually entails first reading the title, to see if it is related to the desired topic. If it is, the next step is to read the abstract (or summary or conclusion, if the abstract is missing), to determine if the article is worth reading. +Publishing research results is an essential part of helping science to advance. If scientists are describing experiments or calculations, they should also explain how they did them so that an independent researcher could repeat the experiment or calculation to verify the results, or so that they could evaluate whatever the research article's findings were. Each journal article becomes part of the permanent scientific record. + +=== Types of article === +Articles can also be categorized by their purpose. The exact terminology and definitions vary by field and specific journal, but often include: + +Letters (also called communications, and not to be confused with letters to the editor) are short descriptions of important current research findings that are usually fast-tracked for immediate publication because they are considered urgent. 
+Research notes are short descriptions of current research findings that are considered less urgent or important than Letters. +Articles are usually between five and twenty pages and are complete descriptions of current original research findings, but there are considerable variations between different fields and journals—80-page articles are not rare in mathematics or theoretical computer science. +Supplemental articles contain a large volume of tabular data that is the result of current research and may be dozens or hundreds of pages with mostly numerical data. Some journals now only publish this data electronically on the Internet. Supplemental information also contains other voluminous material not appropriate for the main body of the article, like descriptions of routine procedures, derivations of equations, source code, non-essential data, spectra or other such miscellaneous information. +A target article in a journal is one which argues a case, to which other authors submit a commentary or a response. There may be a final response from the author of the target article. See, for example, Alison Gopnik's article How we know our minds: The illusion of first-person knowledge of intentionality in the journal Behavioral and Brain Sciences, Volume 16, Issue 1 (1993), which was one of a pair of "target articles" to which other responses were published in the same volume. +Review articles do not cover original research but rather accumulate the results of many different articles on a particular topic into a coherent narrative about the state of the art in that field. Review articles provide information about the topic and also provide journal references to the original research. Reviews may be entirely narrative, or may provide quantitative summary estimates resulting from the application of meta-analytical methods. +Data papers are articles dedicated to describe datasets. 
This type of article is becoming popular and journals exclusively dedicated to them have been established, e.g. Scientific Data and Earth System Science Data. +Video papers are a recent addition to practice of academic publications. They most often combine an online video demonstration of a new technique or protocol with a rigorous textual description. + +== Reviewing == + +=== Review articles === \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Academic_journal-2.md b/data/en.wikipedia.org/wiki/Academic_journal-2.md new file mode 100644 index 000000000..1d4e91560 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Academic_journal-2.md @@ -0,0 +1,31 @@ +--- +title: "Academic journal" +chunk: 3/5 +source: "https://en.wikipedia.org/wiki/Academic_journal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:46.563930+00:00" +instance: "kb-cron" +--- + +Review articles, also called "reviews of progress", are checks on the research published in journals. Some journals are devoted entirely to review articles, some contain a few in each issue, and others do not publish review articles. Such reviews often cover the research from the preceding year, some for longer or shorter terms; some are devoted to specific topics, some to general surveys. Some reviews are enumerative, listing all significant articles in a given subject; others are selective, including only what they think worthwhile. Yet others are evaluative, judging the state of progress in the subject field. Some journals are published in series, each covering a complete subject field year, or covering specific fields through several years. +Unlike original research articles, review articles tend to be solicited or "peer-invited" submissions, often planned years in advance, which may themselves go through a peer-review process once received. 
They are typically relied upon by students beginning a study in a given field, or for current awareness of those already in the field. + +=== Book reviews === + +Reviews of scholarly books are checks upon the research books published by scholars; unlike articles, book reviews tend to be solicited. Journals typically have a separate book review editor determining which new books to review and by whom. If an outside scholar accepts the book review editor's request for a book review, he or she generally receives a free copy of the book from the journal in exchange for a timely review. Publishers send books to book review editors in the hope that their books will be reviewed. The length and depth of research book reviews varies much from journal to journal, as does the extent of textbook and trade book review. + +== Prestige and ranking == + +An academic journal's prestige is established over time, and can reflect many factors, some but not all of which are expressible quantitatively. In many fields, a formal or informal hierarchy of academic journals exists; the most prestigious journal in a field tends to be the most selective in terms of the articles it will select for publication, and usually will also have the highest impact factor. In some countries, journal rankings can be utilized for funding decisions and even evaluation of individual researchers, although they are poorly suited for that purpose. +In each academic discipline, some journals receive a high number of submissions and opt to restrict how many they publish, keeping the acceptance rate low. Size or prestige are not a guarantee of reliability. +In the natural sciences and in the social sciences, the impact factor is an established proxy, measuring the number of later articles citing articles already published in the journal. There are other quantitative measures of prestige, such as the overall number of citations, how quickly articles are cited, and the average "half-life" of articles. 
Clarivate Analytics' Journal Citation Reports, which among other features, computes an impact factor for academic journals, draws data for computation from the Science Citation Index Expanded (for natural science journals), and from the Social Sciences Citation Index (for social science journals). Several other metrics are also used, including the SCImago Journal Rank, CiteScore, Eigenfactor, and Altmetrics. +In the Anglo-American humanities, there is no tradition (as there is in the sciences) of giving impact-factors that could be used in establishing a journal's prestige. Recent moves have been made by the European Science Foundation (ESF) to change the situation, resulting in the publication of preliminary lists for the ranking of academic journals in the humanities. These rankings have been severely criticized, notably by history and sociology of science British journals that have published a common editorial entitled "Journals under Threat". Though it did not prevent ESF and some national organizations from proposing journal rankings, it largely prevented their use as evaluation tools. +In some disciplines such as knowledge management/intellectual capital, the lack of a well-established journal ranking system is perceived by academics as "a major obstacle on the way to tenure, promotion and achievement recognition". Conversely, a significant number of scientists and organizations consider the pursuit of impact factor calculations as inimical to the goals of science, and have signed the San Francisco Declaration on Research Assessment to limit its use. 
+Three categories of techniques have developed to assess journal quality and create journal rankings: + +stated preference; +revealed preference; and +publication power approaches + +== Costs == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Academic_journal-3.md b/data/en.wikipedia.org/wiki/Academic_journal-3.md new file mode 100644 index 000000000..1bfc5c7a0 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Academic_journal-3.md @@ -0,0 +1,26 @@ +--- +title: "Academic journal" +chunk: 4/5 +source: "https://en.wikipedia.org/wiki/Academic_journal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:46.563930+00:00" +instance: "kb-cron" +--- + +Many academic journals are subsidized by universities or professional organizations, and do not exist to make a profit. They often accept advertising, page and image charges from authors to pay for production costs. On the other hand, some journals are produced by commercial publishers who do make a profit by charging subscriptions to individuals and libraries. They may also sell all of their journals in discipline-specific collections or a variety of other packages. Many scientists and librarians have long protested these costs, especially as they see these payments going to large for-profit publishing houses. To allow their researchers online access to journals, many universities purchase site licenses, permitting access from anywhere in the university, and, with appropriate authorization, by university-affiliated users at home or elsewhere. These may be much more expensive than the cost for a print subscription. Despite the transition to electronic publishing, the costs of site licenses continue to rise relative to universities' budgets. This is known as the serials crisis. +Journal editors tend to have other professional responsibilities, most often as teaching professors. In the case of the largest journals, there are paid staff assisting in the editing. 
The production of the journals is almost always done by publisher-paid staff. Humanities and social science academic journals are usually subsidized by universities or professional organizations. +Traditional scientific journals require a paid subscription to access published articles. +The cost and value proposition of subscription to academic journals is being continuously re-assessed by institutions worldwide. In the context of the big deal cancellations by several library systems in the world, data analysis tools like Unpaywall Journals are used by libraries to estimate the specific cost and value of the various options: libraries can avoid subscriptions for materials already served by instant open access via open archives like PubMed Central. +Concerns about cost and open access have led to the creation of free-access journals such as the Public Library of Science (PLoS) family and partly open or reduced-cost journals such as the Journal of High Energy Physics. However, professional editors still have to be paid, and PLoS still relies heavily on donations from foundations to cover the majority of its operating costs; smaller journals do not often have access to such resources. Open access journals may charge authors a fee for review or publication, rather than charging readers a fee for access. + +== Reproducibility and replicability == +For scientific journals, reproducibility and replicability of the scientific results are core concepts that allow other scientists to check and reproduce the results under the same conditions described in the paper or at least similar conditions and produce similar results with similar measurements of the same subject or carried out under changed conditions of measurement. While the ability to reproduce the results based only on details included in the article is expected, verification of reproducibility by a third party is not generally required for publication. 
The reproducibility of results presented in an article is therefore judged implicitly by the quality of the procedures reported and agreement with the data provided. However, some journals in the field of chemistry such as Inorganic Syntheses and Organic Syntheses require independent reproduction of the results presented as part of the review process. +The inability of independent researchers to reproduce published results is widespread, with 70% of researchers reporting failure to reproduce another scientist's results, including more than half who report failing to reproduce their own experiments. Sources of irreproducibility vary, including publication of falsified or misrepresented data and poor detailing of procedures. + +== Copyright == + +Traditionally, the author of an article was required to transfer the copyright to the journal publisher. Publishers claimed this was necessary in order to protect authors' rights, and to coordinate permissions for reprints or other use. However, many authors, especially those active in the open access movement, found this unsatisfactory, and have used their influence to effect a gradual move towards a license to publish instead. Under such a system, the publisher has permission to edit, print, and distribute the article commercially, but the authors retain the other rights themselves. +Even if they retain the copyright to an article, most journals allow certain rights to their authors. These rights usually include the ability to reuse parts of the paper in the author's future work, and allow the author to distribute a limited number of copies. In the print format, such copies are called reprints; in the electronic format, they are called postprints. 
Some publishers, for example the American Physical Society, also grant the author the right to post and update the article on the author's or employer's website and on free e-print servers, to grant permission to others to use or reuse figures, and even to reprint the article as long as no fee is charged. The rise of open access journals, in which the author retains the copyright but must pay a publication charge, such as the Public Library of Science family of journals, is another recent response to copyright concerns. + +== New developments == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Academic_journal-4.md b/data/en.wikipedia.org/wiki/Academic_journal-4.md new file mode 100644 index 000000000..3847b10f2 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Academic_journal-4.md @@ -0,0 +1,55 @@ +--- +title: "Academic journal" +chunk: 5/5 +source: "https://en.wikipedia.org/wiki/Academic_journal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:46.563930+00:00" +instance: "kb-cron" +--- + +The Internet has revolutionized the production of, and access to, academic journals, with their contents available online via services subscribed to by academic libraries. Individual articles are subject-indexed in databases such as Google Scholar. Some of the smallest, most specialized journals are prepared in-house, by an academic department, and published only online – this has sometimes been in the blog format, though some, like the open access journal Internet Archaeology, use the medium to embed searchable datasets, 3D models, and interactive mapping. +Currently, there is a movement in higher education encouraging open access, either via self archiving, whereby the author deposits a paper in a disciplinary or institutional repository where it can be searched for and read, or via publishing it in a free open access journal, which does not charge for subscriptions, being either subsidized or financed by a publication fee. 
Given the goal of sharing scientific research to speed advances, open access has affected science journals more than humanities journals. Commercial publishers are experimenting with open access models, but are trying to protect their subscription revenues. +Academic journals have been criticized for persistent gender, geographic, and other representation imbalances on editorial boards. + +=== Predatory and junk journals === + +The much lower entry cost of on-line publishing has also raised concerns of an increase in publication of "junk" journals with lower publishing standards. These journals, often with names chosen as similar to well-established publications, solicit articles via e-mail and then charge the author to publish an article, often with no sign of actual review. Jeffrey Beall, a research librarian at the University of Colorado, has compiled a list of what he considers to be "potential, possible, or probable predatory scholarly open-access publishers"; the list numbered over 300 journals as of April 2013, but he estimates that there may be thousands. The OMICS Publishing Group, which publishes a number of the journals on this list, threatened to sue Beall in 2013 and Beall stopped publishing in 2017, citing pressure from his university. A US judge fined OMICS $50 million in 2019 stemming from an FTC lawsuit. +Some academic journals use the registered report format, which aims to counteract issues such as data dredging and hypothesizing after the results are known. For example, Nature Human Behaviour has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them". The European Journal of Personality defines this format: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). 
Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes." + +=== Electronic journals === + +Some journals are born digital in that they are solely published on the web and in a digital format. Though most electronic journals originated as print journals, which subsequently evolved to have an electronic version, while still maintaining a print component, others eventually became electronic-only. +An e-journal closely resembles a print journal in structure: there is a table of contents which lists the articles, and many electronic journals still use a volume/issue model, although some titles now publish on a continuous basis. Online journal articles are a specialized form of electronic document: they have the purpose of providing material for academic research and study, and they are formatted approximately like journal articles in traditional printed journals. Often, a journal article will be available for download in two formats: PDF and HTML, although other electronic file types are often supported for supplementary material. New tools such as JATS and Utopia Documents provide a 'bridge' to the 'web-versions' in that they connect the content in PDF versions directly to the World Wide Web via hyperlinks that are created 'on-the-fly'. The PDF version of an article is usually seen as the version of record, but the matter is subject to some debate. Articles are indexed in bibliographic databases as well as by search engines. E-journals allow new types of content to be included in journals, for example, video material, or the data sets on which research has been based. +With the growth and development of the Internet, there has been a growth in the number of new digital-only journals. 
A subset of these journals exist as Open Access titles, meaning that they are free to access for all, and have Creative Commons licences which permit the reproduction of content in different ways. High quality open access journals are listed in Directory of Open Access Journals. Most, however, continue to exist as subscription journals, for which libraries, organisations and individuals purchase access. +Benefits of electronically publishing include easy availability of supplementary materials (data, graphics and video), lower cost, and availability to more people, especially scientists from non-developed countries. Hence, research results from more developed nations are becoming more accessible to scientists from non-developed countries. + +== Lists == + +Databases providing detailed information about journals: +Ulrich's Global Serials Directory +Directory of Periodicals by Modern Language Association +JournalSeek by Genamics +Web of Science +Scopus +WorldCat +Journal hosting sites that also provide lists. Some sites evaluate journals, providing information such as how long a journal takes to review articles and what types of articles it publishes: +Project MUSE +JSTOR +ScienceDirect +Informaworld + +== See also == + +== Explanatory notes == + +== References == + +== Further reading == +Bakkalbasi, N; Bauer, K; Glover, J; Wang, L (2006). "Three options for citation tracking: Google Scholar, Scopus and Web of Science". Biomedical Digital Libraries. 3 7. doi:10.1186/1742-5581-3-7. PMC 1533854. PMID 16805916. +Waller, A.C. (2001). Editorial Peer Review Its Strengths and Weaknesses. ASIST monograph series. Information Today. ISBN 978-1-57387-100-6. +Ware, Mark; Mabe, Michael (2015). The STM Report: An overview of scientific and scholarly journal publishing (PDF) (4th ed.). International Association of Scientific, Technical and Medical Publishers. 
+ +== External links == + +Journal Collection at the Internet Archive \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Ads.txt-0.md b/data/en.wikipedia.org/wiki/Ads.txt-0.md new file mode 100644 index 000000000..ca3c7c130 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Ads.txt-0.md @@ -0,0 +1,39 @@ +--- +title: "Ads.txt" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Ads.txt" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:31.690614+00:00" +instance: "kb-cron" +--- + +ads.txt (Authorized Digital Sellers) is an initiative from IAB Technology Laboratory. It specifies a text file that companies can host on their web servers, listing the other companies authorized to sell their products or services. This is designed to allow online buyers to check the validity of the sellers from whom they buy, for the purposes of internet fraud prevention. + + +== State of adoption == +By November 2017, more than 44% of publishers had ads.txt files. More than 90,000 sites were using ads.txt, up from 3,500 in September 2017, according to Pixalate. Among the top 1,000 sites that sold programmatic ads, 57 percent had ads.txt files, compared to 16 percent in September, per Pixalate. +Latest adoption data per FirstImpression.io's Ads.txt Industry Dashboard: + +Google has been an active proponent of ads.txt. From the end of October 2017 Google Display & Video 360 only buys inventory from sources identified as authorized sellers in a publisher's ads.txt file, when a file is available. ads.txt may become a requirement for Display & Video 360. + + +== File format == +The IAB's ads.txt specification dictates the formatting of ads.txt files, which can contain three types of record: data records, variables and comments. An ads.txt file can include any number of records, each placed on its own line. 
+Because ads.txt files must adhere to the specified format, a range of validation, management and collaboration tools have become available to help ensure ads.txt files are created correctly. +The original specification, v1.0, was issued in 2017, with version 1.0.1 making small changes based on community feedback later that year. In 2019, version 1.0.2 made further small amendments and recommended using a placeholder record to indicate the intent of an empty ads.txt file: placeholder.example.com, placeholder, DIRECT, placeholder +In 2021, version 1.0.3 added the inventorypartnerdomain directive, with version 1.1 adding managerdomain and ownerdomain in 2022. + + +== See also == +Online advertising +robots.txt +security.txt +trust.txt + + +== References == + + +== External links == +Official website \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/African_Peer_Review_Mechanism-0.md b/data/en.wikipedia.org/wiki/African_Peer_Review_Mechanism-0.md new file mode 100644 index 000000000..bb17628cb --- /dev/null +++ b/data/en.wikipedia.org/wiki/African_Peer_Review_Mechanism-0.md @@ -0,0 +1,70 @@ +--- +title: "African Peer Review Mechanism" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/African_Peer_Review_Mechanism" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:47.737196+00:00" +instance: "kb-cron" +--- + +The African Peer Review Mechanism (APRM) is a mutually agreed instrument voluntarily acceded to by the member states of the African Union (AU) as a self-monitoring mechanism.
The APRM was launched on 9 March 2003 by the NEPAD Heads of State and Government Implementation Committee (HSGIC) in Abuja, Nigeria (NEPAD/HSGIC/03-2003/APRM/MOU (9 March 2003), Assembly Decision 198 (XI), Decision 527 (XXIII) and Decision Ext/Assembly/AU/Dec.1-4(XI); +The APRM is an African-owned and African-led platform for self-assessment, peer-learning, and experience-sharing in democracy and good governance, in full respect for democratic principles, human rights, rule of law, the acceleration of political, social and economic integration in Africa; + +== The Mandate == +The mandate of the APRM is to encourage conformity with regards to political, economic and corporate governance values, codes and standards, among African countries and the objectives in socio-economic development as well as to ensure monitoring and evaluation of AU Agenda 2063 and SDGs 2030. + +The mandate of the APRM is to ensure that policies and practices of participating Member States conform to the agreed political, economic and corporate governance values, codes and standards contained in the African Union Declaration on Democracy, Political, Economic and Corporate Governance. As a voluntary self-monitoring instrument, APRM fosters the adoption of policies, standards and practices that lead to political stability, high economic growth, sustainable development and accelerated regional and continental economic integration through sharing of experiences and best practices, including identifying deficiencies and assessing the needs for capacity building. + +== Expanded Mandate == +In 2018, during the 28th Assembly of Heads of State and Government of the AU, the APR Forum of Heads of State and Government decided to extend the APRM's mandate. This expansion includes the tracking and oversight of key governance initiatives across the continent. 
+Furthermore, the AU Assembly expanded the APRM's responsibilities to encompass monitoring the implementation of the African Union's Agenda 2063 + +and the United Nations Sustainable Development Goals (SDGs) as part of Agenda 2030. This broadened mandate aims to enhance the APRM's role in promoting governance, development, and accountability in African nations. + +Tracking governance-related aspects of Agenda 2063 and UN SDGs 2030 +Biennial Africa Governance Report completed in AGR2019, AGR2021, AGR2023 on UCG. AGR2025 (tbc) +National Governance Reporting +Support to Member States in the area of International Credit Rating Agencies + +== Africa's self-assessment for good governance == +Member countries within the APRM undertake self-monitoring in all aspects of their governance and socio-economic development. African Union (AU) stakeholders participate in the self-assessment of all branches of government – executive, legislative and judicial – as well as the private sector, civil society and the media. The APRM Review Process gives member states a space for national dialogue on governance and socio-economic indicators and an opportunity to build consensus on the way forward. + +== Four types of country reviews == +1. Base Review – carried out immediately after a country becomes a member of the APRM +2. Periodic Review every four years +3. Targeted Review – requested by the member country itself outside the framework of mandated reviews +4. A Review commissioned by the APR Forum when there are early signs of pending political and economic crisis. + +== Five Thematic Areas == +1. Democracy and Political Governance (DPG) +2. Economic Governance and Management (EGM) +3. Corporate Governance (CG) +4. Broad-based Sustainable Socio-economic Development (SED) +5. 
State Resilience to Shocks and Disasters +The APRM Principles that underpin APRM reviews include +(i) national ownership and leadership; +(ii) inclusive participation; +(iii) technical competence and +(iv) freedom from political manipulation. + +== The five stages of a peer review == + +=== 1. Consultation === +The APR Secretariat and the Country under review consult on the process overview and terms of the Memorandum of Understanding (MoU). The Country under review creates a Focal Point to liaise with the Secretariat and provide it with relevant laws, treaty ratifications, budgets and development plans. The Secretariat prepares a background assessment document. At the same time, the Country under review independently completes the APR Self-Assessment Questionnaire, gathers inputs from civil society and drafts a paper outlining the nation's issues and a National Programme of Action (NPoA) with clear steps and deadlines on how it plans to conform to APRM codes and standards, the African Union Charter, and UN obligations. The Country Review Team that is set up writes a report outlining issues to be focused on during the review mission. +2. THE REVIEW MISSION +Visits the Country under review and conducts broad-based consultations with government, officials, political parties, parliamentarians, and representatives of civil society organisations (e.g. media, academia, trade unions, professional bodies), and the private sector. The mission typically lasts two-and-a-half to three weeks. +3. DRAFT REPORT +The APR Country Review Team drafts a report on the Country under review. +4.THE PEER REVIEW +takes place at the level of the APR Forum, using the APR Panel's report on the team's findings as a basis. The APR Forum discusses these recommendations with the Reviewed Country's leadership. +5. 
FINAL REPORT +Within six months, after the peer review, the published Country Review Report must be tabled in sub-regional institutions (Pan-African Parliament, African Commission on Human and Peoples' Rights, AU Peace and Security Council, Economic, Social and Cultural Council of the African Union [AU-ECOSOCC]). The report is then made publicly available. + +== The second generation review == +The objective of the APRM Second Generation Review is to assess progress made in Governance and Socio-economic Development in Member States in the period since the Base Review. The specific objectives are to: + +reinvigorate, rationalize and institutionalize the APRM in governance reforms within a Member States. +appraise to what extent the National Programme of Action (NPoA) is implemented and its continued relevance, on the basis of which a new NPOA with a few key actions will be proposed; +facilitate the development of a second NPOA with greater focus and based only on key actions; and +make the APRM Review process more relevant to citizens' needs, more cost-effective and in tune with the Agenda 2063 priorities and goals. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/African_Peer_Review_Mechanism-1.md b/data/en.wikipedia.org/wiki/African_Peer_Review_Mechanism-1.md new file mode 100644 index 000000000..5c14556c5 --- /dev/null +++ b/data/en.wikipedia.org/wiki/African_Peer_Review_Mechanism-1.md @@ -0,0 +1,75 @@ +--- +title: "African Peer Review Mechanism" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/African_Peer_Review_Mechanism" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:47.737196+00:00" +instance: "kb-cron" +--- + +== What happens after the country review == +The National Programme of Action (NPoA) is divided into short-term, medium-term and long-term goals and is continuously monitored by the National Governance Commission/Governing Council, or a smaller body of state and non-state representatives. 
Progress Reports on implementation are presented annually to the APR Forum. The APR Secretariat follows up on commitments made, holds regional workshops to share best practices identified in the reviews, and offers technical support to fulfill APRM plans. + +== APRM Structures == + +=== APR FORUM === +(Committee of Participating Heads of State and Government) +Highest decision-making authority. +APR PANEL +(Panel of Eminent Persons) +Oversees the review process to ensure its independence, professionalism and credibility, and reports to the Forum. The APR Panel is also responsible for selecting and appointing the Review Teams. +COMMITTEE OF FOCAL POINTS +Committee of representatives of Heads of State and Government +Manages the budgetary process, resource mobilisation through Member States, Strategic and Development Partners, and the APRM Trust Fund and Audit. +National Governing Council (NGC) +The National Governance Commission/National Governing Council (NGC) is the body that oversees implementation of the APRM process at the Member State level. In addition to providing guidance in terms of policy direction, the NGC ensures professionalism, credibility and independence of the national APRM self-assessment and review processes. The NGC is composed of key stakeholder groups from government, civil society and the private sector, in line with the APRM principle of broad-based participation. + +Management of the APRM Continental Secretariat +The APRM Secretariat is currently managed by H.E. Ambassador Marie-Antoinette Rose Quatre, Chief Executive Officer (CEO). The continental structure works in collaboration with the National Focal Points and the National Commissions / National Governing Councils. +APRM SECRETARIAT +Provides technical, coordinating and administrative support services. It must have sufficient capacity for the analytical work that underpins the peer review process.
+ +== Membership of the APRM == +Membership of the APRM is voluntary and open to all African Union (AU) countries. Accession begins with an expression of interest in membership followed by the signing of a Memorandum of Understanding (MoU) between the country and the APR Forum. +As of 2024, the African Peer Review Mechanism (APRM) comprises 44 member states, with the Central African Republic (CAR) acceding during the 33rd APR Forum on February 6, 2024. Among these members, 26 countries have completed their first-generation peer reviews, 5 have undergone second-generation reviews, and 12 have participated in targeted peer reviews. + +== Strategic partners == +The APRM has entered into special support agreements with partner institutions designated by the Forum as Strategic Partners. These are: African Capacity Building Foundation (ACBF), the African Development Bank (AfDB); Mo Ibrahim Foundation; United Nations Economic Commission for Africa (UNECA); Office of the Special Advisor on Africa (OSAA); United Nations Development Programme (UNDP) Regional Bureau for Africa. + +== See also == +New Partnership for Africa's Development +African Unions + +== Bibliography == + +=== APRM documents === +APRM Base document The APRM base document +Memorandum of Understanding on the APRM The Memorandum of Understanding (MOU) +Guidelines Guidelines for Countries to prepare for and participate in the APRM +Organisation and Processes Organisation and Processes +The objectives, standards, criteria and indicators for the APRM. Retrieved 27 May 2006. +The Africa Governance Report 2019 + +=== Critiques and studies of the APRM process === +An Analysis of the Implementation of the African Peer Review Mechanism in Ghana, Kenya and Mauritius Grant Masterson, EISA, February 2005 +NEPAD’s APRM: A Progress Report, Practical Limitations and Challenges Paper by Ayesha Kajee on the APRM, 2004, retrieved 27 May 2006. 
+Becoming my brother's keeper eAfrica October 2003 by Ross Herbert, retrieved 27 May 2006. +The APRM process in Kenya: A pathway to a new state? Report by OSIEA/AfriMAP, April 2007 +Critical review of the African Peer Review Mechanism Process in Rwanda Report by Ligue des Droits de la Personne dans la Région des Grands Lacs (LDGL), January 2007 +Between Hope and Scepticism: Civil Society and the African Peer Review Mechanism Archived 29 June 2007 at the Wayback Machine Partnership Africa Canada, October 2005 +Strategies for promoting effective stakeholder participation in the African Peer Review Mechanism, UNECA, 2005 +Inadequately Self-Critical: Rwanda’s Self-Assessment for the African Peer Review Mechanism Eduard Jordaan, African Affairs, April 2006 +Effective Stakeholder Participation. in the APRM Process for the Promotion. of Democratic Governance:. A Case Study of Ghana, Eric Opoku, UNDP, December 2006 +Ghana and the APRM: A Critical Assessment, Adotey Bing-Pappoe, AfriMAP & OSIWA, June 2007 +The African Peer Review Mechanism in Mauritius: Lessons from Phase I Sheila Bunwaree, AfriMAP & OSISA, August 2007 +Further analysis of the African Peer Review Mechanism and the ECA/OECD-DAC Mutual Review of Development Effectiveness in the context of NEPAD Report to the UN High Level Task Force on the Implementation of the Right to Development, January 2008 +The African Peer Review Mechanism: Lessons from the Pioneers, Ross Herbert and Steven Gruzd, South African Institute of International Affairs, March 2008 +Civil Society Participation in Uganda’s APRM Process Juliet Nakato Odoi, SAIIA, June 2008 +Assessing South Africa’s APRM: An NGO Perspective, Nick Hutchings, Mukelani Dimba, and Alison Tilley, SAIIA, June 2008 +Addressing the African Peer Review Mechanism's Programmes of Action, Faten Aggad, SAIIA, June 2008 +Understanding APRM Research: Planning, Process and Politics. 
A Practical Handbook for Peer Review Research, George Katito, SAIIA, June 2008 +The African Peer Review Process in Nigeria, Adele Jinadu, AfriMAP, August 2008 +Benin and the African Peer Review Mechanism: Consolidating Democratic Achievements Gilles Badet, AfriMAP, August 2008 +Making the News: Why the APRM Didn't, Brendan Boyle, SAIIA, September 2008 +Media Freedom, Transparency and Governance, Raymond Louw, SAIIA, September 2008 \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/African_Theological_Studies-0.md b/data/en.wikipedia.org/wiki/African_Theological_Studies-0.md new file mode 100644 index 000000000..d7d2fef34 --- /dev/null +++ b/data/en.wikipedia.org/wiki/African_Theological_Studies-0.md @@ -0,0 +1,42 @@ +--- +title: "African Theological Studies" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/African_Theological_Studies" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:48.901352+00:00" +instance: "kb-cron" +--- + +African Theological Studies (Études Théologiques Africaines) is a book series published by Peter Lang Verlagsgruppe under the ISSN number 2196-0615. It includes, as of 2021, a total of 22 volumes written in English and French. Some editors have changed over the years since 2013, when the series was founded, but the University of Würzburg Professor Gerhard Droesser was a founding editor and serves as an editor still. The co-editors are currently Esther Hornung and Frédéric Fungula Kwilu. Chibueze Clement Udeani has served as the editor of some volumes. Many of the monographs were originally submitted as dissertations at the Catholic Theology Faculty of the University of Würzburg. According to the title pages of each volume, the series is peer-reviewed. + +== Scope of the Series == +The editors of the series seek to promote "the characteristics of African theology and its independence". Many of the authors are from Nigeria; almost all of them are from Africa. 
Bernhard Dinkelaker's monograph was a successful book in the series, since it was a European reception of the work of the African theologian Kwame Bediako, important in Africa but not that well known in Europe and North America. The book was well received by reviewer Kees van Dam. + +== Criticism of the Series' Academic Integrity == + +=== Misleading Book Titles === +In 2014, Blaise Okachibe Okpanachi published his Würzburg dissertation on missionary history in modern Nigeria in the first volume of a subseries of ATS devoted to church history. Edmund Hogan stated in his book review published in Catholic Historical Review that: "The problem with this book, however, is its title. Nigerian-Vatican Diplomatic Relations purports to be a study of diplomatic relations between Nigeria and the Vatican for the period 1884–1950. This does not reflect well on the general editorship of the series, for in the 300 pages of this book only twenty-nine pages have any sort of relevance to this title (pp. 191–212, 221–29). Moreover, in these few pages, there are serious problems." + +=== Reliance on Secondary Sources that are not Peer-Reviewed or Vetted === +Chimaka's 2014 monograph (volume 6 in the series) is devoted to Nigerian education and nation-building. +The text on p. 11 about artisans, leaders and wives in Aristotelian and Platonic notions of education is taken from p. 10 in Rao's History of Education. Rao's book was published in New Delhi in 2005, and is not to be found in a single online catalogue for German-speaking Europe (Chimaka submitted his dissertation at Würzburg). Any number of authoritative histories of education would have been available in European libraries. +Much of p. 17 is plagiarized as well, such as the following sentence, taken verbatim from a 2010 source: "Education encompasses both the teaching and learning of knowledge, proper conduct, and technical competency.
It thus focuses on the cultivation of skills, trades or professions, as well as mental, moral and aesthetic development." + +=== Plagiarizing Academic and Non-Academic Sources === +While Nwaigwe's 2013 study of canon law and marriage in Nigeria (volume 4 in the series) references some academic publications, it does not always give credit where necessary. On p. 90, almost all of the third paragraph is copied verbatim from Glenn Olsen's study of the history of marriage. In another section, the passage "civil divorce cannot dissolve the marriage bond, and a divorced person cannot enter into a new valid marriage while the first spouse is still alive" (p. 98) is verbatim from a Liberian source; he does not cite it. If made aware of the source, many would question why such introductory facts are being taken from a Liberian newsletter. The second paragraph on p. 127 is taken verbatim from Gerard Sheehy's practical guide to canon law, without attribution to Sheehy. +The Nigerian doctoral candidate also copies from inappropriate sources. At the end of the second paragraph on p. 317, there is verbatim text overlap with a handbook for marriage preparation issued for parish use by the Archdiocese of Newark. Such texts are not usually considered valid sources for research at the doctoral level. +The author of the 20th monograph in the series, published in 2020, teaches canon law. The book on participative structures in the AMECEA plagiarizes from a book by Makarios Tillyrides, the Eastern Orthodox Archbishop of Nairobi. He uses the following sentence on pp. 46-47 without attributing its author: "It is especially the concern of the local church and entrusted to the guide of the local bishop to coordinate the commitment to evangelization, by gathering the faithful together, confirming them in the Faith through the priests and supporting them in the fulfillment of their respective tasks." It was originally published on p. 389 of Tillyrides's 2011 book.
+ +=== Lack of English Proficiency === + +==== Volume 4, Marriage Preparation in the Igbo Tradition ==== +Nwaigwe's book on marriage includes a section on the Nigerian fattening ceremony and describes the bride "before she finally leaved her home" (p. 232). Another paragraph begins as follows: "Also such theme like sexuality has to be made clear to them that sex is intimately connected with our personal identity" (p. 299). + +==== Volume 8, Religious Education ==== +Sister Chizurum Ugbor's dissertation, Living the Future in Dialogue, was accepted by the University of Leuven and printed as the eighth volume in the series in 2015. The following examples are evidence for insufficient linguistic skills: + +"Political leaders, as well, use youths to get their political powers, which often lead to bloody shedding of human beings." (p. 24) +"Education in Nigeria is as old as the country itself and is considered as a formative process that enables the citizens to meet up their challenging and changing situations to live and fit in a dynamic society." (p. 33) + +=== Erroneous Bibliography === \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/African_Theological_Studies-1.md b/data/en.wikipedia.org/wiki/African_Theological_Studies-1.md new file mode 100644 index 000000000..56a72a5e9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/African_Theological_Studies-1.md @@ -0,0 +1,21 @@ +--- +title: "African Theological Studies" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/African_Theological_Studies" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:48.901352+00:00" +instance: "kb-cron" +--- + +==== Volume 2, Solving the Problems of Poverty ==== +Kwazu's book on African poverty was the second monograph in the series. His bibliography lists "Hillman, K., Wörterbuch der Soziologie, Kröner, Stuttgart, Bd. 410, 1994" on p. 
252, but Hillmann should be spelled with two n's and the dictionary does not contain 410 volumes, as Kwazu's entry suggests, but is volume 410 in the series called "Kröners Taschenausgabe". The 1994 printing is, furthermore, the fourth, revised edition. +The bibliography cites E. Leuninger's "Take (sic) to Catholic Workers' Movement, (KAB) Diocese of Limburg, Germany, 1 April 2001" (p. 255). The talk (not "take") was certainly held in German, and Kwazu's bibliography lists several German titles, but here, Kwazu gives no specific title, making the reference inaccessible. + +==== Volume 13, Women's Emancipation in Africa ==== +Mutume's monograph on women's rights (volume 13 in the series) cites the Vatican's 1976 "Declaration on the Question of Admission of Women to the Ministerial Priesthood" and lists it twice in the bibliography, once under A for Austin Flannery, and once under F. Both are incorrect, since it was signed by Franjo Cardinal Seper, then Prefect of the Congregation for the Doctrine of the Faith. The Dominican priest Austin Flannery was the translator; listing him as its author is confusing. +The bibliography also mistakenly cites Tamale's 2003 article "Out of the Closet" as having been published in 2005. Furthermore, Mutume identifies no page numbers for the Tamale publication. Since it has been republished several times, giving page numbers would be one way of helping readers confirm they have the right version at hand. +Mutume cites from Together, a journal published by World Vision International, which has often been criticized for unethical fundraising techniques and corruption. The bibliography lists the article under the author's first name (Graeme), thus suggesting that Mutume was unable to distinguish it from the last (Irvine). In any case, magazines distributed by World Vision to their donors are unusual sources for doctoral dissertations. +Blind references to publications unseen by Mutume appear regularly in confusing dead ends.
One bibliographical entry cites "Schrijvers, quoted in Stromquist", but gives no title or page numbers for the Schrijvers publication (p. 219). The same page in the bibliography lists two entries under T because they begin with "The". This page of the bibliography ends with the following puzzling entry: "Women; Definitions. UNICEF. Retrieved on 20th February 2012". Without citing the URL or the bibliographical container, this entry is useless. + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Al-Ruhawi-0.md b/data/en.wikipedia.org/wiki/Al-Ruhawi-0.md new file mode 100644 index 000000000..2f80588d4 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Al-Ruhawi-0.md @@ -0,0 +1,31 @@ +--- +title: "Al-Ruhawi" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Al-Ruhawi" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:16.940835+00:00" +instance: "kb-cron" +--- + +Ishāq bin Ali al-Rohawi (Arabic: إسحاق بن علي الرهاوي) was a 9th-century author of the first medical ethics book in Arabic medicine. +His Ethics of the Physician contains the first documented description of a peer review process. It set out the fundamentals of peer review: practising Arab physicians were reviewed by their peers on the medical treatment they provided to their patients. If the reviews were negative and found that a patient had been treated incorrectly, the physician could face a lawsuit. +Al-Ruhawi was probably an Iranian-Nestorian from Al-Ruha, modern-day Şanlıurfa, often known as Urfa. Not much is known about Al-Ruhawi, including his beliefs. The author Levey stated in his book that Al-Ruhawi was a Nestorian Christian while Johann Christoph Bürgel wrote that Al-Ruhawi was Jewish. However, neither author gave evidence to support his claim; each based it on his own interpretation.
+Although these two accounts conflict, there is evidence that Al-Ruhawi held Islamic beliefs. Al-Ruhawi began his book with the words "In the name of Allah," which is a traditional opening for Muslims. Additionally, Al-Ruhawi uses the word "Allah" hundreds of times in his work, a usage associated with Islam. Further evidence of Al-Ruhawi's Islamic beliefs appears elsewhere in his writing, where he alludes to the five pillars of Islam. In the introduction to the first chapter of one of his books, Al-Ruhawi writes the following: "The first thing in which a physician must believe is that all in this world has only one able creator who performs all deeds wilfully. The second article of faith in which a physician must believe is that he have credence in the great Allah with a firm affection, and is devoted to Him with all his reason, soul, and free will. The third faith which a physician must possess is that Allah sent His messengers to mankind to teach them what is good since the mind alone is not sufficient. Thus, without His apostles, it is not enough for man...... In all these matters, the physician must truly believe since all the holy books and ancients affirm them. No believer can deny them." + + +== Works == +Al-Ruhawi's most celebrated work is Adab al-Tabib ("Practical Ethics of the Physician" or "Practical Medical Deontology"), the earliest surviving Arabic work on medical ethics. In his works, Al-Ruhawi regarded physicians as "guardians of souls and bodies". The work was based on Hippocrates and Galen and consisted of twenty chapters on various topics related to medical ethics. +Al-Ruhawi also wrote the following books: + +A compilation of the first four books of the Alexandrian Canons +Introduction to Dialectics for Beginners +On Examination of Physicians +He compiled two works based on Galen.
+ + +== References == + + +== External links == +lib.eshia.ir (in Persian) \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/American_Institute_of_Biological_Sciences-0.md b/data/en.wikipedia.org/wiki/American_Institute_of_Biological_Sciences-0.md new file mode 100644 index 000000000..50b2d6b08 --- /dev/null +++ b/data/en.wikipedia.org/wiki/American_Institute_of_Biological_Sciences-0.md @@ -0,0 +1,85 @@ +--- +title: "American Institute of Biological Sciences" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/American_Institute_of_Biological_Sciences" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:50.014686+00:00" +instance: "kb-cron" +--- + +The American Institute of Biological Sciences (AIBS) is a nonprofit scientific public charitable organization. The organization's mission is to promote the use of science to inform decision-making and advance biology for the benefit of science and society. + + +== Overview == +AIBS serves as a society of societies. It has 98 member organizations and is headquartered in Herndon, VA. Its staff work to achieve its mission by publishing the peer-reviewed journal BioScience, providing peer review and advisory support services for funding organizations, providing professional development for scientists and students, advocating for science policy and educating the public about biology. AIBS works with like-minded organizations, funding agencies, and nonprofit and for-profit entities to promote the use of science to inform decision-making. +AIBS is governed by an esteemed Board of Directors and a Council of representatives of its member organizations. + + +== Background and history == +AIBS was established in 1947 as a part of the National Academy of Sciences. The overarching goal was to unify the individuals and organizations that collectively represent the biological sciences, so that the community could address matters of common concern. 
In the 1950s, AIBS became an independent, member-governed, nonprofit 501(c)(3) public charity scientific organization. In 1962, the National Science Foundation audited the books of AIBS and froze its assets, eventually leading to AIBS having to pay back nearly $500,000 and institute a program of austerity. By the 1970s it appeared to have largely recovered. + + +== Strategic priorities == +AIBS works toward overarching outcomes through three strategic priorities: + +Scientific Peer Advisory and Review Services for research proposals and programs sponsored by funding organizations, including the federal government, state agencies, private research foundations, and other non-government organizations, as well as education of the community about the science of peer review. +Publications and Communications, including reliable reports, analyses, and the peer-reviewed journal BioScience, which is a forum for integrating the life sciences and educating the public about biological sciences. +Community Programs that advance the field and profession of biology while promoting and providing leadership, with a particular emphasis on public policy and advocacy, education and professional development, as well as public awareness of science. + + +== Core activities == +The American Institute of Biological Sciences promotes the use of science to inform decision-making and advances biology for the benefit of science and society through: + +Assessment. Coordinating and facilitating expert merit-based evaluation of research proposals, ongoing research programs, and completed research efforts, as well as analyzing merit-based peer-review activities sponsored by a diverse group of research-funding organizations. +Advocacy. Advancing biology through programs, partnerships, and advocacy with a particular emphasis on public policy that benefits and promotes the interests and diversity of the life sciences community and informed decision-making. +Training.
Providing educational training programs to scientists that enhance their professional skills and opportunities, as well as improve their ability to engage and inform diverse audiences. +Communication. Convening communities, publishing, and sharing scientific information (through a variety of channels) with the breadth of the scientific community, decision-makers, and the public. + + +=== Assessment === + + +==== Scientific Peer Advisory and Review Services (SPARS) ==== +The Scientific Peer Advisory and Review Services (SPARS) department is focused on coordinating, facilitating, and promoting independent, equitable evaluation processes to inform decision-making. +AIBS SPARS partners with a diverse group of organizations that are focused on a wide variety of research and funding efforts. AIBS SPARS facilitates review/advisory support services to ensure thorough and structured assessment processes are performed by vetted, qualified experts. +AIBS SPARS staff are experts in review processes and advisory services, which are often provided for proposed research grants and progress reports, ongoing funded research programs/portfolios, retrospective impact analyses, and for advisory boards and committees. + + +==== Science of Peer Review ==== +The empirical basis upon which peer review rests is limited. To build up this literature base, AIBS SPARS staff perform in-house research and meta-analyses on the peer review process and share results with the scientific community through publications and presentations. + + +=== Advocacy === + + +==== Public Policy Office ==== +The Public Policy Office works to educate policymakers about the importance of investing in biology and advocates for policies that serve the needs of researchers, educators, and other biological science professionals. 
Issues addressed by the office include but are not limited to: funding for the biological sciences; funding for research infrastructure, including scientific collections and field stations; strengthening science education policy; investing in the scientific workforce; and promoting scientific integrity and transparency. + + +=== Training === +With over 2,600 people trained, AIBS provides training programs to scientists that enhance their professional skills and opportunities, as well as improve their ability to engage and inform diverse audiences. Workshops include: + +Employment Acquisition Skills Boot Camp for Scientists: Helping scientists hone and practice the skills needed to secure employment +Writing for Impact and Influence: Helping scientists and graduate students hone their written communication skills to increase the impact and influence of their message +Enabling Interdisciplinary and Team Science: Providing participants with the knowledge and skills required to become productive and effective members of scientific teams +Communication Bootcamp: Helps the science community learn how to engage and inform decision-makers and have the opportunity to become effective and engaged communicators + + +=== Communications === + + +==== Faces of Biology Photo Contest ==== +AIBS conducts a yearly photo contest that helps communicate science through imagery. Photographs entered into the contest must depict a person, such as a scientist, researcher, collections curator, technician, or student, engaging in biological research. The depicted research may occur outside, in a lab, with a natural history collection, on a computer, in a classroom, or elsewhere. + + +==== Publications ==== +AIBS publishes BioScience, a peer-reviewed monthly journal with content written and edited for accessibility to researchers, educators, and students. The journal is heavily cited, with a 2021 Impact Factor of 11.687. 
AIBS also publishes BioScience Talks, a companion podcast to the journal. +BioScience includes articles about research findings and techniques, advances in biology education, professionally written feature articles about new developments in biology, discussions of professional issues, book reviews, news about AIBS, and a policy column (Washington Watch). Roundtables, forums, and viewpoint articles offer the perspectives of opinion leaders and invite further commentary. + + +== References == + + +== See also == +BioScience \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Augmented_Analytics-0.md b/data/en.wikipedia.org/wiki/Augmented_Analytics-0.md new file mode 100644 index 000000000..f72e613b3 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Augmented_Analytics-0.md @@ -0,0 +1,41 @@ +--- +title: "Augmented Analytics" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Augmented_Analytics" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:32.830056+00:00" +instance: "kb-cron" +--- + +Augmented Analytics is an approach of data analytics that employs the use of machine learning and natural language processing to automate analysis processes normally done by a specialist or data scientist. The term was introduced in 2017 by Rita Sallam, Cindi Howson, and Carlie Idoine in a Gartner research paper. +Augmented analytics is based on business intelligence and analytics. In the graph extraction step, data from different sources are investigated. + + +== Defining Augmented Analytics == +Machine Learning – a systematic computing method that uses algorithms to sift through data to identify relationships, trends, and patterns. It is a process that allows algorithms to dynamically learn from data instead of having a set base of programmed rules. +Natural language generation (NLG) – a software capability that takes unstructured data and translates it into plain-English, readable, language. 
+Automating Insights – using machine learning algorithms to automate data analysis processes. +Natural Language Query – enabling users to query data using business terms that are either typed into a search box or spoken. + + +== Data Democratization == +Data democratization is the practice of democratizing data access in order to relieve data congestion and get rid of any sense of data "gatekeepers". This process must be implemented alongside a method for users to make sense of the data. It is used in hopes of speeding up company decision making and uncovering opportunities hidden in data. +There are three aspects to democratising data: + +Data Parameterisation and Characterisation. +Data Decentralisation using an OS of blockchain and DLT technologies, as well as an independently governed secure data exchange to enable trust. +Consent Market-driven Data Monetisation. +When it comes to connecting assets, there are two features that will accelerate the adoption and usage of data democratisation: decentralized identity management and business data object monetization of data ownership. Decentralized identity management enables multiple individuals and organizations to identify, authenticate, and authorize participants and organizations, enabling them to access services, data or systems across multiple networks, organizations, environments, and use cases. It empowers users and enables a personalized, self-service digital onboarding system so that users can self-authenticate without relying on a central administration function to process their information. Simultaneously, decentralized identity management ensures the user is authorized to perform actions subject to the system’s policies based on their attributes (role, department, organization, etc.) and/or physical location. 
+ + +== Use cases == +Agriculture – Farmers collect data on water use, soil temperature, moisture content and crop growth; augmented analytics can be used to make sense of this data and possibly identify insights that the user can then use to make business decisions. +Smart Cities – Many cities across the United States, known as Smart Cities, collect large amounts of data on a daily basis. Augmented analytics can be used to simplify this data in order to increase effectiveness in city management (transportation, natural disasters, etc.). +Analytic Dashboards – Augmented analytics has the ability to take large data sets and create highly interactive and informative analytical dashboards that assist in many organizational decisions. +Augmented Data Discovery – Using an augmented analytics process can assist organizations in automatically finding, visualizing and narrating potentially important data correlations and trends. +Data Preparation – Augmented analytics platforms have the ability to take large amounts of data and organize and "clean" the data in order for it to be usable for future analyses. +Business – Businesses collect large amounts of data daily. Examples of the types of data collected in business operations include sales data, consumer behavior data, and distribution data. An augmented analytics platform provides access to analysis of this data, which could be used in making business decisions. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/BADIR-0.md b/data/en.wikipedia.org/wiki/BADIR-0.md new file mode 100644 index 000000000..2841f6547 --- /dev/null +++ b/data/en.wikipedia.org/wiki/BADIR-0.md @@ -0,0 +1,36 @@ +--- +title: "BADIR" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/BADIR" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:34.122536+00:00" +instance: "kb-cron" +--- + +The BADIR (pronounced /ˈbaːdɪr/) is a structured data science and data analytics process designed to enhance data-driven decision-making within organizations by addressing both analytical output as well as usefulness to management. It was developed by Piyanka Jain and Puneet Sharma and first published in the 2014 book “Behind Every Good Decision”. + + +== Overview == +The BADIR Framework employs a hypothesis-driven approach that involves understanding business objectives and challenges, analysis planning before acquiring and ensuring the quality of relevant data, applying analytics to derive insights and the potential impact on the business challenge, developing actionable recommendations aligned with strategic goals, and implementing decisions while monitoring outcomes and making necessary adjustments. + + +== Key Components of BADIR == +The BADIR Framework consists of interrelated components essential for decision-making. Its main assertion is that if data analytics does not drive business impact, then it is just statistics, not analytics. The acronym in the framework stands for the following five steps: + +B = Business Question: The first step in the framework is to define the real business question. Market trends, customer feedback, competitor actions, etc. are data sources commonly used by businesses. However, these data alone do not help teams to make decisions. 
+A = Analysis Plan: Once the fundamental business question is defined, the next phase involves generating and testing hypotheses to explore potential strategic directions. This phase ensures that decisions are not based on assumptions but on validated data insights. +D = Data Collection: With a clear plan and methodology in place, the next step is to gather the required data. This stage is crucial as the quality of data directly affects the reliability of insights derived from it. +I = Insights Derivation: This phase involves analyzing the data to identify patterns, trends, and outcomes that either support or challenge the proposed hypotheses. +R = Recommendations: The final step is to synthesize the insights gained through data analysis into actionable recommendations that can guide strategic business decisions. + + +== Origin == +Piyanka Jain, a data science expert with experience at Adobe and PayPal, developed the BADIR framework to address the need for more structured data analytics methodologies. Observing that data-driven decision-making was often seen as complex and inaccessible, Jain designed BADIR as a five-step process to streamline and simplify data analysis in business environments. + + +== Impact and Recognition == + According to Jain, the framework's focus on defining the business question early is said to optimize up to 80% of a business leader's workflow by reducing guesswork and concentrating efforts on strategic goals. IMA's Strategic Finance magazine compared the framework to the scientific process in terms of its approach to data analysis, and noted its ability to improve cross-departmental efficiency. The framework has been recommended for a variety of business leaders, from customer service managers to records and information management professionals. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Base_effect-0.md b/data/en.wikipedia.org/wiki/Base_effect-0.md new file mode 100644 index 000000000..94fbb9d02 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Base_effect-0.md @@ -0,0 +1,35 @@ +--- +title: "Base effect" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Base_effect" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:35.330547+00:00" +instance: "kb-cron" +--- + +The base effect is a mathematical effect that originates from the fact that a given percentage of one reference value does not correspond to the same absolute difference as the same percentage of a much larger or smaller reference value. For example, 1% of a GDP of US$1 million is not equal to 1% of a GDP of US$1 billion in absolute terms. +The reference value is commonly called a base year in economics. +A low base effect is the tendency of an absolute change from a low initial amount to be translated into a larger percentage change, while a high base effect would be the tendency of an absolute change from a high initial amount to be translated into a smaller percentage change. +Because of the base effect, percentages in time series analysis can be misleading, in particular when percentages are compounded annually over a period of many years. +A high base effect can mislead because of decreasing percentages even while the underlying absolute difference is increasing over time as much as the measured value or population increases over time. Also, a low base effect can mislead because of increasing percentages even while the absolute difference is decreasing over time as much as the measured value or population decreases over time. +A base effect is closely related to a base year, which serves as a reference point to normalize rates of change, similar to the function of a denominator in comparisons. 
+When inflation is measured with a price index, different formulas produce different results because of the base effect, for example the Paasche versus Laspeyres price indices. Because of the problems that arise from volatility in prices, in particular with headline inflation, core inflation is used as an additional indicator of the development of inflation. +Many different methodologies can be used to correct for the base effect in computations of economic growth, and they are often used in financial market analysis. +A base effect arises in inflation figures because the current price index is compared with its level in the corresponding period of the previous year. If the inflation rate was low in the corresponding period of the previous year, even a small rise in the price index will arithmetically give a high rate of inflation now. On the other hand, if the price index had risen at a high rate in the corresponding period of the previous year and recorded a high inflation rate, a similar absolute increase in the price index now will show a lower inflation rate. +An example of the base effect: + +The Price Index is 100, 150, and 200 in each of three consecutive periods, called 1, 2, and 3, respectively. The increase of 50 from period 1 to period 2 gives a percentage increase of 50%, but the increase from period 2 to period 3, despite being the same as the previous increase in absolute terms, gives a percentage increase of only 33.33%. This is due to the relatively large difference in the bases on which the percentages are calculated (100 vs 150). 
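
The arithmetic in this example can be checked with a short script. This is only an illustrative sketch: the helper name `percent_change` and the variable names are assumptions made for this example and do not come from the article.

```python
def percent_change(previous: float, current: float) -> float:
    """Return the percentage change from `previous` (the base) to `current`."""
    if previous == 0:
        raise ValueError("percentage change is undefined for a zero base")
    return (current - previous) / previous * 100.0


# Price index values for periods 1, 2 and 3, as given in the example.
price_index = [100.0, 150.0, 200.0]

# Pair each period with the next one and compute both kinds of change.
pairs = list(zip(price_index, price_index[1:]))
absolute_changes = [current - previous for previous, current in pairs]
percent_changes = [percent_change(previous, current) for previous, current in pairs]

for period, (delta, pct) in enumerate(zip(absolute_changes, percent_changes), start=2):
    # The absolute change is identical, but the percentage shrinks as the base grows.
    print(f"period {period - 1} -> {period}: +{delta:.0f} points, {pct:.2f}%")
```

The absolute change is 50 points in both steps, while the percentage falls from 50% to about 33.33% because the second step is measured against a base of 150 rather than 100.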
+ + +== See also == +Compound annual growth rate +Path dependence +Volatility tax +Low base effect +Relative change +Percentage#Percentage increase and decrease +Time series analysis + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-0.md b/data/en.wikipedia.org/wiki/Big_data-0.md new file mode 100644 index 000000000..bee91f0a2 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-0.md @@ -0,0 +1,31 @@ +--- +title: "Big data" +chunk: 1/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing software. Data with many entries (rows) offers greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. +Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. The analysis of big data that have only volume, velocity, and variety can pose challenges in sampling. A fourth concept, veracity, which refers to the level of reliability of data, was thus added. Without sufficient investment in expertise to ensure big data veracity, the volume and variety of data can produce costs and risks that exceed an organization's capacity to create and capture value from big data. +Current usage of the term big data tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from big data, and seldom to a particular size of data set. 
"There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem." +Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on". Scientists, business executives, medical practitioners, advertising and governments alike regularly meet difficulties with large datasets in areas including Internet searches, fintech, healthcare analytics, geographic information systems, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research. +The size and number of available data sets have grown rapidly as data is collected by devices such as mobile devices, cheap and numerous information-sensing Internet of things devices, aerial (remote sensing) equipment, software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.17×2^60 bytes) of data are generated. An IDC report predicted that the global data volume would grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. According to IDC, global spending on big data and business analytics (BDA) solutions is estimated to reach $215.7 billion in 2021. Statista reported that the global big data market is forecasted to grow to $103 billion by 2027. In 2011 McKinsey & Company reported that if US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. 
In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data. And users of services enabled by personal-location data could capture $600 billion in consumer surplus. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization. +Relational database management systems and desktop statistical software packages used to visualize data often have difficulty processing and analyzing big data. The processing and analysis of big data may require "massively parallel software running on tens, hundreds, or even thousands of servers". What qualifies as "big data" varies depending on the capabilities of those analyzing it and their tools. Furthermore, expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration." + +== Definition == +The term big data has been in use since the 1990s, with some giving credit to John Mashey for popularizing the term. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data. Big data "size" is a constantly moving target; as of 2012 ranging from a few dozen terabytes to many zettabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale. Variability is often included as an additional quality of big data. 
+A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd's relational model." +In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly considered characteristics of big data appear consistently across all of the analyzed cases. For this reason, other studies identified the redefinition of power dynamics in knowledge discovery as the defining trait. Instead of focusing on the intrinsic characteristics of big data, this alternative perspective pushes forward a relational understanding of the object claiming that what matters is the way in which data is collected, stored, made available and analyzed. + +=== Big data vs. business intelligence === +The growing maturity of the concept more starkly delineates the difference between "big data" and "business intelligence": + +Business intelligence uses applied mathematics tools and descriptive statistics with data with high information density to measure things, detect trends, etc. +Big data uses mathematical analysis, optimization, inductive statistics, and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors. 
+ +== Characteristics == + +Big data can be described by the following characteristics: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-1.md b/data/en.wikipedia.org/wiki/Big_data-1.md new file mode 100644 index 000000000..164c2d2ef --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-1.md @@ -0,0 +1,43 @@ +--- +title: "Big data" +chunk: 2/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +Volume +The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not. The size of big data is usually larger than terabytes and petabytes. +Variety +The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies. Big data technologies evolved with the prime intention to capture, store, and process the semi-structured and unstructured (variety) data generated with high speed (velocity), and huge in size (volume). Later, these tools and technologies were also explored and used for handling structured data, though preferably for storage. Eventually, the processing of structured data was still kept as optional, either using big data or traditional RDBMSs. This helps in analyzing data towards effective usage of the hidden insights exposed from the data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion. +Velocity +The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. 
Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing. +Veracity +The truthfulness or reliability of the data, which refers to the data quality and the data value. Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it. The data quality of captured data can vary greatly, affecting an accurate analysis. +Value +The worth in information that can be achieved by the processing and analysis of large datasets. Value also can be measured by an assessment of the other qualities of big data. Value may also represent the profitability of information that is retrieved from the analysis of big data. +Variability +The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data. +Other possible characteristics of big data are: + +Exhaustive +Whether the entire system (i.e., n = all) is captured or recorded or not. Big data may or may not include all the available data from sources. +Fine-grained and uniquely lexical +Respectively, the proportion of specific data of each element per element collected and if the element and its characteristics are properly indexed or identified. +Relational +If the data collected contains common fields that would enable a conjoining, or meta-analysis, of different data sets. +Extensional +If new fields in each element of the data collected can be added or changed easily. +Scalability +If the size of the big data storage system can expand rapidly. 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-2.md b/data/en.wikipedia.org/wiki/Big_data-2.md new file mode 100644 index 000000000..59b5a904a --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-2.md @@ -0,0 +1,34 @@ +--- +title: "Big data" +chunk: 3/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +== Architecture == +Big data repositories have existed in many forms, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s. For many years, WinterCorp published the largest database report. +Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991 so the definition of big data continuously evolves. Teradata installed the first petabyte class RDBMS based system in 2007. As of 2017, there are a few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added semi structured data types including XML, JSON, and Avro. +In 2000, Seisint Inc. (now LexisNexis Risk Solutions) developed a C++-based distributed platform for data processing and querying known as the HPCC Systems platform. This system automatically partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across multiple commodity servers. Users can write data processing pipelines and queries in a declarative dataflow programming language called ECL. 
Data analysts working in ECL are not required to define data schemas upfront and can rather focus on the particular problem at hand, reshaping data in the best possible manner as they develop the solution. In 2004, LexisNexis acquired Seisint Inc. and their high-speed parallel processing platform and successfully used this platform to integrate the data systems of Choicepoint Inc. when they acquired that company in 2008. In 2011, the HPCC systems platform was open-sourced under the Apache v2.0 License. +CERN and other physics experiments have collected big data sets for many decades, usually analyzed via high-throughput computing rather than the map-reduce architectures usually meant by the current "big data" movement. +In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the "map" step). The results are then gathered and delivered (the "reduce" step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named "Hadoop". Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds in-memory processing and the ability to set up many operations (not just map followed by reducing). +MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled "Big Data Solution Offering". The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records. 
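
The map and reduce steps described for MapReduce can be sketched as a word count in plain Python. This is a minimal single-process illustration rather than Hadoop's or Google's actual API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample chunks are assumptions made for this example, and a real framework would run each map call on a separate node.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce step: combine the values gathered for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, already split into chunks as a cluster would distribute it.
chunks = ["big data big", "data lake", "big lake"]

# Each chunk is mapped independently; here the calls simply run one after another.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))

print(word_counts)
```

Because every chunk is mapped without reference to the others, the map calls could run on different machines. Only the grouping step needs to bring values for the same key together.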
+Studies in 2012 showed that a multiple-layer architecture was one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end-user by using a front-end application server. +The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time. + +== Technologies == +A 2011 McKinsey Global Institute report characterizes the main components and ecosystem of big data as follows: + +Techniques for analyzing data, such as A/B testing, machine learning, and natural language processing +Big data technologies, like business intelligence, cloud computing, and databases +Visualization, such as charts, graphs, and other displays of the data +Multidimensional big data can also be represented as OLAP data cubes or, mathematically, tensors. Array database systems have set out to provide storage and high-level query support on this data type. +Additional technologies being applied to big data include efficient tensor-based computation, such as multilinear subspace learning, massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed cache (e.g., burst buffer and Memcached), distributed databases, cloud and HPC-based infrastructure (applications, storage and computing resources), and the Internet. Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data. 
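
The remark that multidimensional data can be viewed as an OLAP cube or a tensor can be made concrete with a small sketch. Everything here is hypothetical: the sales records, the axis order (region, product, quarter), and the helper `roll_up` are assumptions chosen for illustration, and a sparse dictionary stands in for a real array database.

```python
from collections import defaultdict

# Hypothetical fact records: (region, product, quarter, units sold).
facts = [
    ("north", "widget", "Q1", 10),
    ("north", "widget", "Q2", 15),
    ("north", "gadget", "Q1", 7),
    ("south", "widget", "Q1", 4),
    ("south", "gadget", "Q2", 9),
]

# The cube is a sparse 3-way tensor: each cell is addressed by three coordinates.
cube: dict[tuple[str, str, str], int] = defaultdict(int)
for region, product, quarter, units in facts:
    cube[(region, product, quarter)] += units


def roll_up(cells: dict, keep: tuple[int, ...]) -> dict:
    """Sum out every axis not listed in `keep` (0=region, 1=product, 2=quarter)."""
    result: dict[tuple, int] = defaultdict(int)
    for coords, value in cells.items():
        result[tuple(coords[axis] for axis in keep)] += value
    return dict(result)


by_region = roll_up(cube, keep=(0,))
by_product_quarter = roll_up(cube, keep=(1, 2))

print(by_region)
print(by_product_quarter)
```

Rolling up is the same operation as summing a tensor over one or more of its axes, which is why the two descriptions are interchangeable.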
+Some MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS. +DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called "Ayasdi". +The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid-state drive (SSD) to high-capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures, namely storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost. +Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in direct-attached memory or disk is good; data on memory or disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.
+ +== Applications == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-3.md b/data/en.wikipedia.org/wiki/Big_data-3.md new file mode 100644 index 000000000..09c308b0a --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-3.md @@ -0,0 +1,40 @@ +--- +title: "Big data" +chunk: 4/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +Big data has increased the demand of information management specialists so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP, and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole. +Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007 and predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data, which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e. in the form of video and audio content). 
+While many vendors offer off-the-shelf products for big data, experts promote the development of in-house custom-tailored systems if the company has sufficient technical capabilities. + +=== Government === + +The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation, but comes with flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome. A common government organization that makes use of big data is the National Security Agency (NSA), which constantly monitors Internet activity in search of potential patterns of suspicious or illegal activity that its systems may pick up. +Civil registration and vital statistics (CRVS) collects all certificate statuses from birth to death. CRVS is a source of big data for governments. + +=== International development === +Research on the effective usage of information and communication technologies for development (also known as "ICT4D") suggests that big data technology can make important contributions but also present unique challenges to international development. Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management. Additionally, user-generated data offers new opportunities to give the unheard a voice. However, longstanding challenges for developing regions such as inadequate technological infrastructure and economic and human resource scarcity exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.
The challenge of "big data for development" is currently evolving toward the application of this data through machine learning, known as "artificial intelligence for development" (AI4D). + +==== Benefits ==== +A major practical application of big data for development has been "fighting poverty with data". In 2015, Blumenstock and colleagues predicted poverty and wealth from mobile phone metadata, and in 2016 Jean and colleagues combined satellite imagery and machine learning to predict poverty. Using digital trace data to study the labor market and the digital economy in Latin America, Hilbert and colleagues argue that digital trace data has several benefits such as: + +Thematic coverage: including areas that were previously difficult or impossible to measure +Geographical coverage: providing sizable and comparable data for almost all countries, including many small countries that usually are not included in international inventories +Level of detail: providing fine-grained data with many interrelated variables, and new aspects, like network connections +Timeliness and timeseries: graphs can be produced within days of the data being collected + +==== Challenges ==== +At the same time, working with digital trace data instead of traditional survey data does not eliminate the traditional challenges involved when working in the field of international quantitative analysis. Priorities change, but the basic discussions remain the same. Among the main challenges are: + +Representativeness. While traditional development statistics is mainly concerned with the representativeness of random survey samples, digital trace data is never a random sample. +Generalizability. While observational data always represents its source very well, it only represents what it represents, and nothing more. While it is tempting to generalize from specific observations of one platform to broader settings, this is often very deceptive. +Harmonization.
Digital trace data still requires international harmonization of indicators. It adds the challenge of so-called "data-fusion", the harmonization of different sources. +Data overload. Analysts and institutions are not used to dealing effectively with a large number of variables, which is efficiently done with interactive dashboards. Practitioners still lack a standard workflow that would allow researchers, users and policymakers to deal with data efficiently and effectively. + +=== Finance === +Big data is being rapidly adopted in finance to 1) speed up processing and 2) deliver better, more informed inferences, both internally and to the clients of the financial institutions. The financial applications of big data range from investing decisions and trading (processing volumes of available price data, limit order books, economic data and more, all at the same time) to portfolio management (optimizing over an increasingly large array of financial instruments, potentially selected from different asset classes), risk management (credit rating based on extended information), and any other aspect where the data inputs are large. Big data has also been a typical concept within the field of alternative financial services. Some of the major areas involve crowd-funding platforms and cryptocurrency exchanges.
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-4.md b/data/en.wikipedia.org/wiki/Big_data-4.md new file mode 100644 index 000000000..966ae6dbb --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-4.md @@ -0,0 +1,37 @@ +--- +title: "Big data" +chunk: 5/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +=== Healthcare === +Big data analytics has been used in healthcare in providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms and patient registries. Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the added adoption of mHealth, eHealth and wearable technologies the volume of data will continue to increase. This includes electronic health record data, imaging data, patient generated data, sensor data, and other forms of difficult to process data. There is now an even greater need for such environments to pay greater attention to data and information quality. "Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth." Human inspection at the big data scale is impossible and there is a desperate need in health service for intelligent tools for accuracy and believability control and handling of information missed. While extensive information in healthcare is now electronic, it fits under the big data umbrella as most is unstructured and difficult to use. The use of big data in healthcare has raised significant ethical challenges ranging from risks for individual rights, privacy and autonomy, to transparency and trust. 
+Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research. Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow-up biological research and eventually clinical research. +A related application sub-area within the healthcare field that relies heavily on big data is computer-aided diagnosis in medicine. For instance, for epilepsy monitoring it is customary to create 5 to 10 GB of data daily. Similarly, a single uncompressed image of breast tomosynthesis averages 450 MB of data. These are just a few of the many examples where computer-aided diagnosis uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome in order to reach the next level of performance. + +=== Education === +A McKinsey Global Institute study found a shortage of 1.5 million highly trained data professionals and managers, and a number of universities, including the University of Tennessee and UC Berkeley, have created master's programs to meet this demand. Private boot camps have also developed programs to meet that demand, including paid programs like The Data Incubator or General Assembly. In the specific field of marketing, one of the problems stressed by Wedel and Kannan is that marketing has several subdomains (e.g., advertising, promotions, +product development, branding) that all use different types of data. + +=== Media === +To understand how the media uses big data, it is first necessary to provide some context on the mechanisms used in media processes. It has been suggested by Nick Couldry and Joseph Turow that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals.
The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and is instead tapping into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers, based on information gleaned exclusively through various data-mining activities. + +Targeting of consumers (for advertising by marketers) +Data capture +Data journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics. +Channel 4, the British public-service television broadcaster, is a leader in the field of big data and data analysis. + +=== Insurance === +Health insurance providers are collecting data on social "determinants of health" such as food and TV consumption, marital status, clothing size, and purchasing habits, from which they make predictions on health costs, in order to spot health issues in their clients. It is controversial whether these predictions are currently being used for pricing. + +=== Internet of things (IoT) === + +Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device inter-connectivity. Such mappings have been used by the media industry, companies, and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical, manufacturing and transportation contexts.
+Kevin Ashton, the digital innovation expert who is credited with coining the term, defines the Internet of things in this quote: "If we had computers that knew everything there was to know about things—using data they gathered without any help from us—we would be able to track and count everything, and greatly reduce waste, loss, and cost. We would know when things needed replacing, repairing, or recalling, and whether they were fresh or past their best." + +=== Information technology === +Especially since 2015, big data has come to prominence within business operations as a tool to help employees work more efficiently and streamline the collection and distribution of information technology (IT). The use of big data to resolve IT and data collection issues within an enterprise is called IT operations analytics (ITOA). By applying big data principles into the concepts of machine intelligence and deep computing, IT departments can predict potential issues and prevent them. ITOA businesses offer platforms for systems management that bring data silos together and generate insights from the whole of the system rather than from isolated pockets of data. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-5.md b/data/en.wikipedia.org/wiki/Big_data-5.md new file mode 100644 index 000000000..6b2df2278 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-5.md @@ -0,0 +1,53 @@ +--- +title: "Big data" +chunk: 6/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +=== Survey science === +Compared to survey-based data collection, big data has low cost per data point, applies analysis techniques via machine learning and data mining, and includes diverse and new data sources, e.g., registers, social media, apps, and other forms digital data. 
Since 2018, survey scientists have started to examine how big data and survey science can complement each other to allow researchers and practitioners to improve the production of statistics and its quality. There have been three Big Data Meets Survey Science (BigSurv) conferences in 2018, 2020 (virtual), 2023, and as of 2023 one conference forthcoming in 2025, a special issue in the Social Science Computer Review, a special issue in Journal of the Royal Statistical Society, and a special issue in EP J Data Science, and a book called Big Data Meets Social Sciences edited by Craig Hill and five other Fellows of the American Statistical Association. In 2021, the founding members of BigSurv received the Warren J. Mitofsky Innovators Award from the American Association for Public Opinion Research. + +=== Marketing === +Big data is notable in marketing due to the constant "datafication" of everyday consumers of the internet, in which all forms of data are tracked. The datafication of consumers can be defined as quantifying many of or all human behaviors for the purpose of marketing. The increasingly digital world of rapid datafication makes this idea relevant to marketing because the amount of data constantly grows exponentially. It is predicted to increase from 44 to 163 zettabytes within the span of five years. The size of big data can often be difficult to navigate for marketers. As a result, adopters of big data may find themselves at a disadvantage. Algorithmic findings can be difficult to achieve with such large datasets. Big data in marketing is a highly lucrative tool that can be used for large corporations, its value being as a result of the possibility of predicting significant trends, interests, or statistical outcomes in a consumer-based manner. 
+There are three significant factors in the use of big data in marketing: + +Big data provides customer behavior pattern spotting for marketers, since all human actions are being quantified into readable numbers for marketers to analyze and use for their research. In addition, big data can also be seen as a customized product recommendation tool. Specifically, since big data is effective in analyzing customers' purchase behaviors and browsing patterns, this technology can assist companies in promoting specific personalized products to specific customers. +Real-time market responsiveness is important for marketers because of the ability to shift marketing efforts and adjust to current trends, which is helpful in maintaining relevance to consumers. This can supply corporations with the information necessary to predict the wants and needs of consumers in advance. +Data-driven market ambidexterity is being strongly fueled by big data. New models and algorithms are being developed to make significant predictions about certain economic and social situations. + +== Case studies == + +=== Government === + +==== China ==== +The Integrated Joint Operations Platform (IJOP, 一体化联合作战平台) is used by the government to monitor the population, particularly Uyghurs. Biometrics, including DNA samples, are gathered through a program of free physicals. +By 2020, China plans to give all its citizens a personal "social credit" score based on how they behave. The Social Credit System, now being piloted in a number of Chinese cities, is considered a form of mass surveillance which uses big data analysis technology. + +==== India ==== +Big data analysis was tried out to help the BJP win the 2014 Indian general election. +The Indian government uses numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation. + +==== Israel ==== +Personalized diabetic treatments can be created through GlucoMe's big data solution.
+ +==== United Kingdom ==== +Examples of uses of big data in public services: + +Data on prescription drugs: by connecting origin, location and the time of each prescription, a research unit was able to exemplify and examine the considerable delay between the release of any given drug, and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient. +Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as Meals on Wheels. The connection of data allowed the local authority to avoid any weather-related delay. + +==== United States ==== +In 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments. +Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. +The United States Federal Government owns four of the ten most powerful supercomputers in the world. +The Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes. This has posed security concerns regarding the anonymity of the data collected. + +=== Retail === +Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data—the equivalent of 167 times the information contained in all the books in the US Library of Congress. 
+Windermere Real Estate uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day. +FICO Card Detection System protects accounts worldwide. +Omnichannel retailing leverages online big data to improve offline experiences. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-6.md b/data/en.wikipedia.org/wiki/Big_data-6.md new file mode 100644 index 000000000..b2e688bc5 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-6.md @@ -0,0 +1,38 @@ +--- +title: "Big data" +chunk: 7/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +=== Science === +The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 1,000 collisions of interest per second. +As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments amounts to an annual rate of 25 petabytes before replication (as of 2012). This becomes nearly 200 petabytes after replication. +If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with. The data flow would exceed an annual rate of 150 million petabytes, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world. +The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day.
It is considered one of the most ambitious scientific projects ever undertaken. +When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days. +Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day. DNA sequencers have divided the sequencing cost by 10,000 in the last ten years, a reduction 100 times greater than that predicted by Moore's law. +The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster. +Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any "friction points", or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to scale social experiments that would usually take years, instantly. +23andMe's DNA database contains the genetic information of over 1,000,000 people worldwide. The company explores selling the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes if patients give their consent.
Ahmad Hariri, professor of psychology and neuroscience at Duke University, who has been using 23andMe in his research since 2009, states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists. A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demand for access to the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper. +Computational fluid dynamics (CFD) and hydrodynamic turbulence research generate massive data sets. The Johns Hopkins Turbulence Databases (JHTDB) contain over 350 terabytes of spatiotemporal fields from direct numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using "virtual sensors" with various access modes ranging from direct web-browser queries, through access via Matlab, Python, Fortran and C programs executing on clients' platforms, to cut-out services for downloading raw data. The data have been used in over 150 scientific publications. + +=== Sports === +Big data can be used to improve training and the understanding of competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics. +Future performance of players could be predicted as well. Thus, players' value and salary are determined by data collected throughout the season. +In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency. +Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, using big data, race teams try to predict the time they will finish the race beforehand, based on simulations using data collected over the season.
+ +=== Technology === +As of 2013, eBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising. +Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. +Facebook handles 50 billion photos from its user base. As of June 2017, Facebook reached 2 billion monthly active users. +Google was handling roughly 100 billion searches per month as of August 2012. +A majorly renowned example of the practice of big data is Amazon. Amazon utilizes data analytics to drive its system for recommendations. Amazon has great success from the portion of sales that derives from the "recommended" section with personalized recommended items to shop. + +=== COVID-19 === +During the COVID-19 pandemic, big data was raised as a way to minimise the impact of the disease. Significant applications of big data included minimizing the spread of the virus, case identification and development of medical treatment. +Governments used big data to track infected people to minimise spread. Early adopters included China, Taiwan, South Korea, and Israel. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-7.md b/data/en.wikipedia.org/wiki/Big_data-7.md new file mode 100644 index 000000000..303f59c79 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-7.md @@ -0,0 +1,28 @@ +--- +title: "Big data" +chunk: 8/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +== Research activities == +Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society of Engineering Education. 
Gautam Siwach engaged at Tackling the challenges of Big Data by MIT Computer Science and Artificial Intelligence Laboratory and Amir Esmailpour at the UNH Research Group investigated the key features of big data as the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in an encrypted form at cloud interface by providing the raw definitions and real-time examples within the technology. Moreover, they proposed an approach for identifying the encoding technique to advance towards an expedited search over encrypted text leading to the security enhancements in big data. +In March 2012, The White House announced a national "Big Data Initiative" that consisted of six federal departments and agencies committing more than $200 million to big data research projects. +The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over five years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA, and over a dozen industrial sponsors and uses big data to attack a wide range of problems from predicting traffic congestion to fighting cancer. +The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over five years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the department's supercomputers. +The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. 
The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts. +The European Commission is funding the two-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, their next framework program. +The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets. +At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world. +Computational social sciences – Anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences. Often these APIs are provided for free. Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic products (GDPs) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviors and real-world economic indicators. 
The authors of the study examined Google query logs, computing the ratio of the volume of searches for the coming year (2011) to the volume of searches for the previous year (2009), which they call the "future orientation index". They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP. +Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends. Their analysis of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports, suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets. +Big data sets come with algorithmic challenges that previously did not exist. Hence, some see a need to fundamentally change the way such data is processed. + +=== Sampling big data === +A research question that is asked about big data sets is whether it is necessary to look at the full data to draw certain conclusions about the properties of the data or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. But sampling enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage, and controller data are available at short time intervals. To predict downtime, it may not be necessary to look at all the data; a sample may be sufficient. Big data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data. 
With large sets of data points, marketers are able to create and use more customized segments of consumers for more strategic targeting. + +== Critique == +Critiques of the big data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done. One approach to this criticism is the field of critical data studies. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-8.md b/data/en.wikipedia.org/wiki/Big_data-8.md new file mode 100644 index 000000000..212015c0c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-8.md @@ -0,0 +1,27 @@ +--- +title: "Big data" +chunk: 9/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +=== Critiques of the big data paradigm === +"A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data." In their critique, Snijders, Matzat, and Reips point out that often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson's assertion that big data will spell the end of theory: focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts. Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analyzed, must be complemented by "big judgment", according to an article in the Harvard Business Review. 
+Much in the same line, it has been pointed out that the decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is". Fed by a large amount of data on past experiences, algorithms can predict future development if the future is similar to the past. If the system's dynamics change in the future (if it is not a stationary process), the past can say little about the future. In order to make predictions in changing environments, it would be necessary to have a thorough understanding of the system's dynamics, which requires theory. In response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge". Additionally, it has been suggested that big data approaches be combined with computer simulations, such as agent-based models and complex systems. Agent-based models are getting increasingly better at predicting the outcome of social complexities of even unknown future scenarios through computer simulations that are based on a collection of mutually interdependent algorithms. Finally, the use of multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, has proven useful as an analytic approach that goes well beyond the bi-variate approaches (e.g. contingency tables) typically employed with smaller data sets. +In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis. A new postulate is now accepted in the biosciences: the information provided by the data in huge volumes (omics) without prior hypothesis is complementary and sometimes necessary to conventional approaches based on experimentation. 
In the massive approaches, it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. The search logic is reversed and the limits of induction ("the glory of Science and the scandal of Philosophy", C. D. Broad, 1926) are to be considered. +Privacy advocates are concerned about the threat to privacy represented by increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to conform practice to expectations of privacy. The misuse of big data in several cases by media, companies, and even the government has undermined trust in almost every fundamental institution holding up society. +Barocas and Nissenbaum argue that one way of protecting individual users is to inform them about the types of information being collected, with whom it is shared, under what constraints and for what purposes. + +=== Critiques of the "V" model === +The "V" model of big data has been criticized because it centers on computational scalability and neglects the perceptibility and understandability of information. This led to the framework of cognitive big data, which characterizes big data applications according to: + +Data completeness: understanding of the non-obvious from data +Data correlation, causation, and predictability: causality is not an essential requirement for achieving predictability +Explainability and interpretability: humans desire to understand and accept what they understand, where algorithms do not cope with this +Level of automated decision-making: algorithms that support automated decision making and algorithmic self-learning + +=== Critiques of novelty === +Large data sets have been analyzed by computing machines for well over a century, including the US census analytics performed by IBM's punch-card machines, which computed statistics including means and variances of populations across the whole continent. 
In more recent decades, science experiments such as CERN have produced data on similar scales to current commercial "big data". However, science experiments have tended to analyze their data using specialized custom-built high-performance computing (super-computing) clusters and grids, rather than clouds of cheap commodity computers as in the current commercial wave, implying a difference in both culture and technology stack. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Big_data-9.md b/data/en.wikipedia.org/wiki/Big_data-9.md new file mode 100644 index 000000000..f0cce3214 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Big_data-9.md @@ -0,0 +1,39 @@ +--- +title: "Big data" +chunk: 10/10 +source: "https://en.wikipedia.org/wiki/Big_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:36.506913+00:00" +instance: "kb-cron" +--- + +=== Critiques of big data execution === +Ulf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research. Researcher danah boyd has raised concerns about the use of big data in science neglecting principles such as choosing a representative sample by being too concerned about handling the huge amounts of data. This approach may lead to results that have a bias in one way or another. Integration across heterogeneous data resources—some that might be considered big data and others not—presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science. +In the provocative article "Critical Questions for Big Data", the authors title big data a part of mythology: "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy". 
Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth". Recent developments in the BI domain, such as pro-active reporting, especially target improvements in the usability of big data through automated filtering of non-useful data and correlations. Big structures are full of spurious correlations, either because of non-causal coincidences (law of truly large numbers), the very nature of large-scale randomness (Ramsey theory), or the existence of non-included factors, so the hope of early experimenters to make large databases of numbers "speak for themselves" and revolutionize the scientific method is questioned. Catherine Tucker has pointed to "hype" around big data, writing "By itself, big data is unlikely to be valuable." The article explains: "The many contexts where data is cheap relative to the cost of retaining talent to process it, suggests that processing skills are more important than data itself in creating value for a firm." +Big data analysis is often shallow compared to analysis of smaller data sets. In many big data projects, no large-scale data analysis takes place; the challenge is instead the extract, transform, load part of data pre-processing. +Big data is a buzzword and a "vague term", but at the same time an "obsession" with entrepreneurs, consultants, scientists, and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, Academy Awards and election predictions solely based on Twitter were more often off than on target. +Big data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. 
In particular data sources such as Twitter are not representative of the overall population, and results drawn from such sources may then lead to wrong conclusions. Google Translate—which is based on big data statistical analysis of text—does a good job at translating web pages. However, results from specialized domains may be dramatically skewed. +On the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant. +Ioannidis argued that "most published research findings are false" due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a big amount of scientific data; although not with big data technology), the likelihood of a "significant" result being false grows fast – even more so, when only positive results are published. +Furthermore, big data analytics results are only as good as the model on which they are predicated. In an example, big data took part in attempting to predict the results of the 2016 U.S. presidential election with varying degrees of success. + +=== Critiques of big data policing and surveillance === +Big data has been used in policing and surveillance by institutions like law enforcement and corporations (see: corporate surveillance and surveillance capitalism). Due to the less visible nature of data-based surveillance as compared to traditional methods of policing, objections to big data policing are less likely to arise. 
According to Sarah Brayne's Big Data Surveillance: The Case of Policing, big data policing can reproduce existing societal inequalities in three ways: + +Placing people under increased surveillance by using the justification of a mathematical and therefore unbiased algorithm +Increasing the scope and number of people that are subject to law enforcement tracking and exacerbating existing racial overrepresentation in the criminal justice system +Encouraging members of society to abandon interactions with institutions that would create a digital trace, thus creating obstacles to social inclusion +If these potential problems are not corrected or regulated, the effects of big data policing may continue to shape societal hierarchies. Conscientious usage of big data policing could prevent individual level biases from becoming institutional biases, Brayne also notes. + +== See also == + +== References == + +== Bibliography == + +== Further reading == + +== External links == + Media related to Big data at Wikimedia Commons + The dictionary definition of big data at Wiktionary \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Causal_analysis-0.md b/data/en.wikipedia.org/wiki/Causal_analysis-0.md new file mode 100644 index 000000000..3bc3e5803 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Causal_analysis-0.md @@ -0,0 +1,60 @@ +--- +title: "Causal analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Causal_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:30.564050+00:00" +instance: "kb-cron" +--- + +Causal analysis is the field of experimental design and statistics pertaining to establishing cause and effect. 
Typically it involves establishing four elements: correlation, sequence in time (that is, causes must occur before their proposed effect), a plausible physical or information-theoretical mechanism for an observed effect to follow from a possible cause, and eliminating the possibility of common and alternative ("special") causes. Such analysis usually involves one or more controlled or natural experiments. + + +== Motivation == +Data analysis is primarily concerned with causal questions. For example, did the fertilizer cause the crops to grow? Or, can a given sickness be prevented? Or, why is my friend depressed? The potential outcomes and regression analysis techniques handle such queries when data is collected using designed experiments. Data collected in observational studies require different techniques for causal inference (because, for example, of issues such as confounding). Causal inference techniques used with experimental data require additional assumptions to produce reasonable inferences with observation data. The difficulty of causal inference under such circumstances is often summed up as "correlation does not imply causation". + + +== In philosophy and physics == + +The nature of causality is systematically investigated in several academic disciplines, including philosophy and physics. In academia, there are a significant number of theories on causality; The Oxford Handbook of Causation (Beebee, Hitchcock & Menzies 2009) encompasses 770 pages. Among the more influential theories within philosophy are Aristotle's Four causes and Al-Ghazali's occasionalism. David Hume argued that beliefs about causality are based on experience, and experience similarly based on the assumption that the future models the past, which in turn can only be based on experience – leading to circular logic. In conclusion, he asserted that causality is not based on actual reasoning: only correlation can actually be perceived. 
Immanuel Kant, according to Beebee, Hitchcock & Menzies (2009), held that "a causal principle according to which every event has a cause, or follows according to a causal law, cannot be established through induction as a purely empirical claim, since it would then lack strict universality, or necessity". +Outside the field of philosophy, theories of causation can be identified in classical mechanics, statistical mechanics, quantum mechanics, spacetime theories, biology, social sciences, and law. To establish a correlation as causal within physics, it is normally understood that the cause and the effect must connect through a local mechanism (cf. for instance the concept of impact) or a nonlocal mechanism (cf. the concept of field), in accordance with known laws of nature. From the point of view of thermodynamics, universal properties of causes as compared to effects have been identified through the Second law of thermodynamics, confirming the ancient, medieval and Cartesian view that "the cause is greater than the effect" for the particular case of thermodynamic free energy. This, in turn, is challenged by popular interpretations of the concepts of nonlinear systems and the butterfly effect, in which small events cause large effects due to, respectively, unpredictability and an unlikely triggering of large amounts of potential energy. + + +=== Causality construed from counterfactual states === + +Intuitively, causation seems to require not just a correlation, but a counterfactual dependence. Suppose that a student performed poorly on a test and guesses that the cause was his not studying. To prove this, one thinks of the counterfactual – the same student writing the same test under the same circumstances but having studied the night before. If one could rewind history, and change only one small thing (making the student study for the exam), then causation could be observed (by comparing version 1 to version 2). 
Because one cannot rewind history and replay events after making small controlled changes, causation can only be inferred, never exactly known. This is referred to as the Fundamental Problem of Causal Inference – it is impossible to directly observe causal effects. +A major goal of scientific experiments and statistical methods is to approximate as best possible the counterfactual state of the world. For example, one could run an experiment on identical twins who were known to consistently get the same grades on their tests. One twin is sent to study for six hours while the other is sent to the amusement park. If their test scores suddenly diverged by a large degree, this would be strong evidence that studying (or going to the amusement park) had a causal effect on test scores. In this case, correlation between studying and test scores would almost certainly imply causation. +Well-designed experimental studies replace equality of individuals as in the previous example by equality of groups. The objective is to construct two groups that are similar except for the treatment that the groups receive. This is achieved by selecting subjects from a single population and randomly assigning them to two or more groups. The likelihood of the groups behaving similarly to one another (on average) rises with the number of subjects in each group. If the groups are essentially equivalent except for the treatment they receive, and a difference in the outcome for the groups is observed, then this constitutes evidence that the treatment is responsible for the outcome, or in other words the treatment causes the observed effect. However, an observed effect could also be caused "by chance", for example as a result of random perturbations in the population. Statistical tests exist to quantify the likelihood of erroneously concluding that an observed difference exists when in fact it does not (for example see P-value). 
+ + +== Operational definitions of causality == +Clive Granger created the first operational definition of causality in 1969. Granger made the definition of probabilistic causality proposed by Norbert Wiener operational as a comparison of variances. + + +== Verification by "truth" == +Peter Spirtes, Clark Glymour, and Richard Scheines introduced the idea of explicitly not providing a definition of causality. Spirtes and Glymour introduced the PC algorithm for causal discovery in 1990. Many recent causal discovery algorithms follow the Spirtes-Glymour approach to verification. + + +== Exploratory == + +Exploratory causal analysis (ECA), also known as "data causality" or "causal discovery", is the use of statistical algorithms to infer associations in observed data sets that are potentially causal under strict assumptions. ECA is a type of causal inference distinct from causal modeling and treatment effects in randomized controlled trials. It is exploratory research usually preceding more formal causal research in the same way exploratory data analysis often precedes statistical hypothesis testing in data analysis. + + +== See also == +Causal inference +Causal model +Causality +Causal reasoning + + +== References == + + +=== Bibliography === +Beebee, Helen; Hitchcock, Christopher; Menzies, Peter (2009). The Oxford Handbook of Causation. Oxford University Press. ISBN 978-0-19-162946-4. 
+ + +== External links == +Causality Workbench team tools and data +University of Pittsburgh CCD team tools \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Center_for_Open_Science-0.md b/data/en.wikipedia.org/wiki/Center_for_Open_Science-0.md index f9350d39d..84c3b91b7 100644 --- a/data/en.wikipedia.org/wiki/Center_for_Open_Science-0.md +++ b/data/en.wikipedia.org/wiki/Center_for_Open_Science-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Center_for_Open_Science" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:31:44.226264+00:00" +date_saved: "2026-05-05T09:52:41.470257+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal-0.md b/data/en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal-0.md new file mode 100644 index 000000000..7e4c65bd9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal-0.md @@ -0,0 +1,69 @@ +--- +title: "Chen Zhenyuan fake peer review scandal" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:51.221154+00:00" +instance: "kb-cron" +--- + +The Chen Zhenyuan fake peer review scandal was an academic misconduct case that came to light in July 2014. The UK-based publisher SAGE Publications discovered that Chen Zhenyuan, then an associate professor in the Department of Computer Science at National Pingtung University of Education (now National Pingtung University, Minsheng Campus), was suspected of fabricating 130 fake reviewer accounts in order to manipulate peer review and citation networks to secure the publication of his papers. As a result, SAGE retracted 60 papers authored or co-authored by Chen Zhenyuan from the Journal of Vibration and Control (JVC). +The mass retraction of 60 papers at once was extremely rare in the global academic community. 
The editor-in-chief of the Journal of Vibration and Control subsequently resigned and applied for retirement from his university. Among the retracted papers, five were authored by Taiwan’s then Minister of Education, Chiang Wei-ling, with Chen Zhenyuan listed as a co-author. Amid intense public pressure, Chiang Wei-ling resigned as Minister of Education. + +== Background and timeline == + +=== 2009 === +Chen Zhenyuan joined National Pingtung University of Education as a faculty member and began publishing a large number of papers in the Journal of Vibration and Control (JVC). + +=== February 2011 === +Chen Zhenyuan was promoted to associate professor. + +=== 2013 === +Ali H. Nayfeh, editor-in-chief of JVC, reviewed Chen’s papers and discovered suspicious reviewer accounts. Upon investigation, 130 problematic accounts were identified, all suspected to have been fabricated by Chen Zhenyuan. +In September, SAGE Publications sent a formal letter to National Pingtung University of Education requesting assistance in the investigation. +In October, the Department of Computer Science at the university established a special investigative committee. +In November, Chen Zhenyuan resigned from his associate professor position and officially left the university after the start of the semester in February 2014. +The university concluded its internal investigation without publicly releasing the results or notifying the Ministry of Education. The university stated that Chen’s resignation had hindered the investigation, and that many investigative actions could not be carried out during the summer break. It emphasized that a final report would be completed after the new semester began and that the matter would “not be privately settled.” + +== Public exposure and aftermath == + +=== May 2014 === +After completing his investigation and submitting a report to JVC, editor-in-chief Ali H. 
Nayfeh resigned from his position and applied for retirement from his university, taking responsibility for the incident. + +=== July 8, 2014 === +The blog Retraction Watch first disclosed the incident to the public. + +=== July 9, 2014 === +The Journal of Vibration and Control issued a statement accusing Chen Zhenyuan of exploiting the journal’s online peer review system by registering 130 fake reviewer accounts. SAGE Publications suspected that Chen used this method to pass peer review and consequently retracted 60 papers associated with him. +Major international media outlets, including The New York Times and The Washington Post, reported on the case. Taiwanese media followed shortly thereafter. +Academic Lee Chia-tung criticized the incident as being caused by flaws in JVC’s peer review system. + +=== July 11, 2014 === +Because five of the retracted papers were authored by Chiang Wei-ling, with Chen Zhenyuan listed as a co-author, Chiang held a press conference to clarify the matter. Chiang stated that Chen Zhenyuan’s younger brother, Chen Zhenwu, was his supervised student, and that Chen Zhenyuan’s co-authorship was handled by Chen Zhenwu without Chiang’s knowledge. +Public opinion called for Chiang Wei-ling to be suspended pending investigation. + +=== July 13, 2014 === +Chen Zhenyuan held a press conference, asserting that Chiang Wei-ling was not involved. In a post-conference statement, Chen claimed that there was no plagiarism or data fabrication in the papers themselves. Regarding the fake accounts, he argued that they were created unintentionally due to keyboard input errors during manuscript submission and revision, which caused the information system to automatically generate multiple co-author accounts, and that there was no malicious intent. + +=== July 14, 2014 === +The Executive Yuan of the Republic of China announced that Chiang Wei-ling had resigned as Minister of Education. 
+ +== Government investigation and sanctions == + +=== September 25, 2014 === +The Academic Ethics Committee of Taiwan’s Ministry of Science and Technology (MOST) issued preliminary findings confirming peer review fraud. Proposed sanctions included: + +A 10-year ban on Chen Zhenyuan from applying for MOST funding +A 5-year ban on Chen Zhenwu +Recovery of more than NT$2 million in research grants obtained during the period in question +The case was submitted to Minister Chang San-cheng for final determination. +During the MOST investigation, it was further found that, in addition to the 60 retracted JVC papers, Chen Zhenyuan had published a paper in Nature Hazards whose introduction was copied from Wikipedia. Approximately one-third of the paper consisted of citations from other works, and about 90% of those citations referenced papers authored by Chen Zhenyuan and Chen Zhenwu, raising suspicions of deliberate inflation of citation metrics. +Additionally, Chiang Wei-ling was reported for alleged self-plagiarism in one of the retracted Journal of Vibration and Control papers in which he was listed as a co-author. The paper was suspected of substantially duplicating content from another JVC paper published in 2010 that had not been retracted. Comparative analysis found multiple paragraphs to be nearly identical word-for-word, without proper citation. +On October 16, MOST’s Academic Ethics Committee initially ruled to impose a three-year funding ban on Chiang Wei-ling, pending final approval. + +=== October 26, 2014 === +MOST formally decided to impose a 10-year ban on Chen Zhenyuan from applying for research grants. Sanctions against Chen Zhenwu and Chiang Wei-ling remained undecided at that time. 
+ +=== November 26, 2014 === +The British journal Nature published a feature article titled “The Peer-Review Scam”, reviewing the incident and concluding that the Chen Zhenyuan case, together with the Wen Hyeong-yoon case, exposed security vulnerabilities in automated manuscript submission and review systems such as ScholarOne. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal-1.md b/data/en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal-1.md new file mode 100644 index 000000000..17e527d9c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal-1.md @@ -0,0 +1,29 @@ +--- +title: "Chen Zhenyuan fake peer review scandal" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Chen_Zhenyuan_fake_peer_review_scandal" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:51.221154+00:00" +instance: "kb-cron" +--- + +=== February 6, 2015 === +MOST decided to impose a 10-year ban on Chen Zhenwu and to recover principal investigator fees from three research projects. Chiang Wei-ling received a one-year ban. + +== Subsequent developments == + +=== March 6, 2016 === +The international journal Human Factors and Ergonomics in Manufacturing & Service Industries announced the retraction of three papers published by Chen Zhenwu between 2012 and 2013. These papers listed only Chen Zhenwu and other scholars as authors, without Chen Zhenyuan. Retraction Watch concluded that Chen Zhenwu was not merely an innocent bystander in the case, prompting MOST to reopen its investigation. +The Ministry of Education stated that after the incident, National Kaohsiung University of Science and Technology verified that several papers submitted by Chen Zhenwu for promotion contained problems. The university decided to revoke his professorship, demote him to associate professor, and recover excess salary payments. Chen Zhenwu filed an administrative appeal. 
+After court proceedings, the Ministry of Education initially lost the case, appealed, and the Supreme Administrative Court ultimately ruled to vacate the original judgment, revoke the appeal decision and the original administrative sanction, and ordered the case to be reinvestigated and re-sanctioned accordingly. As a result, Chen Zhenwu currently holds the rank of associate professor. + +== Similar cases == + +=== 2012 === +At Dong-A University in South Korea, agricultural scientist Wen Hyeong-yoon was found to have created 24 fake email accounts to impersonate peer reviewers for his own submissions to the Journal of Medicinal Biology. As a result, 35 papers were retracted. + +=== March 2015 === +BioMed Central retracted 43 papers from its journals, mostly authored by Chinese researchers, after discovering fabricated peer review activities. The Washington Post reported on the incident and explicitly compared it to the Chen Zhenyuan case. + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Clinical_peer_review-0.md b/data/en.wikipedia.org/wiki/Clinical_peer_review-0.md new file mode 100644 index 000000000..5ee1b02aa --- /dev/null +++ b/data/en.wikipedia.org/wiki/Clinical_peer_review-0.md @@ -0,0 +1,24 @@ +--- +title: "Clinical peer review" +chunk: 1/5 +source: "https://en.wikipedia.org/wiki/Clinical_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:52.389008+00:00" +instance: "kb-cron" +--- + +Clinical peer review, also known as medical peer review is the process by which health care professionals, including those in nursing and pharmacy, evaluate each other's clinical performance. A discipline-specific process may be referenced accordingly (e.g., physician peer review, nursing peer review). +Today, clinical peer review is most commonly done in hospitals, but may also occur in other practice settings including surgical centers and large group practices. 
The primary purpose of peer review is to improve the quality and safety of care. Secondarily, it serves to reduce the organization's vicarious malpractice liability and meet regulatory requirements. In the US, these include accreditation, licensure and Medicare participation. Peer review also supports the other processes that healthcare organizations have in place to assure that physicians are competent and practice within the boundaries of professionally accepted norms. + +== Overview == +Clinical peer review should be distinguished from the peer review that medical journals use to evaluate the merits of a scientific manuscript, from the peer review process used to evaluate health care research grant applications, and from the process by which clinical teaching might be evaluated. All these forms of peer review are confounded in the term Medical Peer Review. Moreover, Medical peer review has been used by the American Medical Association (AMA) to refer not only to the process of improving quality and safety in health care organizations, but also to the process by which adverse actions involving clinical privileges or professional society membership may be pursued. In addition, peer review methods are frequently used by state medical boards with respect to licensing decisions and complaint investigation. They are also used by insurance companies with respect to credentialing and utilization management processes. + +=== Medicine === +In US hospital settings, clinical peer review encompasses a wide variety of activities, whose scope varies across institutions. Virtually all programs perform retrospective medical record review (also known as case review) of the quality of care. Most also include Ongoing Professional Practice Evaluation (OPPE) and Focused Professional Practice Evaluation (FPPE) required by the Joint Commission since 2007. Many also include the management of disruptive behavior and physician health programs. 
+Routine clinical peer review activity (performance assessment) is typically organized separately from the credentialing/privileging process (competence assessment), but the results of peer review inform those decisions. From a legal and regulatory perspective, however, the line is blurred. The definition of a peer review body can be broad, including not only individuals but also (for example, in Oregon), "tissue committees, governing bodies or committees including medical staff committees of a [licensed] health care facility...or any other medical group in connection with bona fide medical research, quality assurance, utilization review, credentialing, education, training, supervision or discipline of physicians or other health care providers." +Medical staffs generally rely on generic screens for adverse events to identify cases for peer review, even though that might not be the most efficient or effective method. These are generally applied through administrative data analysis, but referrals for peer review are frequently made by risk managers, nurses and medical staff. The median annual review volume is 1–2% of hospital inpatient admissions. Thus, case review may be the dominant form of adverse event analysis in US hospitals. +Case reviews are typically conducted by individual reviewers, but in nearly 70% of hospitals, most reviews are presented and discussed in a committee prior to final decision-making. Nurses now participate on physician review committees in the majority of programs. This extends the trend of the past decade for adoption of multi-specialty representation in the direction of multi-disciplinary peer review. Some of these committees now routinely assess nursing care during the case review process and may even directly address all improvement opportunities. + +=== Nursing === +The American Nurses Association published the first definition of nursing peer review in 1988. 
It includes the following statements: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Clinical_peer_review-1.md b/data/en.wikipedia.org/wiki/Clinical_peer_review-1.md new file mode 100644 index 000000000..d3a278472 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Clinical_peer_review-1.md @@ -0,0 +1,29 @@ +--- +title: "Clinical peer review" +chunk: 2/5 +source: "https://en.wikipedia.org/wiki/Clinical_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:52.389008+00:00" +instance: "kb-cron" +--- + +The American Nurses Association believes nurses bear primary responsibility and accountability for the quality of nursing care their clients receive. Standards of nursing practice provide a means for measuring the quality of nursing care a client receives. Each nurse is responsible for interpreting and implementing the standards of nursing practice. Likewise, each nurse must participate with other nurses in the decision-making process for evaluating nursing care…Peer review implies that the nursing care delivered by a group of nurses or an individual nurse is evaluated by individuals of the same rank or standing according to established standards of practice…. Peer review is an organized effort whereby practicing professionals review the quality and appropriateness of services ordered or performed by their professional peers. Peer review in nursing is the process by which practicing registered nurses systematically assess, monitor, and make judgments about the quality of nursing care provided by peers as measured against professional standards of practice. +In nursing, as in other professions, peer review applies professional control to practice, and is used by professionals to hold themselves accountable for their services to the public and the organization. Peer review plays a role in affecting the quality of outcomes, fostering practice development, and maintaining professional autonomy. 
American Nurses Association guidelines define peer review as the process by which practitioners of the same rank, profession, or setting critically appraise each other's work performance against established standards. The professionals who are best acquainted with the requirements and demands of the role are the givers and receivers of the review. +Nursing peer review appears to have gained momentum as a result of growth of hospital participation in the American Nurses Association's Magnet Program. Even so, less than 7% of U.S. hospitals have qualified. Magnet hospitals must have at least 2 years of experience with a peer review evaluation process designed to improve practice and performance for all RNs. The literature on nursing peer review is more limited than that which has been developed for physician peer review, and has focused more on annual performance appraisal than on case review. No aggregate studies of clinical nursing peer review practices have been published. Nevertheless, more sophisticated studies have been reported. +Nursing professionals have historically been less likely to participate in or be subject to peer review. This is changing, as is the previously limited extent of the literature on nursing peer review (for example, no aggregate studies of clinical nursing peer review practices had been published as of 2010). +Much of what is referred to as "peer review" in clinical practice is really a form of annual performance evaluation. The annual performance review is a managerial process and does not meet the definition of peer review or deliver its intended outcomes. Other organizational practices may violate the peer review guidelines set forth by the ANA in 1988. The most frequent violation is the performance of direct care peer review by managers. One of the reasons for the confusion is that the ANA guidelines for peer review had been out of print prior to being reprinted and updated in 2011. 
+The early ANA Peer Review Guidelines (1988) and Code of Ethics for Nurses (2001) focus on maintaining standards of nursing practice and upgrading nursing care in three contemporary focus areas for peer review. The three dimensions of peer review are: (a) quality and safety, (b) role actualization, and (c) practice advancement. Each area of contemporary peer review has an organizational, unit, and individual focus. The following six peer review practice principles stem from and are grounded in the 1988 ANA Guidelines and may help to assure an evidence-based and consistent approach to peer review: +1. A peer is someone of the same rank. +2. Peer review is practice focused. +3. Feedback is timely, routine and a continuous expectation. +4. Peer review fosters a continuous learning culture of patient safety and best practice. +5. Feedback is not anonymous. +6. Feedback incorporates the developmental stage of the nurse. +Written and standardized operating procedures for peer review also need development and adoption by the direct care staff and incorporation into the professional practice model (shared governance) bylaws. +Confusion exists about the differences between the Professional Peer Review process, the Annual Performance Review (APR) and the role of peer evaluation. The APR is a managerial human resource function performed with direct reports, and is aimed at defining, aligning and recognizing each employee's contribution to the organization's success. In contrast, professional peer review is conducted within the professional practice model and is not a managerial accountability. Peer evaluation is the process of getting feedback on one's specific role competencies or "at work" behaviors from people that one works within the department and from other departments. "Colleague evaluation" is a more appropriate term than "peer evaluation" as this is not a form of professional peer review. 
+ +=== Pharmacy === +There is limited published information about peer review among pharmacists. + +== History == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Clinical_peer_review-2.md b/data/en.wikipedia.org/wiki/Clinical_peer_review-2.md new file mode 100644 index 000000000..121a9a05d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Clinical_peer_review-2.md @@ -0,0 +1,22 @@ +--- +title: "Clinical peer review" +chunk: 3/5 +source: "https://en.wikipedia.org/wiki/Clinical_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:52.389008+00:00" +instance: "kb-cron" +--- + +The first documented description of a peer review process is found in the Ethics of the Physician written by Ishap bin Ali al-Rahawi (854–931) of al-Raha, Syria. His work, as well as later Arabic medical manuals, states that a visiting physician must always make duplicate notes of a patient's condition on every visit. When the patient was cured or had died, the notes of the physician were examined by a local medical council of other physicians, who would review the practicing physician's notes to decide whether his or her performance met the required standards of medical care. If their reviews were negative, the practicing physician could face a lawsuit from a maltreated patient. Such practices are known to have continued into the 11th century. +In the 1900s, peer review methods evolved in relation to the pioneering work of Codman's End Result System and Ponton's concept of Medical Audit. Lembcke, himself a major contributor to audit methodology, in reviewing this history, notes the pre-emptive influence of hospital standardization promoted by the American College of Surgeons (ACS) following WWI. The Joint Commission (on Accreditation of Hospitals) followed the ACS in this role from 1952. 
Medicare legislation, enacted in 1965, boosted the stature and influence of the Joint Commission because the conditions for hospital participation required a credible medical care review program and the regulations stipulated that Joint Commission accreditation would guarantee payment eligibility. What was once a sporadic process became hardwired in most hospitals following the medical audit model. The widespread creation of new programs was hampered, however, by limitations in the available process models, tools, training and implementation support. +Medical audit is a focused study of the process and/or outcomes of care for a specified patient cohort using pre-defined criteria. Audits are typically organized around a diagnosis, procedure or clinical situation. It remains the predominant mode of peer review in Europe and other countries. +In the US, however, the lack of perceived effectiveness of medical audit led to revisions of Joint Commission standards in 1980. Those modified standards dispensed with the audit requirement and called for an organized system of Quality Assurance (QA). About the same time, hospitals and physicians were facing escalating malpractice insurance costs. In response to these combined pressures, they began to adopt "generic screens" for potential substandard care. These screens were originally developed to evaluate the feasibility of a no-fault medical malpractice insurance plan and were never validated as a tool to improve quality of care. Despite warnings from the developers, their use became widespread. In the process, a QA model for peer review evolved with a narrow focus on the question of whether or not the standard of care had been met. It has persisted despite the many criticisms of its methods and effectiveness. Today, its methods are increasingly recognized to be outdated and incongruent with the Quality Improvement (QI) principles that have been increasingly embraced by healthcare organizations. 
+There is good evidence that contemporary peer review processes can be further improved. The American College of Obstetricians and Gynecologists has offered a Voluntary Review of Quality of Care Program for more than 2 decades. Perceived issues with the adequacy of peer review were an explicit reason for requesting this service by 15% of participating hospitals, yet recommendations for improved peer review process were made to 60%. A 2007 study of peer review in US hospitals found wide variation in practice. The more effective programs had more features consistent with quality improvement principles. There were substantial opportunities for program improvement. The implication was that a new QI model for peer review seems to be evolving. +While it is premature to judge the potential effectiveness of this model, a 2009 study confirmed these findings in a separate sampling of hospitals. It also showed that important differences among programs predict a meaningful portion of the variation on 32 objective measures of patient care quality and safety. These findings were extended by cohort follow-up studies conducted in 2011 and 2015–16. +The 2015–16 study refined the QI model, identifying 20 features that distinguish the most effective programs. These include, among other factors: aiming first and foremost at improving quality, standardizing the review process, maintaining high quality of case review, promoting self-reporting of adverse events, near misses and hazardous conditions, identifying opportunities for improvement in the review process (as opposed to casting blame), providing timely clinical performance feedback, recognizing clinical excellence, and establishing effective program governance. The study treated these features as additional multivariate predictors of the impact of clinical peer review on quality and safety, medical staff perceptions of the program, and clinician engagement in quality and safety initiatives. 
The online supplement to the report includes a program self assessment tool which is also available as a free online utility. Despite a persistently high annual rate of major program change, about two-thirds of programs still have significant opportunity for improvement. It is argued that the outmoded QA model perpetuates a culture of blame that is toxic to efforts to advance quality and high reliability among both physicians and nurses. + +== Legal and regulatory environment == + +=== United States === +In the US, peer review activity is generally protected under state statutes. The protection may include confidentiality of the review process and protection to reviewers and institutions for good faith efforts to improve quality and safety through review activity. Such statutes may also specify whether or not the physician conducting the review must be in active practice. The nature of that protection varies widely. For example, Texas is generally considered to have fairly robust protections, whereas Florida protections were undermined by a constitutional amendment that exposed peer review data to discovery. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Clinical_peer_review-3.md b/data/en.wikipedia.org/wiki/Clinical_peer_review-3.md new file mode 100644 index 000000000..620f2a7f7 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Clinical_peer_review-3.md @@ -0,0 +1,43 @@ +--- +title: "Clinical peer review" +chunk: 4/5 +source: "https://en.wikipedia.org/wiki/Clinical_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:52.389008+00:00" +instance: "kb-cron" +--- + +==== Health Care Quality Improvement Act ==== +US federal law generally trumps state law. The federal Health Care Quality Improvement Act ("HCQIA"), 42 U.S.C. § 11112, enacted in 1986, sets standards that professional review actions must meet in order to receive protection under the Act. 
It requires that the action be taken in the reasonable belief that it will advance healthcare quality based on facts obtained through reasonable efforts with due process and fairness to the involved physician. When peer review leads to an action to limit or revoke clinical privileges, the physician is entitled to both a fair hearing and the right of appeal. +Congress explicitly stated the rationale for this legislation as follows: + + + (1) The increasing occurrence of medical malpractice and the need to improve the quality of medical care +have become nationwide problems that warrant greater efforts than those that can be undertaken by +any individual State. +(2) There is a national need to restrict the ability of incompetent physicians to move from State to State +without disclosure or discovery of the physician's previous damaging or incompetent performance. +(3) This nationwide problem can be remedied through effective professional peer review. +(4) The threat of private money damage liability under Federal laws, including treble damage liability under +Federal antitrust law, unreasonably discourages physicians from participating in effective professional +peer review. +(5) There is an overriding national need to provide incentive and protection for physicians engaging in +effective professional peer review. + +From the time of the HCQIA, there has been good alignment between regulatory and accrediting bodies with respect to due process requirements for physician disciplinary actions. These formalities apply primarily to questions of competence (credentialing and privileging) rather than performance (routine clinical peer review). It would be most unusual to find a hospital whose medical staff bylaws did not conform. 
+ +==== National Practitioner Data Bank ==== +HCQIA enabled the creation of a National Practitioner Data Bank and required hospitals, state medical boards and other health care entities that engage in formal peer review activities to report all disciplinary actions that affect clinical privileges for more than 30 days. This includes incidents in which a provider voluntarily resigns privileges while under investigation. An entity that fails to report as required may lose HCQIA protections for three years. +The HCQIA (§ 11135) requires hospitals to query the NPDB in their initial credentialing and biennial provider re-credentialing processes. Structurally, this process fulfills the congressional intention of restricting movement of incompetent physicians. Disciplinary actions may be a red flag for issues of global incompetence, but the problem may be focal, not global. Thus, the NPDB has been criticized for the unintended consequence of an adverse economic impact on reported providers, regardless of the magnitude of the issue. Even so, gross under-reporting of adverse actions remains an issue. + +==== Patient Safety and Quality Improvement Act ==== +The Patient Safety and Quality Improvement Act of 2005 ("Patient Safety Act"), Public Law 109–41, 42 U.S.C. 299b-21 to 299b-26, amended title IX of the Public Health Service Act to create a general framework to support and protect voluntary initiatives to improve quality and patient safety in all healthcare settings through reporting to Patient Safety Organizations (PSO). This was intended to include peer review. The final rule promulgated by the Agency for Healthcare Research and Quality in 2008 at 42 CFR Part 3 also includes protections against reprisals for good-faith reporters of adverse events, near misses and hazardous conditions. 
Several Florida health systems subsequently formed PSOs in expectation of using federal statutory protections to maintain the confidentiality of peer review activity that would have been exposed under Amendment 7. The subsequent legal challenges to this strategy go beyond the scope of this article. + +== External peer review == +In the US, following enactment of the HCQIA, executives from various national medical associations and health care organizations formed the non-profit American Medical Foundation for Peer Review and Education to provide independent assessment of medical care. +A 2007 study showed that the vast majority of physician peer review is done "in house": 87% of US hospitals send less than 1% of their peer review cases to external agencies. The external review process is generally reserved for cases requiring special expertise for evaluation or for situations in which the independent opinion of an outside reviewer would be helpful. The process is significantly more costly than in-house review, since the majority of hospital review is done as a voluntary contribution of the medical staff. +Mandated external peer review has not played an enduring role in the US, but was tested back in the 70s. A 1972 amendment to the Social Security Act established Professional Standards Review Organizations (PSRO) with a view to controlling escalating Medicare costs through physician-organized review. The PSRO model was not considered to be effective and was replaced in 1982 by a further act of Congress which established Utilization and Quality Control Peer Review Organizations (PROs). This model too was fraught with limitations. Studies of its methods called into question its reliability and validity for peer review. A survey of Iowa state medical society members in the early 90s regarding perceptions of the PRO program illustrated the potential harm of a poorly designed program. 
Furthermore, the Institute of Medicine issued a report identifying the system of care as the root cause of many instances of poor quality. As a result, in the mid-1990s, the PROs changed their focus and methods and began to de-emphasize their role as agents of external peer review. The change was completed by 2002, when they were renamed Quality Improvement Organizations. +In contrast, external peer review has been used by German hospitals to lower their standardized mortality rate. + +== Abuse == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Clinical_peer_review-4.md b/data/en.wikipedia.org/wiki/Clinical_peer_review-4.md new file mode 100644 index 000000000..8d79b052b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Clinical_peer_review-4.md @@ -0,0 +1,32 @@ +--- +title: "Clinical peer review" +chunk: 5/5 +source: "https://en.wikipedia.org/wiki/Clinical_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:52.389008+00:00" +instance: "kb-cron" +--- + +Sham peer review is a name given to the abuse of a medical peer review process to attack a doctor for personal or other non-medical reasons. State medical boards have withheld medical records from court to frame innocent physicians as negligent. Another type of review similar to sham peer review is "incompetent peer review," in which the reviewers are unable to accurately assess the quality of care provided by their colleagues. +Controversy exists over whether medical peer review has been used as a competitive weapon in turf wars among physicians, hospitals, HMOs, and other entities and whether it is used in retaliation for whistleblowing. Many medical staff laws specify guidelines for the timeliness of peer review, in compliance with JCAHO standards, but state medical boards are not bound by such timely peer review and occasionally litigate cases for more than five years. 
Abuse is also referred to as "malicious peer review" by those who consider it endemic, and they allege that the creation of the National Practitioner Data Bank under the 1986 Healthcare Quality Improvement Act (HCQIA) facilitates such abuse, creating a 'third-rail' or a 'first-strike' mentality by granting significant immunity from liability to doctors and others who participate in peer reviews. +The American Medical Association conducted an investigation of medical peer review in 2007 and concluded that while it is easy to allege misconduct, proven cases of malicious peer review are rare. Parenthetically, it is difficult to prove wrongdoing on the part of a review committee that can use its clinical and administrative privileges to conceal exculpatory evidence. +The California legislature framed its statutes so as to allow that a peer review can be found in court to have been improper due to bad faith or malice, in which case the peer reviewers' immunities from civil liability "fall by the wayside". +Dishonesty by healthcare institutions is well-described in the literature and there is no incentive for those that lie to the public about patient care to be honest with a peer review committee. +Incidents of alleged sham peer review are numerous and include cases such as Khajavi v. Feather River Anesthesiology Medical Group, Mileikowsky v. Tenet, and Roland Chalifoux. +Defenders of the Health Care Quality Improvement Act state that the National Practitioner Data Bank protects patients by helping prevent errant physicians who have lost their privileges in one state from traveling to practice in another state. Physicians who allege they have been affected by sham peer review are also less able to find work when they move to another state, as Roland Chalifoux did. 
Moreover, neither opponents nor supporters of the NPDB can be completely satisfied, as Chalifoux's case shows that just as physicians who were unjustly accused may be deprived of work in this way, those who have erred might still find work in other states. + +== See also == +Subpoena duces tecum +Utilization management + +== References == + +== Further reading == +NPDB Guidebook: The NPDB Guidebook (last accessed 2/11/19) +Steve Twedt (2003-10-26). "The Cost of Courage: How the tables turn on doctors". Post-Gazette. PG Publishing. +Magazine Staff (2006-08-01). "All Is NOT Calm on the Hospital-Medical Staff Front". Southern California Physician. LACMA Services Inc. Archived from the original on 2006-09-02. +Charles Bond (2005-11-15). "Editorial in Response to "What Is Sham Peer Review?"". Medscape General Medicine. 7 (4): 48. +Goldstein H. (2006). "Appraising the performance of performance appraisals". IEEE Spectrum. 38 (11): 61–63. doi:10.1109/6.963247. +Philip L. Merkel (Winter 2004). "Physicians Policing Physicians: the Development of Medical Staff Peer Review Law at California Hospitals" (PDF). University of San Francisco Law Review. 38 U.S.F. L. Rev. 301. +Bryan G. Hall (Fall 2003). "The Health Care Quality Improvement Act of 1986 and Physician Peer Reviews: Success or Failure?" (PDF). University of South Dakota Elder Law Forum. 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Code_review-0.md b/data/en.wikipedia.org/wiki/Code_review-0.md new file mode 100644 index 000000000..73a96b642 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Code_review-0.md @@ -0,0 +1,74 @@ +--- +title: "Code review" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Code_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:53.547978+00:00" +instance: "kb-cron" +--- + +Code review (sometimes referred to as peer review) is a software quality assurance activity in which one or more people examine the source code of a computer program, either after implementation or during the development process. The persons performing the checking, excluding the author, are called "reviewers". At least one reviewer must not be the code's author. +Code review differs from related software quality assurance techniques like static code analysis, self-checks, testing, and pair programming. Static analysis relies primarily on automated tools, self-checks involve only the author, testing requires code execution, and pair programming is performed continuously during development rather than as a separate step. 
+ + +== Goal == +Although direct discovery of quality problems is often the main goal, code reviews are usually performed to reach a combination of goals: + +Improving code quality – Improve internal code quality and maintainability through better readability, uniformity, and understandability +Detecting defects – Improve quality regarding external aspects, especially correctness, but also find issues such as performance problems, security vulnerabilities, and injected malware +Learning/Knowledge transfer – Sharing codebase knowledge, solution approaches, and quality expectations, both to the reviewers and the author +Increase sense of mutual responsibility – Increase a sense of collective code ownership and solidarity +Finding better solutions – Generate ideas for new and better solutions and ideas beyond the specific code at hand +Complying with QA guidelines, ISO/IEC standards – Code reviews are mandatory in some contexts, such as air traffic software and safety-critical software + + +== Review types == +Several variations of code review processes exist, with additional types specified in IEEE 1028. + +Management reviews +Technical reviews +Inspections +Walk-throughs +Audits + + +=== Inspection (formal) === +The first code review process that was studied and described in detail was called "Inspection" by its inventor, Michael Fagan. Fagan inspection is a formal process that involves a careful and detailed execution with multiple participants and phases. In formal code reviews, software developers attend a series of meetings to examine code line by line, often using printed copies. Research has shown formal inspections to be extremely thorough and highly effective at identifying defects. 
+ + +=== Regular change-based code review (Walk-throughs) === +Software development teams typically adopt a more lightweight review process in which the scope of each review relates to changes to the codebase corresponding to a ticket, user story, commit, or some other unit of work. Furthermore, rules or conventions such as mandatory review of all tickets, commonly as part of a pull request, integrate the review task into the development workflow instead of requiring each review to be planned explicitly. Such a process is called "regular, change-based code review". There are many variations of this basic process. +A 2017 survey of 240 development teams found that 90% of teams using code review followed a change-based process, with 60% specifically using regular change-based review. +Major software corporations known to use change-based code review include Microsoft, +Google, +and Facebook. + + +== Efficiency and effectiveness == +Ongoing research by Capers Jones analyzing over 12,000 software development projects found formal inspections had a latent defect discovery rate of 60-65%, while informal inspections detected fewer than 50% of defects. The latent defect discovery rate for most forms of testing is about 30%. A code review case study published in the book Best Kept Secrets of Peer Code Review contradicted the Capers Jones study, finding that lightweight reviews can uncover as many bugs as formal reviews while being faster and less costly. +Studies indicate that up to 75% of code review comments affect software evolvability and maintainability rather than functionality, suggesting that code reviews are an excellent tool for software companies with long product or system life cycles. By contrast, fewer than 15% of issues discussed in code reviews relate directly to bugs. + + +=== Guidelines === +Research indicates review effectiveness correlates with review speed. Optimal code review rates range from 200 to 400 lines of code per hour. 
Inspecting and reviewing more than a few hundred lines of code per hour for critical software (such as safety critical embedded software) may be too fast to find errors. + + +=== Supporting tools === +Static code analysis tools assist reviewers by automatically checking source code for known vulnerabilities and defect patterns, particularly for large chunks of code. A 2012 study by VDC Research reports that 17.6% of the embedded software engineers surveyed currently use automated tools to support peer code review and 23.7% plan to use them within two years. + + +== See also == +Committer +Software review +Software quality +Best coding practices +List of software development philosophies + + +== External links == +Five Code Review Antipatterns +Java Magazine, Best of 2020 + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-0.md b/data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-0.md new file mode 100644 index 000000000..e1f307fab --- /dev/null +++ b/data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-0.md @@ -0,0 +1,159 @@ +--- +title: "Collocation (remote sensing)" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Collocation_(remote_sensing)" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:37.734675+00:00" +instance: "kb-cron" +--- + +Collocation is a procedure used in remote sensing +to match measurements from two or more different instruments. +This is done for two main reasons: +for validation purposes when comparing measurements of the same variable, +and to relate measurements of two different variables +either for performing retrievals or for prediction. +In the second case the data is later fed into some type of statistical +inverse method +such as an artificial neural network, statistical classification algorithm, +kernel estimator or a linear least squares. 
+In principle, most collocation problems can be solved by a nearest neighbor search, +but in practice there are many other considerations involved and the best method is +highly specific to the particular matching of instruments. +Here we deal with some of the most important considerations along with specific examples. +There are at least two main considerations when performing collocations. +The first is the sampling pattern of the instrument. +Measurements may be dense and regular, such as those from a cross-track +scanning satellite instrument. In this case, some form of interpolation +may be appropriate. On the other hand, the measurements may be +sparse, such as a one-off field campaign designed for some +particular validation exercise. +The second consideration is the instrument footprint, which +can range from something approaching a point measurement +such as that of a radiosonde, or it might be several +kilometers in diameter such as that of a satellite-mounted, +microwave radiometer. In the latter case, it is appropriate +to take into account the instrument antenna pattern when +making comparisons with another instrument having both a smaller +footprint and a denser sampling, that is, several measurements +from the one instrument will fit into the footprint of the other. +Just as the instrument has a spatial footprint, it will also have +a temporal footprint, often called the integration time. +While the integration time is usually less than a second, +which for meteorological applications is essentially instantaneous, +there are many instances where some form of time averaging can considerably +ease the collocation process. +The collocations will need to be screened based on both the time +and length scales of the phenomenon of interest. +This will further facilitate the collocation process since +remote sensing and other measurement data is almost always +binned in some way. 
+Certain atmospheric phenomena such as clouds or convection are quite transient +so that we need not consider collocations with a time error of more than an hour or so. +Sea ice, on the other hand, moves and evolves quite slowly, so that +measurements separated by as much as a day or more might still be useful. + +== Satellites == + +The satellites that most concern us are those with a low-Earth, polar orbit since geostationary satellites view the same point throughout their lifetime. +The diagram shows measurements from AMSU-B +instruments mounted on three satellites over a period of 12 hours. +This illustrates both the orbit path and the scan pattern which runs crosswise. +Since the orbit of a satellite is deterministic, +barring orbit maneuvers, we can predict the location of the +satellite at a given time and, by extension, the location of +the measurement pixels. +In theory, collocations can be performed by inverting the +determining equations starting from the desired time period. +In practice, partially processed data (usually referred to as +level 1b, 1c or level 2) contain the coordinates of each of +the measurement pixels and +it is common to simply feed these coordinates to a nearest neighbor search. +As mentioned previously, the satellite data is always binned +in some manner. +At minimum, the data will be arranged in +swaths extending from pole to pole. +The swaths will be labelled by time period and their +approximate location will be known. + +== Radiosondes == + +Radiosondes are particularly important for collocation studies +because they measure atmospheric variables more accurately and more +directly than satellite or other remote-sensing instruments. +In addition, radiosonde samples are effectively instantaneous point measurements. +One issue with radiosondes carried aloft by weather balloons is +balloon drift. One approach +handles this by averaging all the satellite pixels within a 50 km radius +of the balloon launch. 
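The radius-based averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular study: the function names, the `(lat, lon, value)` pixel layout and the spherical-Earth radius are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def average_within_radius(launch_lat, launch_lon, pixels, radius_km=50.0):
    """Mean value of all pixels within radius_km of the launch site.

    pixels is an iterable of (lat, lon, value) tuples (hypothetical layout).
    Returns None when no pixel falls inside the radius.
    """
    selected = [
        value
        for lat, lon, value in pixels
        if haversine_km(launch_lat, launch_lon, lat, lon) <= radius_km
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)
```

A brute-force scan like this is adequate for a single launch site; for many sites or very large swaths, a spatial index would normally replace the linear search.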
+ +If high-resolution sonde data, which normally has a constant +sampling rate or includes the measurement time, is used, +then the lateral motion can be traced from the wind data. +Even with low-resolution data, the motion can still +be approximated by assuming a constant ascent rate. +Except for a short stretch towards the end, +the linear ascent can be clearly seen in the figure above. +We can show that the ascent rate of a balloon is given +by the following equation + +$$v = \sqrt{\frac{g k h \left(1 - R_a / R_s\right)}{c_D}}$$ + +where g is gravitational acceleration, +k relates the height, h, and surface area, A, +of the balloon to its volume: V = khA; +Rs is the equivalent "gas constant" of the balloon, +Ra is the gas constant of the air +and cD is the drag coefficient of the balloon. +Substituting some sensible values for each of the constants, +k = 1 (the balloon is a perfect cylinder), h = 2 m, cD = 1 +and Rs equal to the gas constant of helium, +returns an ascent rate of 4.1 m/s. Compare this with the +values shown in the histogram which compiles all of the +radiosonde launches from the Polarstern research vessel +over a period of eleven years between 1992 and 2003. 
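As a quick check of the quoted 4.1 m/s, the ascent-rate formula can be evaluated directly. The sketch below assumes g = 9.81 m/s², a specific gas constant of 287 J/(kg·K) for dry air, and 2077 J/(kg·K) for helium as the lifting gas; these numerical values and the function name are assumptions for illustration and are not given in the text.

```python
import math


def balloon_ascent_rate(g, k, h, r_air, r_gas, c_d):
    """Ascent rate v = sqrt(g*k*h*(1 - R_a/R_s) / c_D), in m/s for SI inputs."""
    return math.sqrt(g * k * h * (1.0 - r_air / r_gas) / c_d)


G = 9.81           # gravitational acceleration, m/s^2 (assumed value)
R_AIR = 287.0      # specific gas constant of dry air, J/(kg K) (assumed value)
R_HELIUM = 2077.0  # specific gas constant of helium, J/(kg K) (assumed value)

# k = 1 (cylindrical balloon), h = 2 m, c_D = 1, as in the text
v = balloon_ascent_rate(G, k=1.0, h=2.0, r_air=R_AIR, r_gas=R_HELIUM, c_d=1.0)
print(f"ascent rate: {v:.1f} m/s")  # prints "ascent rate: 4.1 m/s"
```

Because the rate depends only on the square root of the balloon height and the gas-constant ratio, it is fairly insensitive to modest changes in the assumed values.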
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-1.md b/data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-1.md new file mode 100644 index 000000000..b66ebd1aa --- /dev/null +++ b/data/en.wikipedia.org/wiki/Collocation_(remote_sensing)-1.md @@ -0,0 +1,85 @@ +--- +title: "Collocation (remote sensing)" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Collocation_(remote_sensing)" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:37.734675+00:00" +instance: "kb-cron" +--- + +== Interpolation == +For gridded data such as assimilation or reanalysis data, +interpolation is likely the most appropriate method for performing any type of comparison. +A specific point in both physical position and time is easy to locate +within the grid and interpolation performed between the nearest neighbors. +Linear interpolation (bilinear, trilinear etc.) is the most common, +though cubic is used as well but is probably not worth the extra computational overhead. +If the variable of interest has a relatively smooth rate of change (temperature is a good example of this because it +has a diffusion mechanism, radiative transfer, not available to other atmospheric variables), +then interpolation can eliminate much of the error associated with collocation. +Interpolation may also be appropriate for many types of satellite instruments, +for instance a cross-track scanning instrument like Landsat. +In data derived from the Advanced Microwave Sounding Unit (AMSU) are +interpolated (although not for the purposes of collocation) using a slight variation +of trilinear interpolation. +Since measurements within a single scan track are laid out in an approximately rectangular +grid, bilinear interpolation can be performed. +By searching for the nearest overlapping scan track both forwards and backwards in time, +the spatial interpolates can then be interpolated in time. 
+This technique works better with derived quantities rather than raw brightness temperatures since +the scan angle will already have been accounted for. +For instruments with a more irregular sampling pattern, such as the Advanced Microwave +Scanning Radiometer-EOS (AMSR-E) instrument which has a circular scanning pattern, +we need a more general form of interpolation such as kernel estimation. +A method commonly used for this particular instrument, as well as SSM/I, +is a simple daily average within regularly gridded, spatial bins +. + +== Trajectories == +To collocate measurements of a medium- to long-lived atmospheric tracer with a second instrument, running trajectories can considerably improve the accuracy. It also simplifies the analysis somewhat: a trajectory is run both forwards and backwards from the measurement location and between the desired time window. Note that the acceptable time window has now become longer because the error from transport induced changes in the tracer is removed: the tracer lifetime would be a good window to use. Since the trajectories provide a location for every point in time within the time window, there is no need to check multiple measurements from the second instrument. Every time within the trajectory is checked for the distance criterion but within a very narrow window. Alternatively, the exact times of the measurements for the second instrument are interpolated within the trajectory. Only the smallest distance error below the threshold is used and the distance criterion can be made smaller as a consequence. + +== Example: Pol-Ice Campaign == + +Collocations of sea ice thickness and brightness temperatures taken during the +Pol-Ice Campaign are an excellent example since they illustrate many of the most important principles as well as demonstrating the necessity of taking into account the individual case. The Pol-Ice campaign was conducted in the N. 
Baltic in March 2007 as part of the SMOS-Ice project in preparation for the launch of the Soil Moisture and Ocean Salinity satellite. +Because of the low frequency of the SMOS instrument, it is hoped that it will render +information on sea ice thickness, therefore the campaign comprised measurements +of both sea ice thickness and emitted brightness temperature. +Brightness temperatures were measured with the EMIRAD L-band microwave radiometer + +carried on board an airplane. Ice thickness was measured with the E-M Bird ice thickness meter which was carried by a helicopter. The E-M Bird measures ice thickness with a combination of inductance measurements to determine the location of the ice-water interface and a laser altimeter to measure the height of the ice surface. +The map above shows the flight tracks of both instruments which were approximately coincident but obviously subject to pilot error. + +Since the flight paths of both aircraft were approximately linear, the first step in the collocation process was to convert all the coincident flights to Cartesian coordinates with the x-axis being lateral distance and the y-axis transverse distance. In this way, collocations can be performed in two ways: crudely, by matching only the x distances, and more precisely by matching both coordinates. +More importantly, the footprint size of the radiometer is many times larger +than that of the E-M Bird meter. The figure to the left shows +the antenna response function for the radiometer. +The full width at half maximum is 31 degrees. +Since the aircraft was flying at approximately 500 m, this translates +to a footprint size of 200 m or more. +Meanwhile, the footprint size of the E-M Bird was roughly 40 m +with a sample spacing of only 2 to 4 m. +Rather than looking to nearest neighbors, which would have produced +poor results, a weighted average of the thickness measurements was +performed for each radiometer measurement. 
+Weights were calculated based on the radiometer response function which is almost +a perfect Gaussian up to about 45 degrees. +Points could be excluded based on distance along the flight +path. +For validation of sea ice emissivity forward model calculations, +this was further refined by performing an emissivity calculation +for each thickness measurement and averaging over the radiometer +footprint. + +The figure below illustrates relative measurement locations +from each of the instruments used in the Pol-Ice campaign. +Two overpasses are shown: one from the airplane carrying the +EMIRAD radiometer and one from the helicopter carrying the +E-M Bird instrument. +The x-axis is along the line of the flight path. +EMIRAD footprints are drawn with lines, E-M Bird +inductance measurements are represented by circles +and LIDAR measurements with dots. + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Combinatorial_data_analysis-0.md b/data/en.wikipedia.org/wiki/Combinatorial_data_analysis-0.md new file mode 100644 index 000000000..96da23e6b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Combinatorial_data_analysis-0.md @@ -0,0 +1,21 @@ +--- +title: "Combinatorial data analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Combinatorial_data_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:38.930340+00:00" +instance: "kb-cron" +--- + +In statistics, combinatorial data analysis (CDA) is the study of data sets where the order in which objects are arranged is important. CDA can be used either to determine how well a given combinatorial construct reflects the observed data, or to search for a suitable combinatorial construct that does fit the data. 
+ + +== See also == +Cluster analysis +Geometric data analysis +Structured data analysis (statistics) +Seriation (statistics) + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Commentary_article-0.md b/data/en.wikipedia.org/wiki/Commentary_article-0.md new file mode 100644 index 000000000..7bdc7aa6f --- /dev/null +++ b/data/en.wikipedia.org/wiki/Commentary_article-0.md @@ -0,0 +1,14 @@ +--- +title: "Commentary article" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Commentary_article" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:54.688511+00:00" +instance: "kb-cron" +--- + +A commentary article is a short narrowly focused article that is usually commissioned by an Academic journal, though they are sometimes published from unsolicited submissions. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Concept_drift-0.md b/data/en.wikipedia.org/wiki/Concept_drift-0.md new file mode 100644 index 000000000..8200cea29 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Concept_drift-0.md @@ -0,0 +1,32 @@ +--- +title: "Concept drift" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Concept_drift" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:40.159460+00:00" +instance: "kb-cron" +--- + +In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models. + +== Predictive model decay == +In machine learning and predictive analytics this drift phenomenon is called concept drift. 
In machine learning, a common element of a data model is the statistical properties, such as the probability distribution, of the actual data. If they deviate from the statistical properties of the training data set, the learned predictions may become invalid unless the drift is addressed. + +== Data configuration decay == +Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" is a change in the meaning of data while the structure does not change. In many cases this may happen in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes in other areas of the software system. +For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., due to changes in business model, system updates, or switching the platform on which the system operates. +In the case of cloud computing, infrastructure drift that may affect the applications running on the cloud may be caused by updates of the cloud software. +There are several types of detrimental effects of data drift on data fidelity. Data corrosion is passing the drifted data into the system undetected. Data loss happens when valid data are ignored due to non-conformance with the applied schema. Squandering is the phenomenon when new data fields are introduced upstream in the data processing pipeline, but somewhere downstream these data fields are absent. + +== Inconsistent data == +"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. 
This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as by errors during data input. +"Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple way to detect such drift is to run checksums regularly. However, the remedy may not be so easy. + +== Examples == +The behavior of the customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted, and a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target set less accurately: confounding variables may have emerged that cannot be accounted for, causing the model accuracy to decrease progressively with time. Generally, it is advised to perform health checks as part of the post-production analysis and to re-train the model with new assumptions upon signs of concept drift. + +== Possible remedies == +To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. 
Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test or control charts from statistical process control, to explicitly detect concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy. A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples, and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble. +Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated by adding information about the season to the model. By providing information about the time of the year, the rate of deterioration of your model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors may arise at any time that influence shopping behavior, the influence of the known factors or their interactions may change. +Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary. 
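A reactive trigger of the kind described above can be sketched as a small error-rate monitor in the spirit of a control chart. This is an illustrative sketch only: the class name, the warm-up length of 30 samples, and the two- and three-standard-deviation thresholds are assumptions chosen for the example, not a reference implementation of any published detector.

```python
import math


class ErrorRateDriftMonitor:
    """Signals 'warning' or 'drift' when the running error rate rises
    well above the lowest level observed so far (hypothetical helper)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_samples=30):
        self.warning_level = warning_level  # assumed threshold, in std devs
        self.drift_level = drift_level      # assumed threshold, in std devs
        self.min_samples = min_samples      # warm-up before any decision
        self.reset()

    def reset(self):
        """Forget all statistics, e.g. after the model has been retrained."""
        self.n = 0
        self.errors = 0
        self.min_p = float("inf")
        self.min_s = float("inf")

    def update(self, is_error):
        """Feed one prediction outcome; returns 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(bool(is_error))
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)   # its binomial std deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s <= self.min_p + self.min_s:
            # Remember the best (lowest) error level seen so far
            self.min_p, self.min_s = p, s
        if p + s > self.min_p + self.drift_level * self.min_s:
            self.reset()
            return "drift"
        if p + s > self.min_p + self.warning_level * self.min_s:
            return "warning"
        return "stable"
```

In use, each prediction is compared with its true label once that label becomes available, and `update` is called with the outcome; a returned "drift" would be the cue to retrain or replace the model. As the text notes, performance can decay for some time before such a trigger fires.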
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Concept_drift-1.md b/data/en.wikipedia.org/wiki/Concept_drift-1.md new file mode 100644 index 000000000..9694f47a6 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Concept_drift-1.md @@ -0,0 +1,27 @@ +--- +title: "Concept drift" +chunk: 2/4 +source: "https://en.wikipedia.org/wiki/Concept_drift" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:40.159460+00:00" +instance: "kb-cron" +--- + +=== Remedy methods === +DDM (Drift Detection Method): detects drift by monitoring the model's error rate over time. When the error rate passes a set threshold, it enters a warning phase, and if it passes another threshold, it enters a drift phase. +EDDM (Early Drift Detection Method): improves DDM's detection rate by tracking the average distance between two errors instead of only the error rate. +ADWIN (Adaptive Windowing): dynamically stores a window of recent data and warns the user if it detects a significant change between the statistics of the window's earlier data compared to more recent data. +KSWIN (Kolmogorov–Smirnov Windowing): detects drift based on the Kolmogorov-Smirnov statistical test. +DDM and EDDM: Concept Drift Detection + +online supervised methods that rely on sequential error monitoring to estimate the evolving error rate. +ADWIN and KSWIN: Windowing + +maintain a "window", a subset of the most recent data, of the data stream, which it checks for statistical differences across the window. + +== Applications in security == +Concept drift is a recurring issue in security analytics, especially in malware and intrusion detection. In these systems, models are often trained on past logs, binaries or network traces, but the behaviour of attackers changes over time as new malware families, obfuscation techniques and campaigns appear. 
When the data no longer resemble the training set, the decision boundaries learned by classifiers or anomaly detectors can become misaligned with the current threat landscape and detection performance can drop unless the models are updated or replaced. +Several studies on Windows malware treat detection as an evolving data stream and track how performance changes as time passes. They show that classifiers trained on a fixed time window can perform well on nearby data but deteriorate quickly when evaluated on samples collected months or years later, even when large amounts of training data are available. To keep up with this, security systems often use sliding or adaptive windows, which restrict training to the most recent portion of the data so that older, less relevant examples are gradually discarded. They also employ drift detectors such as ADWIN and KSWIN that monitor error rates or changes in the distribution of recent observations and signal when the statistics of the incoming stream differ significantly from the past, prompting retraining or model replacement. +Related problems appear in spam filtering, fraud detection and intrusion detection, where adversaries change content, patterns of activity or network behavior to evade models trained on historical data. In these settings drift can be gradual, as new types of spam or fraud emerge, or abrupt, after a sudden shift in attack techniques. Common strategies to remain effective include updating models with recent labelled examples, using ensembles that give more weight to classifiers trained on newer data, and designing features that are less sensitive to superficial changes in how attacks are carried out. +Research on machine learning for security has also shown that not handling concept drift correctly during evaluation can introduce substantial bias. Researchers studied 30 learning-based security systems and found that many detectors were tested only over short time periods or only in laboratory settings. 
This ignores temporal correlations, non-stationarity, and the way attacks change in the real world, which can make these systems seem much more effective than they really are when they are put to use. The authors point out ways that people can snoop on time, like using features from newer malware samples when training models that are supposed to work on older data, calculating normalization statistics, or embeddings on the whole dataset. They also talk about using old benchmark datasets that don't train models for current threats. Researchers suggest using time-aware evaluation protocols that keep causal ordering, don't let future information leak into training, and make sure that security data takes into account both time and space so that reported performance better shows the effects of concept drift in real-world situations. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Concept_drift-2.md b/data/en.wikipedia.org/wiki/Concept_drift-2.md new file mode 100644 index 000000000..f20353231 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Concept_drift-2.md @@ -0,0 +1,36 @@ +--- +title: "Concept drift" +chunk: 3/4 +source: "https://en.wikipedia.org/wiki/Concept_drift" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:40.159460+00:00" +instance: "kb-cron" +--- + +=== Concept Drift, Concept Evolution, and Adversarial Manipulation === +Concept drift refers to when the relationship between input features and their corresponding labels changes gradually over time. Consider a simple example: phishing emails. The older kinds of attacks would leverage prominent signals, such as the presence of the word "lottery," while newer variants would introduce more polished phrasing, such as "lottery and not scam," in attempts to appear more legitimate and evade detection. This is a form of concept drift because the underlying class—phishing—remains the same, but the feature patterns that define it shift as attackers adapt. 
It is important, though, to distinguish concept drift from a related but inherently different challenge: concept evolution. Concept evolution occurs when an entirely new attack family emerges with feature patterns the model has never encountered. The model is exposed to new features and patterns that were not present in the training data. In the multi-class setting, this means a completely new label is introduced. In the binary setting (benign vs malicious), a completely new malicious family is introduced; however, the model still tries to make predictions based on outdated feature patterns. For example, a model may be trained on both trojan and ransomware features, and then crypto-miner malware emerges. The label is still "malicious", but the features are new to the model. Concept drift and concept evolution reflect how quickly attackers adapt and keep evolving their techniques to exploit the weaknesses in cyber-defense models. Attackers regularly update their payloads, communication methodologies, and even the APIs they depend on. As a result, a model trained only on historical data becomes outdated unless it is updated or designed to learn continuously. +These behavior changes do not always happen organically; attackers can introduce them intentionally. Adversarial attacks are a primary cause of concept drift, because slight, carefully crafted changes to malware can alter the input-label relationship, leading models to misclassify malware as benign. Adversarial attacks generally fall into two settings: white-box and black-box. In white-box settings, an attacker has full access to the model, including its architecture, parameters, and gradients. One example of a white-box attack is one in which an attacker uses gradient-based approaches to build slight perturbations to a sample, causing its misclassification. At the other extreme are black-box attacks. 
An adversary knows the input and output of the model but not its inner workings. Attackers can probe the model's outputs—sometimes sending thousands of slightly varied inputs—until something slips past the detector, or they train a substitute model and then generate adversarial examples that transfer to the original system. These tactics gradually shift the patterns a model observes, accelerating drift and making previously reliable defenses far less effective. +These attacker-driven manipulations not only cause concept drift but also reveal the weaknesses of modern learning techniques such as transfer learning. Transfer learning enables defenders to take an existing model, pre-trained for some other task, and fine-tune it rather than training from scratch. This has considerably increased success across diverse domains such as image classification and natural language processing. For malware detection, for instance, Microsoft and Intel recently demonstrated that converting malware binaries into grayscale images enables malware detection using pre-trained vision models, yielding strong performance—for example, 87% recall with only 0.1% false positives. While such an approach reduces the training time and computational cost drastically, it naturally brings one important drawback: the base model architecture is often publicly known. An adversary may use this fact and create adversarial examples designed to exploit the known weaknesses of the underlying model, which then enables those manipulations to "transfer" to the newly trained malware detector. In this regard, transfer learning offers the significant advantages of co-opting prior work while also running the risk of inheriting vulnerabilities that attackers can readily exploit. +In addition to rapidly evolving attacks, ML based defenses face a known issue of acquiring accurate data or labels. Label aggregators may encounter difficulty when labeling new samples, and initially these labels can be incorrect. 
As time passes, the labels typically stabilize toward their ground truth. This process can take anywhere from days to years. This is considered a contributing factor to concept drift because during this time the original labels may be updated, new attacks may be produced, or new classes may appear. This phenomenon is called delayed labels, and it contributes to the broader causes of concept drift. Therefore, delayed labels are often taken into account when building defense solutions and their evaluations. To detect drift between labels and sample mappings, drift detectors are commonly used during evaluation. + +== See also == +Data stream mining +Data mining +Snyk, a company whose portfolio includes drift detection in software applications + +== Further reading == +Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys and overviews are here: + +=== Reviews === + +== External links == + +=== Software === +Frouros: An open-source Python library for drift detection in machine learning systems. +NannyML: An open-source Python library for detecting univariate and multivariate distribution drift and estimating machine learning model performance without ground truth labels. +RapidMiner: Formerly Yet Another Learning Environment (YALE): free open-source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time-varying concepts, and tracking drifting concept. It is used in combination with its data stream mining plugin (formerly concept drift plugin). +EDDM (Early Drift Detection Method): free open-source implementation of drift detection methods in Weka. +MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. 
It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with Weka. + +=== Datasets === \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Concept_drift-3.md b/data/en.wikipedia.org/wiki/Concept_drift-3.md new file mode 100644 index 000000000..c7c42e625 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Concept_drift-3.md @@ -0,0 +1,70 @@ +--- +title: "Concept drift" +chunk: 4/4 +source: "https://en.wikipedia.org/wiki/Concept_drift" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:40.159460+00:00" +instance: "kb-cron" +--- + +==== Real ==== +USP Data Stream Repository, 27 real-world stream datasets with concept drift compiled by Souza et al. (2020). Access +Airline, approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E. Ikonomovska. Reference: Data Expo 2009 Competition [1]. Access +Chess.com (online games) and Luxembourg (social survey) datasets compiled by I. Zliobaite. Access +ECUE spam 2 datasets each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual. Access from S.J.Delany webpage +Elec2, electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of South Wales, 1999. Access from J.Gama webpage. Comment on applicability. +PAKDD'09 competition data represents the credit evaluation task. It is collected over a five-year period. Unfortunately, the true labels are released only for the first part of the data. Access +Sensor stream and Power supply stream datasets are available from X. Zhu's Stream Data Mining Repository. Access +SMEAR is a benchmark data stream with a lot of missing values. 
Environment observation data over 7 years. Predict cloudiness. Access +Text mining, a collection of text mining datasets with concept drift, maintained by I. Katakis. Access +Gas Sensor Array Drift Dataset, a collection of 13,910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations. Access + +==== Other ==== +KDD'99 competition data contains simulated intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift. Access + +==== Synthetic ==== +Extreme verification latency benchmark Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A. (2015). "Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency". Proceedings of the 2015 SIAM International Conference on Data Mining (SDM). SIAM. pp. 873–881. doi:10.1137/1.9781611974010.98. ISBN 978-1-61197-401-0. S2CID 19198944. Access from Nonstationary Environments – Archive. +Sine, Line, Plane, Circle and Boolean Data Sets Minku, L.L.; White, A.P.; Yao, X. (2010). "The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift" (PDF). IEEE Transactions on Knowledge and Data Engineering. 22 (5): 730–742. Bibcode:2010ITKDE..22..730M. doi:10.1109/TKDE.2009.156. S2CID 16592739. Access from L.Minku webpage. +SEA concepts Street, N.W.; Kim, Y. (2001). "A streaming ensemble algorithm (SEA) for large-scale classification" (PDF). KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 377–382. doi:10.1145/502512.502568. ISBN 978-1-58113-391-2. S2CID 11868540. Access from J.Gama webpage. +STAGGER Schlimmer, J.C.; Granger, R.H. (1986). "Incremental Learning from Noisy Data". Mach. Learn. 1 (3): 317–354. doi:10.1007/BF00116895. S2CID 33776987. +Mixed Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. (2004). "Learning with drift detection". 
Brazilian symposium on artificial intelligence. Springer. pp. 286–295. doi:10.1007/978-3-540-28645-5_29. ISBN 978-3-540-28645-5. S2CID 2606652. + +==== Data generation frameworks ==== +Minku, White & Yao 2010 Download from L.Minku webpage. +Lindstrom, P.; Delany, S.J.; MacNamee, B. (2008). "Autopilot: Simulating Changing Concepts in Real Data" (PDF). Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science. pp. 272–263. +Narasimhamurthy, A.; Kuncheva, L.I. (2007). "A framework for generating data to simulate changing environments". AIAP'07: Proceedings of the 25th IASTED International Multi-Conference: artificial intelligence and applications. pp. 384–389. Code + +=== Projects === +INFER: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland) +HaCDAIS: Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands) +KDUS: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal) +ADEPT: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK) +ALADDIN: autonomous learning agents for decentralised data and information networks (2005–2010) +GAENARI: C++ incremental decision tree algorithm that minimizes the damage caused by concept drift. (2022) + +=== Benchmarks === +NAB: The Numenta Anomaly Benchmark, benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. (2014–2018) + +=== Meetings === +2014 +Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments" @IEEE IJCNN 2014 +2013 +RealStream Real-World Challenges for Data Stream Mining Workshop-Discussion at the ECML PKDD 2013, Prague, Czech Republic.
+LEAPS 2013 The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments +2011 +LEE 2011 Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11 +HaCDAIS 2011 The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems +ICAIS 2011 Track on Incremental Learning +IJCNN 2011 Special Session on Concept Drift and Learning Dynamic Environments +CIDUE 2011 Symposium on Computational Intelligence in Dynamic and Uncertain Environments +2010 +HaCDAIS 2010 International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions +ICMLA10 Special Session on Dynamic learning in non-stationary environments +SAC 2010 Data Streams Track at ACM Symposium on Applied Computing +SensorKDD 2010 International Workshop on Knowledge Discovery from Sensor Data +StreamKDD 2010 Novel Data Stream Pattern Mining Techniques +Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence +MLMDS'2010 Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Design and Applications, ISDA'10 + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Continuous_analytics-0.md b/data/en.wikipedia.org/wiki/Continuous_analytics-0.md new file mode 100644 index 000000000..eeafd78f1 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Continuous_analytics-0.md @@ -0,0 +1,26 @@ +--- +title: "Continuous analytics" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Continuous_analytics" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:41.378945+00:00" +instance: "kb-cron" +--- + +Continuous analytics is a data science process that abandons ETLs and complex batch data pipelines in favor of cloud-native and microservices paradigms. 
Continuous data processing enables real time interactions and immediate insights with fewer resources. + + +== Defined == +Analytics is the application of mathematics and statistics to big data. Data scientists write analytics programs to look for solutions to business problems, like forecasting demand or setting an optimal price. The continuous approach runs multiple stateless engines which concurrently enrich, aggregate, infer and act on the data. Data scientists, dashboards and client apps all access the same raw or real-time data derivatives with proper identity-based security, data masking and versioning in real-time. +Traditionally, data scientists have not been part of IT development teams, like regular Java programmers. This is because their skills set them apart in their own department not normally related to IT, i.e., math, statistics, and data science. So it is logical to conclude that their approach to writing software code does not enjoy the same efficiencies as the traditional programming team. In particular traditional programming has adopted the Continuous Delivery approach to writing code and the agile methodology. That releases software in a continuous circle, called iterations. +Continuous analytics then is the extension of the continuous delivery software development model to the big data analytics development team. The goal of the continuous analytics practitioner then is to find ways to incorporate writing analytics code and installing big data software into the agile development model of automatically running unit and functional tests and building the environment system with automated tools. +To make this work means getting data scientists to write their code in the same code repository that regular programmers use so that software can pull it from there and run it through the build process. It also means saving the configuration of the big data cluster (sets of virtual machines) in some kind of repository as well. 
That facilitates sending out analytics code and big data software and objects in the same automated way as the continuous integration process. + + +== External links == +Continuous analytics +Development model + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Cosine_similarity-0.md b/data/en.wikipedia.org/wiki/Cosine_similarity-0.md new file mode 100644 index 000000000..4b2ed1834 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Cosine_similarity-0.md @@ -0,0 +1,699 @@ +--- +title: "Cosine similarity" +chunk: 1/3 +source: "https://en.wikipedia.org/wiki/Cosine_similarity" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:42.619881+00:00" +instance: "kb-cron" +--- + +In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval $[-1,+1]$. For example, two proportional vectors have a cosine similarity of +1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of −1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in $[0,1]$. +For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents.
+The technique is also used to measure cohesion within clusters in the field of data mining. +One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered. +Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to binary data. + +== Definition == +The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula: + +$$\mathbf{A}\cdot\mathbf{B}=\left\|\mathbf{A}\right\|\left\|\mathbf{B}\right\|\cos\theta$$ + +Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as + +$$\text{cosine similarity}=S_{C}(A,B):=\cos(\theta)=\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\cdot\sqrt{\sum_{i=1}^{n}B_{i}^{2}}},$$ + +where $A_{i}$ and $B_{i}$ are the $i$th components of vectors $\mathbf{A}$ and $\mathbf{B}$, respectively.
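
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `cosine_similarity`, the three-word vocabulary, and the term counts are all assumptions made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors over the vocabulary ["data", "mining", "stream"].
doc_short = [1, 2, 0]
doc_long = [10, 20, 0]  # same proportions as doc_short, ten times longer
doc_other = [0, 0, 3]   # shares no terms with the other two

print(cosine_similarity(doc_short, doc_long))   # close to 1: same direction
print(cosine_similarity(doc_short, doc_other))  # 0: orthogonal term vectors
```

Because only the direction of each vector matters, the short and the long document score as near-identical even though their lengths differ by a factor of ten.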
+The resulting similarity ranges from −1 meaning exactly opposite, to +1 meaning exactly the same, with 0 indicating orthogonality (no correlation), while in-between values indicate intermediate similarity or dissimilarity. +For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from $0\to 1$, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°. +If the attribute vectors are normalized by subtracting the vector means (e.g., $A-\bar{A}$), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering, + +$$\text{if}\,A=[A_{1},A_{2}]^{T},\text{ then }\bar{A}=\left[\frac{(A_{1}+A_{2})}{2},\frac{(A_{1}+A_{2})}{2}\right]^{T},$$ + +$$\text{ so }A-\bar{A}=\left[\frac{(A_{1}-A_{2})}{2},\frac{(-A_{1}+A_{2})}{2}\right]^{T}.$$
+ +=== Cosine distance === +When the distance between two unit-length vectors is defined to be the length of their vector difference then + +$$\operatorname{dist}(\mathbf{A},\mathbf{B})=\sqrt{(\mathbf{A}-\mathbf{B})\cdot(\mathbf{A}-\mathbf{B})}=\sqrt{\mathbf{A}\cdot\mathbf{A}-2(\mathbf{A}\cdot\mathbf{B})+\mathbf{B}\cdot\mathbf{B}}=\sqrt{2(1-S_{C}(\mathbf{A},\mathbf{B}))}\,.$$ + +Nonetheless the cosine distance is often defined without the square root or factor of 2: + +$$\text{cosine distance}=D_{C}(A,B):=1-S_{C}(A,B)\,.$$ + +It is important to note that, by virtue of being proportional to squared Euclidean distance, the cosine distance is not a true distance metric; it does not exhibit the triangle inequality property (or, more formally, the Schwarz inequality) and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to Euclidean distance $\sqrt{2(1-S_{C}(A,B))}$ or angular distance $\theta=\arccos(S_{C}(A,B))$. Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.
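
As a small numeric check of the point about the triangle inequality, the sketch below compares the cosine distance $1-S_{C}$ with the Euclidean (chord) form $\sqrt{2(1-S_{C})}$ on three unit vectors at 0°, 45°, and 90°. The helper names and the choice of vectors are assumptions made for illustration only.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """D_C = 1 - S_C; not a true metric."""
    return 1.0 - cos_sim(a, b)


def chord_distance(a, b):
    """Euclidean distance between the unit-normalized vectors: sqrt(2 (1 - S_C))."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos_sim(a, b))))


# Three unit vectors at 0, 45 and 90 degrees (illustrative choice).
A = (1.0, 0.0)
C = (math.cos(math.pi / 4), math.sin(math.pi / 4))
B = (0.0, 1.0)

# Cosine distance: the direct path A -> B is longer than the detour via C.
print(cosine_distance(A, B), cosine_distance(A, C) + cosine_distance(C, B))

# Chord distance: the triangle inequality holds.
print(chord_distance(A, B), chord_distance(A, C) + chord_distance(C, B))
```

With these vectors the cosine distance from A to B is 1, while the detour through C sums to roughly 0.59, so the triangle inequality fails; the chord distance does not show this problem.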
+ +=== Angular distance and similarity === +The normalized angle, referred to as angular distance, between any two vectors $A$ and $B$ is a formal distance metric and can be calculated from the cosine similarity. The complement of the angular distance metric can then be used to define angular similarity function bounded between 0 and 1, inclusive. +When the vector elements may be positive or negative: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Cosine_similarity-1.md b/data/en.wikipedia.org/wiki/Cosine_similarity-1.md new file mode 100644 index 000000000..8ca42adde --- /dev/null +++ b/data/en.wikipedia.org/wiki/Cosine_similarity-1.md @@ -0,0 +1,593 @@ +--- +title: "Cosine similarity" +chunk: 2/3 +source: "https://en.wikipedia.org/wiki/Cosine_similarity" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:42.619881+00:00" +instance: "kb-cron" +--- + +$$\text{angular distance}=D_{\theta}:=\frac{\arccos(\text{cosine similarity})}{\pi}=\frac{\theta}{\pi}$$ + +$$\text{angular similarity}=S_{\theta}:=1-\text{angular distance}=1-\frac{\theta}{\pi}$$ + +Or, if the vector elements are always positive: + +$$\text{angular distance}=D_{\theta}:=\frac{2\cdot\arccos(\text{cosine similarity})}{\pi}=\frac{2\theta}{\pi}$$ + +$$\text{angular similarity}=S_{\theta}:=1-\text{angular distance}=1-\frac{2\theta}{\pi}$$ +
+Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above. + +=== L2-normalized Euclidean distance === +Another effective proxy for cosine distance can be obtained by $L_{2}$ normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form distance is often used within many deep learning algorithms.
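
The two proxies described above can be compared with a short sketch. It assumes a hypothetical query vector and three made-up candidates, and checks that ranking by cosine distance and ranking by Euclidean distance after $L_{2}$ normalisation give the same order; `angular_distance` follows the two formulas for signed and non-negative vectors.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def l2_normalize(v):
    """Divide each component by the vector's magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angular_distance(a, b, nonnegative=False):
    """theta / pi in general, 2 * theta / pi when all components are non-negative."""
    s = max(-1.0, min(1.0, cos_sim(a, b)))  # clamp against rounding error
    theta = math.acos(s)
    return (2.0 if nonnegative else 1.0) * theta / math.pi


# Hypothetical query and candidate vectors.
query = [3.0, 1.0]
candidates = {"a": [6.0, 2.5], "b": [1.0, 4.0], "c": [-2.0, 1.0]}

by_cosine = sorted(candidates, key=lambda k: 1.0 - cos_sim(query, candidates[k]))
by_euclid = sorted(
    candidates,
    key=lambda k: euclidean(l2_normalize(query), l2_normalize(candidates[k])),
)

print(by_cosine, by_euclid)                     # same ranking from both proxies
print(angular_distance(query, candidates["c"]))  # 135 degrees, so about 0.75
```

The Euclidean ranking needs no trigonometric call, which is the practical appeal of the normalised form when only the ordering of neighbours matters.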
+ +=== Otsuka–Ochiai coefficient === +In biology, there is a similar concept known as the Otsuka–Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka, Japanese: 大塚 弥之助) and Akira Ochiai (Japanese: 落合 明), also known as the Ochiai–Barkman or Ochiai coefficient, which can be represented as: + +$$K=\frac{|A\cap B|}{\sqrt{|A|\times |B|}}$$ + +Here, $A$ and $B$ are sets, and $|A|$ is the number of elements in $A$. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity. It is identical to the score introduced by Godfrey Thomson. +In a recent book, the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned) by citing an article by Ikuso Hamai (Japanese: 浜井 生三), who in turn cites the original 1936 article by Yanosuke Otsuka. + +== Properties == +The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant $a$ and vector $V$, the vectors $V$ and $aV$ are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents.
+However more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence have been shown to have improved semantics in at least some contexts. + +Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual $\|A-B\|$, and observe that + +$$\|A-B\|^{2}=(A-B)\cdot(A-B)=\|A\|^{2}+\|B\|^{2}-2(A\cdot B)$$ (polarization identity) +by expansion. When A and B are normalized to unit length, $\|A\|^{2}=\|B\|^{2}=1$ so this expression is equal to + +$$2(1-\cos(A,B)).$$ + +In short, the cosine distance can be expressed in terms of Euclidean distance as + +$$D_{C}(A,B)=\frac{\|A-B\|^{2}}{2}\quad\mathrm{when}\quad\|A\|^{2}=\|B\|^{2}=1.$$ +The Euclidean distance is called the chord distance (because it is the length of the chord on the unit circle) and it is the Euclidean distance between the vectors which were normalized to unit sum of squared values within them. +Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors.
This distribution has a mean of zero and a variance of $1/n$ (where $n$ is the number of dimensions), and although the distribution is bounded between −1 and +1, as $n$ grows large the distribution is increasingly well-approximated by the normal distribution. For other types of data, such as bitstreams, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean. + +== Triangle inequality for cosine similarity == +The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that + +$$|~\angle{AC}-\angle{CB}~|\leq~\angle{AB}~\leq~\angle{AC}~+~\angle{CB}~.$$ + +Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Cosine_similarity-2.md b/data/en.wikipedia.org/wiki/Cosine_similarity-2.md new file mode 100644 index 000000000..dbca3ce1b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Cosine_similarity-2.md @@ -0,0 +1,409 @@ +--- +title: "Cosine similarity" +chunk: 3/3 +source: "https://en.wikipedia.org/wiki/Cosine_similarity" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:42.619881+00:00" +instance: "kb-cron" +--- + +$$\cos(\angle{AC}-\angle{CB})\geq\cos(\angle{AB})\geq\cos(\angle{AC}+\angle{CB}).$$ +
+Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines, + +$$\cos(A,C)\cdot\cos(C,B)+\sqrt{\left(1-\cos(A,C)^{2}\right)\cdot\left(1-\cos(C,B)^{2}\right)}\geq\cos(A,B),$$ + +$$\cos(A,B)\geq\cos(A,C)\cdot\cos(C,B)-\sqrt{\left(1-\cos(A,C)^{2}\right)\cdot\left(1-\cos(C,B)^{2}\right)}.$$ + +This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C are already known. This is used for example in metric data indexing, but has also been used to accelerate spherical k-means clustering the same way the Euclidean triangle inequality has been used to accelerate regular k-means. + +== Soft cosine measure == +A soft cosine or ("soft" similarity) between two vectors considers similarities between pairs of features. The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which help generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity. +For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive.
Features such as words, n-grams, or syntactic n-grams can be quite similar, though formally they are considered as different features in the VSM. For example, words "play" and "game" are different words and thus mapped to different points in VSM; yet they are semantically related. In case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well). +For calculating soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix. +Given two N-dimension vectors $a$ and $b$, the soft cosine similarity is calculated as follows: + +$$\operatorname{soft\_cosine}_{1}(a,b)=\frac{\sum\nolimits_{i,j}^{N}s_{ij}a_{i}b_{j}}{\sqrt{\sum\nolimits_{i,j}^{N}s_{ij}a_{i}a_{j}}\sqrt{\sum\nolimits_{i,j}^{N}s_{ij}b_{i}b_{j}}},$$ + +where $s_{ij}=\operatorname{similarity}(\text{feature}_{i},\text{feature}_{j})$. +If there is no similarity between features ($s_{ii}=1$, $s_{ij}=0$ for $i\neq j$), the given equation is equivalent to the conventional cosine similarity formula. +The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic. An efficient implementation of such soft cosine similarity is included in the Gensim open source library.
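
The soft cosine formula can be written out directly as a quadratic-time sketch in plain Python. This is not the Gensim implementation; the function name `soft_cosine`, the three-word vocabulary, and the similarity value 0.5 between "play" and "game" are assumptions chosen for illustration.

```python
import math


def soft_cosine(a, b, s):
    """Soft cosine of vectors a and b under feature-similarity matrix s."""

    def bilinear(u, v):
        # sum over all feature pairs (i, j) of s[i][j] * u[i] * v[j]
        return sum(
            s[i][j] * u[i] * v[j] for i in range(len(u)) for j in range(len(v))
        )

    return bilinear(a, b) / (math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b)))


# Hypothetical vocabulary: ["play", "game", "tax"].
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
related = [
    [1.0, 0.5, 0.0],  # "play" and "game" assumed to be 0.5 similar
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_play = [1, 0, 0]
doc_game = [0, 1, 0]

print(soft_cosine(doc_play, doc_game, identity))  # ordinary cosine: no shared terms
print(soft_cosine(doc_play, doc_game, related))   # related features raise the score
```

With the identity matrix the function reduces to the conventional cosine similarity, matching the special case stated above.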
+ +== See also == +Sørensen–Dice coefficient +Hamming distance +Correlation +Jaccard index +SimRank +Information retrieval + +== References == + +== External links == +Weighted cosine measure +A tutorial on cosine similarity using Python \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining-0.md b/data/en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining-0.md new file mode 100644 index 000000000..0e7a5c9d1 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining-0.md @@ -0,0 +1,42 @@ +--- +title: "Cross-industry standard process for data mining" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:43.884915+00:00" +instance: "kb-cron" +--- + +The Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely used analytics model. +In 2015, IBM released a new methodology called Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM), which refines and extends CRISP-DM. + + +== History == +CRISP-DM was conceived in 1996 and became a European Union project under the ESPRIT funding initiative in 1997. The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation, and OHRA, an insurance company. +This core consortium brought different experiences to the project. ISL was later acquired and merged into SPSS. The computer giant NCR Corporation produced the Teradata data warehouse and its own data mining software. Daimler-Benz had a significant data mining team. OHRA was starting to explore the potential use of data mining.
+The first version of the methodology was presented at the 4th CRISP-DM SIG Workshop in Brussels in March 1999, and published as a step-by-step data mining guide later that year. +Between 2006 and 2008, a CRISP-DM 2.0 SIG was formed, and there were discussions about updating the CRISP-DM process model. The current status of these efforts is not known. However, the original crisp-dm.org website cited in the reviews and the CRISP-DM 2.0 SIG website are both no longer active. +While many non-IBM data mining practitioners use CRISP-DM, IBM is the primary corporation that currently uses the CRISP-DM process model. It makes some of the old CRISP-DM documents available for download and has incorporated the model into its SPSS Modeler product. +Based on current research, CRISP-DM is the most widely used form of data-mining model because of its various advantages, which solved existing problems in the data mining industries. One drawback of this model is that it does not cover project management activities. The success of CRISP-DM is largely attributable to the fact that it is industry, tool, and application neutral. + + +== Major phases == + +CRISP-DM breaks the process of data mining into six major phases: + +Business Understanding +Data Understanding +Data Preparation +Modeling +Evaluation +Deployment +The sequence of the phases is not strict and moving back and forth between different phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones. 
+ + +== Polls and Alternative Process Frameworks == +Polls conducted at the same website (KDNuggets) in 2002, 2004, 2007, and 2014 show that it was the leading methodology used by industry data miners who decided to respond to the survey. The only other data mining approach named in these polls was SEMMA. However, SAS Institute clearly states that SEMMA is not a data mining methodology, but rather a "logical organization of the functional toolset of SAS Enterprise Miner." A review and critique of data mining process models in 2009 called the CRISP-DM the "de facto standard for developing data mining and knowledge discovery projects." Other reviews of CRISP-DM and data mining process models include Kurgan and Musilek's 2006 review, and Azevedo and Santos' 2008 comparison of CRISP-DM and SEMMA. Efforts to update the methodology started in 2006, but have, as of June 2015, not led to a new version, and the "Special Interest Group" (SIG) responsible along with the website has long disappeared (see History of CRISP-DM). +In 2024, Harvard Business Review published an updated framework, bizML, that is designed for greater relevance to business personnel and to be specific for machine learning projects in particular, rather than for analytics, data science, or data mining projects in general. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/DOACROSS_parallelism-0.md b/data/en.wikipedia.org/wiki/DOACROSS_parallelism-0.md new file mode 100644 index 000000000..2220ded3e --- /dev/null +++ b/data/en.wikipedia.org/wiki/DOACROSS_parallelism-0.md @@ -0,0 +1,32 @@ +--- +title: "DOACROSS parallelism" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/DOACROSS_parallelism" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:04.174776+00:00" +instance: "kb-cron" +--- + +DOACROSS parallelism is a parallelization technique used to perform Loop-level parallelism by utilizing synchronisation primitives between statements in a loop. This technique is used when a loop cannot be fully parallelized by DOALL parallelism due to data dependencies between loop iterations, typically loop-carried dependencies. The sections of the loop which contain loop-carried dependence are synchronized, while treating each section as a parallel task on its own. Therefore, DOACROSS parallelism can be used to complement DOALL parallelism to reduce loop execution times. + + +== Description == +DOACROSS parallelism is particularly useful when one statement depends on the values generated by another statement. In such a loop, DOALL parallelism can not be implemented in a straightforward manner. If the first statement blocks the execution of the second statement until the required value has been produced, then the two statements would be able to execute independent of each other (i.e.), each of the aforementioned statements would be parallelized for simultaneous execution using DOALL parallelism. +The following pseudocode illustrates the operation of DOACROSS parallelism in such a situation. + + +=== Example === +In this example, each iteration of the loop requires the value written into a by the previous iteration. However, the entire statement is not dependent on the previous iteration, but only a portion of it. 
The statement is split into two blocks to illustrate this. The first statement has no loop-carried dependence now, and the result of this statement is stored in the variable temp. The post () command is used to signal that the required result has been produced for utilization by other threads. The wait (i-2) command waits for the value a[i-2] before unblocking. The execution time of DOACROSS parallelism largely depends on what fraction of the program suffers from loop-carried dependence. Larger gains are observed when a sizable portion of the loop is affected by loop-carried dependence. + + +== Drawbacks == +DOACROSS parallelism suffers from significant space and granularity overheads due to the synchronization primitives used. Modern day compilers often overlook this method because of this major disadvantage. The overheads may be reduced by reducing the frequency of synchronization across the loop, by applying the primitives for groups of statements at a time. + + +== See also == +Data parallelism +Task parallelism + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/DOE_mean_plot-0.md b/data/en.wikipedia.org/wiki/DOE_mean_plot-0.md new file mode 100644 index 000000000..4fcfa4840 --- /dev/null +++ b/data/en.wikipedia.org/wiki/DOE_mean_plot-0.md @@ -0,0 +1,20 @@ +--- +title: "DOE mean plot" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/DOE_mean_plot" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:06.621514+00:00" +instance: "kb-cron" +--- + +In statistics, specifically the analysis of data in the design of experiments, a DOE mean plot is a graphical tool used to analyze data from an experiment. It demonstrates the relative importance of all the factors in an experiment with respect to a chosen location statistic, in this case the mean. Each factor is plotted and all mean levels of the factor (two or more) are connected with straight lines. 
The plot is meant to complement other analyses, such as the typical analysis of variance. + + +== Example == + +Shown below is an example from Box, Hunter, and Hunter (1978). In it, several factors are chosen to determine their effect(s) on bike speed. These factors are: seat height, dynamo engaged (yes/no), angle of the handlebar, chosen gear, raincoat (on/off), breakfast (yes/no), and tire pressure (low/high), each at two levels. The plot shows the mean response level for each variable (vertical) versus the factor (horizontal). +To interpret the graph, all that is needed is to determine the ordering of the factors by length of the line connecting the two dots. The longer the line, the more significant the factor. Therefore, this plot clearly demonstrates that factor 4 (chosen gear) is the most important factor in determining bike speed, factor 2 (dynamo engaged or not) is the second most important factor, and so on. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Dark_data-0.md b/data/en.wikipedia.org/wiki/Dark_data-0.md new file mode 100644 index 000000000..9e6d29121 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Dark_data-0.md @@ -0,0 +1,37 @@ +--- +title: "Dark data" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Dark_data" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:45.047576+00:00" +instance: "kb-cron" +--- + +Dark data is data which is acquired through various computer network operations but not used in any manner to derive insights or for decision making. The ability of an organisation to collect data can exceed the throughput at which it can analyse the data. In some cases the organisation may not even be aware that the data is being collected. IBM estimate that roughly 90 percent of data generated by sensors and analog-to-digital conversions never get used. +In an industrial context, dark data can include information gathered by sensors and telematics. 
+Organizations retain dark data for a multitude of reasons, and it is estimated that most companies are only analyzing 1% of their data. Often it is stored for regulatory compliance and record keeping. Some organizations believe that dark data could be useful to them in the future, once they have acquired better analytic and business intelligence technology to process the information. Because storage is inexpensive, storing data is easy. However, storing and securing the data usually entails greater expenses (or even risk) than the potential return profit. +In academic discourse, the term dark data was essentially coined by Bryan P. Heidorn. He uses it to describe research data, especially from the long tail of science (the many, small research projects), which are not or no longer available for research because they disappear in a drawer without adequate data management. Without such management, the data become dark; further reasons include missing metadata annotation, missing data management plans, and missing data curators. + + +== Analysis == +The term "dark data" very often refers to data that is not amenable to computer processing. For example, a company might have a great deal of data that exists only as scanned page-images. Even the bare text in such documents is not available without something like optical character recognition (OCR), which can vary greatly in accuracy. Even with OCR, the significance of each part of the data is unavailable. An obvious example is whether a capitalized word is a name or not, and if so, whether it represents a person, place, organization, or even a work of art. The same applies to bibliographic and other references, data within tables (that may be labeled quite adequately for humans, but not for processing), and countless assertions represented with the full complexity and ambiguity of human language. 
+A lot of unused data is very valuable and would be used if it could be, but it is blocked because it is in formats that are difficult to process, categorise, identify, and analyse. Often the reason that businesses do not use their dark data is the amount of resources it would take and the difficulty of having that data analysed. In other words, the data is "dark" not because it is not used, but because it cannot (feasibly or affordably) be used, given its poor representation. +There are many data representations that can make data much more accessible for automation. However, a great deal of information lacks any such identification of information items or relationships; and much more loses it during "downhill" conversion such as saving to page-oriented representations, printing, scanning, or faxing. The journey back "uphill" can be costly. +According to Computer Weekly, 60% of organisations believe that their own business intelligence reporting capability is "inadequate" and 65% say that they have "somewhat disorganised content management approaches". + + +== Relevance == +Useful data may become dark data after it becomes irrelevant, as it is not processed fast enough. This is called "perishable insights" in "live flowing data". For example, if the geolocation of a customer is known to a business, the business can make an offer based on the location; however, if this data is not processed immediately, it may be irrelevant in the future. According to IBM, about 60 percent of data loses its value immediately. + + +== Storage == +According to the New York Times, 90% of energy used by data centres is wasted. If data was not stored, energy costs could be saved. Furthermore, there are costs associated with the underutilisation of information and thus missed opportunities. According to Datamation, "the storage environments of EMEA organizations consist of 54 percent dark data, 32 percent redundant, obsolete and trivial data and 14 percent business-critical data. 
By 2020, this can add up to $891 billion in storage and management costs that can otherwise be avoided." +The continuous storage of dark data can put an organisation at risk, especially if this data is sensitive. In the case of a breach, this can result in serious repercussions. These can be financial, legal and can seriously hurt an organisation's reputation. For example, a breach of private records of customers could result in the stealing of sensitive information, which could result in identity theft. Another example could be the breach of the company's own sensitive information, for example relating to research and development. These risks can be mitigated by assessing and auditing whether this data is useful to the organisation, employing strong encryption and security and finally, if it is determined to be discarded, then it should be discarded in a way that it becomes unretrievable. + + +== Future == +It is generally considered that as more advanced computing systems for analysis of data are built, the higher the value of dark data will be. It has been noted that "data and analytics will be the foundation of the modern industrial revolution". Of course, this includes data that is currently considered "dark data" since there are not enough resources to process it. All this data that is being collected can be used in the future to bring maximum productivity and an ability for organisations to meet consumers' demand. Technology advancements are helping to leverage this dark data affordably. Furthermore, many organisations do not realise the value of dark data right now, for example in healthcare and education organisations deal with large amounts of data that could create a significant "potential to service students and patients in the manner in which the consumer and financial services pursue their target population". 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data-driven_model-0.md b/data/en.wikipedia.org/wiki/Data-driven_model-0.md new file mode 100644 index 000000000..b55a98cd8 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data-driven_model-0.md @@ -0,0 +1,25 @@ +--- +title: "Data-driven model" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Data-driven_model" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:55.946685+00:00" +instance: "kb-cron" +--- + +Data-driven models are a class of computational models that primarily rely on historical data collected throughout a system's or process' lifetime to establish relationships between input, internal, and output variables. Commonly found in numerous articles and publications, data-driven models have evolved from earlier statistical models, overcoming limitations posed by strict assumptions about probability distributions. These models have gained prominence across various fields, particularly in the era of big data, artificial intelligence, and machine learning, where they offer valuable insights and predictions based on the available data. + + +== Background == +These models have evolved from earlier statistical models, which were based on certain assumptions about probability distributions that often proved to be overly restrictive. The emergence of data-driven models in the 1950s and 1960s coincided with the development of digital computers, advancements in artificial intelligence research, and the introduction of new approaches in non-behavioural modelling, such as pattern recognition and automatic classification. + + +== Key Concepts == +Data-driven models encompass a wide range of techniques and methodologies that aim to intelligently process and analyse large datasets. 
Examples include fuzzy logic, fuzzy and rough sets for handling uncertainty, neural networks for approximating functions, global optimization and evolutionary computing, statistical learning theory, and Bayesian methods. These models have found applications in various fields, including economics, customer relations management, financial services, medicine, and the military, among others. +Machine learning, a subfield of artificial intelligence, is closely related to data-driven modelling as it also focuses on using historical data to create models that can make predictions and identify patterns. In fact, many data-driven models incorporate machine learning techniques, such as regression, classification, and clustering algorithms, to process and analyse data. +In recent years, the concept of data-driven models has gained considerable attention in the field of water resources, with numerous applications, academic courses, and scientific publications using the term as a generalization for models that rely on data rather than physics. This classification has been featured in various publications and has even spurred the development of hybrid models in the past decade. Hybrid models attempt to quantify the degree of physically based information used in hydrological models and determine whether the process of building the model is primarily driven by physics or purely data-based. As a result, data-driven models have become an essential topic of discussion and exploration within water resources management and research. 
+The term "data-driven modelling" (DDM) refers to the overarching paradigm of using historical data in conjunction with advanced computational techniques, including machine learning and artificial intelligence, to create models that can reveal underlying trends and patterns and, in some cases, make predictions. Data-driven models can be built with or without detailed knowledge of the underlying processes governing the system behavior, which makes them particularly useful when such knowledge is missing or fragmented. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data-informed_decision-making-0.md b/data/en.wikipedia.org/wiki/Data-informed_decision-making-0.md new file mode 100644 index 000000000..427bac59e --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data-informed_decision-making-0.md @@ -0,0 +1,15 @@ +--- +title: "Data-informed decision-making" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Data-informed_decision-making" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:57.178692+00:00" +instance: "kb-cron" +--- + +Data-informed decision-making (DIDM) refers to the collection and analysis of data to guide decisions and improve chances of success. Another form of this process is referred to as data-driven decision-making, "which is defined similarly as making decisions based on hard data as opposed to intuition, observation, or guesswork." DIDM is used in education communities, where data is used with the goal of helping students and improving curricula, among other fields in which data is used to inform decisions. While "data based decision-making" is a more common term, "data-informed decision-making" is the preferred term, since decisions should not be based solely on quantitative data. Data-driven decision-making is commonly used in the context of business growth and entrepreneurship. Many educators have access to some type of a data system for analyzing their students' data. 
These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system, making key package/display and content decisions) to improve the success of educators' data-informed decision-making. In business, fostering and actively supporting data-driven decision-making in their firm and among their colleagues may be one of the central responsibilities of CIOs (Chief Information Officers) or CDOs (Chief Data Officers). +Assessment in higher education is a form of data-driven decision-making aimed at using evidence of what students learn to improve curriculum, student learning, and teaching. Standardized tests, grades, and student work scored by rubrics are forms of student learning outcomes assessment. Formative assessments, specifically, allow educators to use data from student performance more immediately to modify teaching and learning strategies. There are numerous organizations aimed at promoting the assessment of student learning through DIDM, including the National Institute for Learning Outcomes Assessment, the Association for the Assessment of Student Learning in Higher Education, and, to an extent, the Association of American Colleges and Universities. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_access-0.md b/data/en.wikipedia.org/wiki/Data_access-0.md new file mode 100644 index 000000000..b5bbfe4c7 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_access-0.md @@ -0,0 +1,33 @@ +--- +title: "Data access" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Data_access" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:46.267466+00:00" +instance: "kb-cron" +--- + +Data access is a generic term referring to a process which has both an IT-specific meaning and other connotations involving access rights in a broader legal and/or political sense. 
In the former it typically refers to software and activities related to storing, retrieving, or acting on data housed in a database or other repository. + + +== Details == +Two fundamental types of data access exist: + +sequential access (as in magnetic tape, for example) +random access (as in indexed media) +Data access crucially involves authorization to access different data repositories. Data access can help distinguish the abilities of administrators and users. For example, administrators may have the ability to remove, edit and add data, while general users may not even have "read" rights if they lack access to particular information. +Historically, each repository (including each different database, file system, etc.), might require the use of different methods and languages, and many of these repositories stored their content in different and incompatible formats. +Over the years standardized languages, methods, and formats, have developed to serve as interfaces between the often proprietary, and always idiosyncratic, specific languages and methods. Such standards include SQL (1974- ), ODBC (ca 1990- ), JDBC, XQJ, ADO.NET, XML, XQuery, XPath (1999- ), and Web Services. +Some of these standards enable translation of data from unstructured (such as HTML or free-text files) to structured (such as XML or SQL). +Structures such as connection strings and DBURLs +can attempt to standardise methods of connecting to databases. 
+ + +== See also == +Right of access to personal data +Data access object +Data access layer + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_analysis-0.md b/data/en.wikipedia.org/wiki/Data_analysis-0.md index a6adb6c72..1f5d88d0a 100644 --- a/data/en.wikipedia.org/wiki/Data_analysis-0.md +++ b/data/en.wikipedia.org/wiki/Data_analysis-0.md @@ -4,7 +4,7 @@ chunk: 1/5 source: "https://en.wikipedia.org/wiki/Data_analysis" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:27:21.931767+00:00" +date_saved: "2026-05-05T09:53:29.395055+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Data_analysis-1.md b/data/en.wikipedia.org/wiki/Data_analysis-1.md index 1102c1c7a..03b46da4b 100644 --- a/data/en.wikipedia.org/wiki/Data_analysis-1.md +++ b/data/en.wikipedia.org/wiki/Data_analysis-1.md @@ -4,7 +4,7 @@ chunk: 2/5 source: "https://en.wikipedia.org/wiki/Data_analysis" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:27:21.931767+00:00" +date_saved: "2026-05-05T09:53:29.395055+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Data_analysis-2.md b/data/en.wikipedia.org/wiki/Data_analysis-2.md index 4b2368e5e..ff3b33ceb 100644 --- a/data/en.wikipedia.org/wiki/Data_analysis-2.md +++ b/data/en.wikipedia.org/wiki/Data_analysis-2.md @@ -4,7 +4,7 @@ chunk: 3/5 source: "https://en.wikipedia.org/wiki/Data_analysis" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:27:21.931767+00:00" +date_saved: "2026-05-05T09:53:29.395055+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Data_analysis-3.md b/data/en.wikipedia.org/wiki/Data_analysis-3.md index fc6e9ccb0..3894e6bff 100644 --- a/data/en.wikipedia.org/wiki/Data_analysis-3.md +++ b/data/en.wikipedia.org/wiki/Data_analysis-3.md @@ -4,7 +4,7 @@ chunk: 4/5 source: "https://en.wikipedia.org/wiki/Data_analysis" category: "reference" tags: 
"science, encyclopedia" -date_saved: "2026-05-05T06:27:21.931767+00:00" +date_saved: "2026-05-05T09:53:29.395055+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Data_analysis-4.md b/data/en.wikipedia.org/wiki/Data_analysis-4.md index c3d6be43c..09b09e0d1 100644 --- a/data/en.wikipedia.org/wiki/Data_analysis-4.md +++ b/data/en.wikipedia.org/wiki/Data_analysis-4.md @@ -4,7 +4,7 @@ chunk: 5/5 source: "https://en.wikipedia.org/wiki/Data_analysis" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:27:21.931767+00:00" +date_saved: "2026-05-05T09:53:29.395055+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Data_analysis_for_fraud_detection-0.md b/data/en.wikipedia.org/wiki/Data_analysis_for_fraud_detection-0.md new file mode 100644 index 000000000..82ff321f0 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_analysis_for_fraud_detection-0.md @@ -0,0 +1,42 @@ +--- +title: "Data analysis for fraud detection" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Data_analysis_for_fraud_detection" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:47.474137+00:00" +instance: "kb-cron" +--- + +Fraud represents a significant problem for governments and businesses and specialized analysis techniques for discovering fraud using them are required. Some of these methods include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes. +In general, the primary reason to use data analytics techniques is to tackle fraud since many internal control systems have serious weaknesses. For example, the currently prevailing approach employed by many law enforcement agencies to detect companies involved in potential cases of fraud consists in receiving circumstantial evidence or complaints from whistleblowers. 
As a result, a large number of fraud cases remain undetected and unprosecuted. In order to effectively test, detect, validate, correct errors in, and monitor control systems against fraudulent activities, business entities and organizations rely on specialized data analytics techniques such as data mining, data matching, the "sounds like" function, regression analysis, clustering analysis, and gap analysis. Techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence. + +== Statistical techniques == + +Examples of statistical data analysis techniques are: + +Data preprocessing techniques for detection, validation, error correction, and filling up of missing or incorrect data. +Calculation of various statistical parameters such as averages, quantiles, performance metrics, probability distributions, and so on. For example, the averages may include average length of call, average number of calls per month and average delays in bill payment. +Models and probability distributions of various business activities either in terms of various parameters or probability distributions. +Computing user profiles. +Time-series analysis of time-dependent data. +Clustering and classification to find patterns and associations among groups of data. +Data matching is used to compare two sets of collected data. The process can be performed based on algorithms or programmed loops that try to match sets of data against each other or compare complex data types. Data matching is also used to remove duplicate records and identify links between two data sets for marketing, security or other uses. +The "sounds like" function is used to find values that sound similar. Phonetic similarity is one way to locate possible duplicate values, or inconsistent spelling in manually entered data. 
The ‘sounds like’ function converts the comparison strings to four-character American Soundex codes, which are based on the first letter, and the first three consonants after the first letter, in each string. +Regression analysis allows you to examine the relationship between two or more variables of interest. Regression analysis estimates relationships between independent variables and a dependent variable. This method can be used to help understand and identify relationships among variables and predict actual results. +Gap analysis is used to determine whether business requirements are being met, if not, what are the steps that should be taken to meet successfully. +Matching algorithms to detect anomalies in the behavior of transactions or users as compared to previously known models and profiles. Techniques are also needed to eliminate false alarms, estimate risks, and predict future of current transactions or users. +Some forensic accountants specialize in forensic analytics which is the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are data collection, data preparation, data analysis, and reporting. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted or divertible for personal use. + +== Artificial intelligence == +Fraud detection is a knowledge-intensive activity. The main AI techniques used for fraud detection include: + +Data mining to classify, cluster, and segment the data and automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud. +Expert systems to encode expertise for detecting fraud in the form of rules. +Pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior either automatically (unsupervised) or to match given inputs. 
+Machine learning techniques to automatically identify characteristics of fraud. +Neural nets to independently generate classification, clustering, generalization, and forecasting that can then be compared against conclusions raised in internal audits or formal financial documents such as 10-Q. +Other techniques such as link analysis, Bayesian networks, decision theory, and sequence matching are also used for fraud detection. A new and novel technique called System properties approach has also been employed where ever rank data is available. +Statistical analysis of research data is the most comprehensive method for determining if data fraud exists. Data fraud as defined by the Office of Research Integrity (ORI) includes fabrication, falsification and plagiarism. + +== Machine learning and data mining == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_analysis_for_fraud_detection-1.md b/data/en.wikipedia.org/wiki/Data_analysis_for_fraud_detection-1.md new file mode 100644 index 000000000..c13c24d9c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_analysis_for_fraud_detection-1.md @@ -0,0 +1,40 @@ +--- +title: "Data analysis for fraud detection" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Data_analysis_for_fraud_detection" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:47.474137+00:00" +instance: "kb-cron" +--- + +Early data analysis techniques were oriented toward extracting quantitative and statistical data characteristics. These techniques facilitate useful data interpretations and can help to get better insights into the processes behind the data. Although the traditional data analysis techniques can indirectly lead us to knowledge, it is still created by human analysts. +To go beyond, a data analysis system has to be equipped with a substantial amount of background knowledge, and be able to perform reasoning tasks involving that knowledge and the data provided. 
In an effort to meet this goal, researchers have turned to ideas from the machine learning field. This is a natural source of ideas, since the machine learning task can be described as turning background knowledge and examples (input) into knowledge (output). +If data mining results in discovering meaningful patterns, data turns into information. Information or patterns that are novel, valid and potentially useful are not merely information, but knowledge. One speaks of discovering knowledge that was previously hidden in the huge amount of data but is now revealed. +The machine learning and artificial intelligence solutions may be classified into two categories: 'supervised' and 'unsupervised' learning. These methods look for accounts, customers, suppliers, etc. that behave 'unusually' in order to output suspicion scores, rules or visual anomalies, depending on the method. +Whether supervised or unsupervised methods are used, note that the output gives us only an indication of fraud likelihood. No standalone statistical analysis can guarantee that a particular object is fraudulent, but such analyses can identify fraudulent objects with very high degrees of accuracy. As a result, effective collaboration between machine learning models and human analysts is vital to the success of fraud detection applications. + +=== Supervised learning === + +In supervised learning, a random sub-sample of all records is taken and manually classified as either 'fraudulent' or 'non-fraudulent' (the task can be decomposed into more classes to meet algorithm requirements). Relatively rare events such as fraud may need to be oversampled to get a large enough sample size. These manually classified records are then used to train a supervised machine learning algorithm. After building a model using this training data, the algorithm should be able to classify new records as either fraudulent or non-fraudulent. 
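The supervised workflow described above (label a sample, oversample the rare fraud class, train, then classify new records) can be sketched as follows. This is a minimal illustration, not a method taken from the cited literature: the k-nearest-neighbour classifier, the two features (amount in thousands, transactions in the last hour) and the toy records are all assumptions chosen for brevity.

```python
import random
from collections import Counter


def oversample(records, labels, minority, seed=0):
    """Randomly duplicate minority-class records until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    # Draw (with replacement) as many extra minority records as are missing.
    extra = [rng.choice(minority_idx) for _ in range(majority_count - len(minority_idx))]
    idx = list(range(len(labels))) + extra
    return [records[i] for i in idx], [labels[i] for i in idx]


def knn_predict(train_x, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training records (squared Euclidean distance)."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical manually labelled sample: (amount in thousands, transactions in the last hour).
records = [(0.02, 1), (0.05, 1), (0.03, 2), (0.04, 1), (0.06, 2), (0.05, 1), (2.5, 9), (3.0, 8)]
labels = ["non-fraudulent"] * 6 + ["fraudulent"] * 2

balanced_x, balanced_y = oversample(records, labels, minority="fraudulent")
print(knn_predict(balanced_x, balanced_y, (2.8, 9)))
print(knn_predict(balanced_x, balanced_y, (0.04, 1)))
```

In practice the features would be scaled and the model evaluated on held-out records; as the text notes, the output is only an indication of fraud likelihood and still needs review by a human analyst.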
+Supervised neural networks, fuzzy neural nets, and combinations of neural nets and rules have been extensively explored and used for detecting fraud in mobile phone networks and financial statement fraud. +Bayesian learning neural networks have been implemented for credit card fraud detection, telecommunications fraud, auto claim fraud detection, and medical insurance fraud. +Hybrid knowledge/statistical-based systems, where expert knowledge is integrated with statistical power, use a series of data mining techniques for the purpose of detecting cellular clone fraud. Specifically, a rule-learning program is implemented to uncover indicators of fraudulent behaviour from a large database of customer transactions. +Cahill et al. (2000) design a fraud signature, based on data of fraudulent calls, to detect telecommunications fraud. To score a call for fraud, its probability under the account signature is compared to its probability under a fraud signature. The fraud signature is updated sequentially, enabling event-driven fraud detection. +Link analysis takes a different approach. It relates known fraudsters to other individuals, using record linkage and social network methods. +This type of detection is only able to detect frauds similar to those which have occurred previously and been classified by a human. Detecting a novel type of fraud may require the use of an unsupervised machine learning algorithm. + +=== Unsupervised learning === + +In contrast, unsupervised methods don't make use of labelled records. +Bolton and Hand use Peer Group Analysis and Break Point Analysis applied to spending behaviour in credit card accounts. Peer Group Analysis detects individual objects that begin to behave in a way different from objects to which they had previously been similar. Another tool Bolton and Hand develop for behavioural fraud detection is Break Point Analysis. Unlike Peer Group Analysis, Break Point Analysis operates on the account level. 
A break point is an observation where anomalous behaviour for a particular account is detected. Both the tools are applied on spending behaviour in credit card accounts. +A combination of unsupervised and supervised methods for credit card fraud detection is in Carcillo et al (2019). + +== Geolocation == + +== Available datasets == +A major limitation for the validation of existing fraud detection methods is the lack of public datasets. One of the few examples is the Credit Card Fraud Detection dataset made available by the ULB Machine Learning Group. + +== See also == + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_exploration-0.md b/data/en.wikipedia.org/wiki/Data_exploration-0.md new file mode 100644 index 000000000..3a025f6ca --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_exploration-0.md @@ -0,0 +1,41 @@ +--- +title: "Data exploration" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Data_exploration" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:48.638463+00:00" +instance: "kb-cron" +--- + +Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data. +Data exploration is typically conducted using a combination of automated and manual activities. Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics. +This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through the automated actions. 
Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data. +All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. +Once this initial understanding of the data is established, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining data quality. +Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data, and it does not require formulating assumptions beforehand. +Traditionally, this had been a key area of focus for statisticians, with John Tukey being a key evangelist in the field. Today, data exploration is more widespread and is the focus of data analysts and data scientists; the latter being a relatively new role within enterprises and larger organizations. + + +== Interactive Data Exploration == +This area of data exploration has become an area of interest in the field of machine learning. This is a relatively new field and is still evolving. At its most basic level, a machine-learning algorithm can be fed a data set and can be used to identify whether a hypothesis is true based on the dataset. Common machine learning algorithms can focus on identifying specific patterns in the data. Common patterns include regression and classification or clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning. 
+By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques. + + +== Software == +Trifacta – a data preparation and analysis platform +Paxata – self-service data preparation software +Alteryx – data blending and advanced data analytics software +Microsoft Power BI - interactive visualization and data analysis tool +OpenRefine - a standalone open source desktop application for data clean-up and data transformation +Tableau software – interactive data visualization software + + +== See also == +Exploratory data analysis +Machine learning +Data profiling +Data visualization + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_farming-0.md b/data/en.wikipedia.org/wiki/Data_farming-0.md index dd47401d0..1eac52780 100644 --- a/data/en.wikipedia.org/wiki/Data_farming-0.md +++ b/data/en.wikipedia.org/wiki/Data_farming-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Data_farming" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T09:49:53.929022+00:00" +date_saved: "2026-05-05T09:53:49.876847+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Data_fusion-0.md b/data/en.wikipedia.org/wiki/Data_fusion-0.md new file mode 100644 index 000000000..4ce06ed45 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_fusion-0.md @@ -0,0 +1,84 @@ +--- +title: "Data fusion" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Data_fusion" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:51.162435+00:00" +instance: "kb-cron" +--- + +Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source. 
+Data fusion processes are often categorized as low, intermediate, or high, depending on the processing stage at which fusion takes place. Low-level data fusion combines several sources of raw data to produce new raw data. The expectation is that fused data is more informative and synthetic than the original inputs. +For example, sensor fusion is also known as (multi-sensor) data fusion and is a subset of information fusion. +The concept of data fusion has origins in the evolved capacity of humans and animals to incorporate information from multiple senses to improve their ability to survive. For example, a combination of sight, touch, smell, and taste may indicate whether a substance is edible. + +== The JDL/DFIG model == + +In the mid-1980s, the Joint Directors of Laboratories formed the Data Fusion Subpanel (which later became known as the Data Fusion Group). With the advent of the World Wide Web, data fusion thus included data, sensor, and information fusion. The JDL/DFIG introduced a model of data fusion that divided the various processes. Currently, the six levels with the Data Fusion Information Group (DFIG) model are: + +Level 0: Source Preprocessing (or Data Assessment) +Level 1: Object Assessment +Level 2: Situation Assessment +Level 3: Impact Assessment (or Threat Refinement) +Level 4: Process Refinement (or Resource Management) +Level 5: User Refinement (or Cognitive Refinement) +Level 6: Mission Refinement (or Mission Management) +Although the JDL Model (Level 1–4) is still in use today, it is often criticized for its implication that the levels necessarily happen in order and also for its lack of adequate representation of the potential for a human-in-the-loop. The DFIG model (Level 0–5) explored the implications of situation awareness, user refinement, and mission management. 
Despite these shortcomings, the JDL/DFIG models are useful for visualizing the data fusion process, facilitating discussion and common understanding, and important for systems-level information fusion design. + +== Geospatial applications == +In the geospatial (GIS) domain, data fusion is often synonymous with data integration. In these applications, there is often a need to combine diverse data sets into a unified (fused) data set which includes all of the data points and time steps from the input data sets. The fused data set is different from a simple combined superset in that the points in the fused data set contain attributes and metadata which might not have been included for these points in the original data set. +A simplified example of this process is shown below where data set "α" is fused with data set β to form the fused data set δ. Data points in set "α" have spatial coordinates X and Y and attributes A1 and A2. Data points in set β have spatial coordinates X and Y and attributes B1 and B2. The fused data set contains all points and attributes. + +In a simple case where all attributes are uniform across the entire analysis domain, the attributes may be simply assigned: M?, N?, Q?, R? to M, N, Q, R. In a real application, attributes are not uniform and some type of interpolation is usually required to properly assign attributes to the data points in the fused set. + +In a much more complicated application, marine animal researchers use data fusion to combine animal tracking data with bathymetric, meteorological, sea surface temperature (SST) and animal habitat data to examine and understand habitat utilization and animal behavior in reaction to external forces such as weather or water temperature. Each of these data sets exhibit a different spatial grid and sampling rate so a simple combination would likely create erroneous assumptions and taint the results of the analysis. 
But through the use of data fusion, all data and attributes are brought together into a single view in which a more complete picture of the environment is created. This enables scientists to identify key locations and times and form new insights into the interactions between the environment and animal behaviors. +In the figure at right, rock lobsters are studied off the coast of Tasmania. Hugh Pederson of the University of Tasmania used data fusion software to fuse southern rock lobster tracking data (color-coded in yellow and black for day and night, respectively) with bathymetry and habitat data to create a unique 4D picture of rock lobster behavior. + +== Data integration == + +In applications outside of the geospatial domain, the terms data integration and data fusion are used differently. In areas such as business intelligence, for example, data integration is used to describe the combining of data, whereas data fusion is integration followed by reduction or replacement. Data integration might be viewed as set combination wherein the larger set is retained, whereas fusion is a set reduction technique with improved confidence. + +=== Application areas === + +== From multiple traffic sensing modalities == +The data from the different sensing technologies can be combined in intelligent ways to determine the traffic state accurately. A data-fusion-based approach that uses acoustic, image and sensor data collected at the roadside has been shown to combine the advantages of the different individual methods. + +== Decision fusion == +In many cases, geographically dispersed sensors are severely energy- and bandwidth-limited. Therefore, the raw data concerning a certain phenomenon are often summarized in a few bits from each sensor. 
When inferring on a binary event (i.e., $\mathcal{H}_0$ or $\mathcal{H}_1$), in the extreme case only binary decisions are sent from sensors to a Decision Fusion Center (DFC) and combined in order to obtain improved classification performance. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_fusion-1.md b/data/en.wikipedia.org/wiki/Data_fusion-1.md new file mode 100644 index 000000000..5a62d02fc --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_fusion-1.md @@ -0,0 +1,49 @@ +--- +title: "Data fusion" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Data_fusion" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:51.162435+00:00" +instance: "kb-cron" +--- + +== For enhanced contextual awareness == +With a multitude of built-in sensors including motion sensor, environmental sensor, position sensor, a modern mobile device typically gives mobile applications access to a number of sensory data which could be leveraged to enhance the contextual awareness. Using signal processing and data fusion techniques such as feature generation, feasibility study and principal component analysis (PCA) such sensory data will greatly improve the positive rate of classifying the motion and contextual relevant status of the device. Many context-enhanced information techniques are provided by Snidaro, et al. + +== Statistical methods == + +=== Bayesian auto-regressive Gaussian processes === + +Gaussian processes are a popular machine learning model. If an auto-regressive relationship between the data is assumed, and each data source is assumed to be a Gaussian process, this constitutes a non-linear Bayesian regression problem. + +=== Semiparametric estimation === +Many data fusion methods assume common conditional distributions across several data sources. 
Recently, methods have been developed to enable efficient estimation within the resulting semiparametric model. + +== See also == +Data assimilation +Data munging +Image fusion +Information integration +Integrative level +Meta-analysis +Sensor fusion + +== References == + +=== Sources === +General references +Hall, Dave L.; Llinas, James (1997). "Introduction to Multisensor Data Fusion". Proceedings of the IEEE. 85 (1): 6–23. doi:10.1109/5.554205. +Blasch, Erik; Kadar, Ivan; Salerno, John; Kokar, Mieczyslaw M.; Das, Subrata; Powell, Gerald M.; Corkill, Daniel D.; Ruspini, Enrique H. (2006). "Issues and Challenges in Situation Assessment (Level 2 Fusion)" (PDF). Journal of Advances in Information Fusion. 1 (2). Archived from the original (PDF) on 2015-05-27. + +== Bibliography == +Hall, David L.; McMullen, Sonya A. H. (2004). Mathematical Techniques in Multisensor Data Fusion, Second Edition. Norwood, MA: Artech House, Inc. ISBN 978-1-5805-3335-5. +Mitchell, H. B. (2007). Multi-sensor Data Fusion – An Introduction. Berlin: Springer-Verlag. ISBN 978-3-540-71463-7. +Das, S. (2008). High-Level Data Fusion. Norwood, MA: Artech House Publishers. ISBN 978-1-59693-281-4. + +== External links == + +Discriminant Correlation Analysis (DCA) +Sensordata Fusion, An Introduction +International Society of Information Fusion +Sensor Fusion for Nanopositioning \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_profiling-0.md b/data/en.wikipedia.org/wiki/Data_profiling-0.md new file mode 100644 index 000000000..25fd17470 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_profiling-0.md @@ -0,0 +1,63 @@ +--- +title: "Data profiling" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Data_profiling" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:52.402806+00:00" +instance: "kb-cron" +--- + +Data profiling is the process of examining the data available from an existing information source (e.g. 
a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to: + +Find out whether existing data can be easily used for other purposes +Improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category +Assess data quality, including whether the data conforms to particular standards or patterns +Assess the risk involved in integrating data in new applications, including the challenges of joins +Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies +Assess whether known metadata accurately describes the actual values in the source database +Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns. +Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality. + + +== Introduction == +Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data. Profiling helps to not only understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata. The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design. 
+ + +== How data profiling is conducted == +Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum, and additional metadata information obtained during data profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates. +Different analyses are performed for different structural levels. E.g. single columns could be profiled individually to get an understanding of frequency distribution of different values, type, and use of each column. Embedded value dependencies can be exposed in a cross-columns analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis. +Normally, purpose-built tools are used for data profiling to ease the process. The computational complexity increases when going from single column, to single table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools. + + +== When is data profiling conducted? == +According to Kimball, data profiling is performed several times and with varying intensity throughout the data warehouse developing process. A light profiling assessment should be undertaken immediately after candidate source systems have been identified and DW/BI business requirements have been satisfied. The purpose of this initial analysis is to clarify at an early stage if the correct data is available at the appropriate detail level and that anomalies can be handled subsequently. If this is not the case the project may be terminated. 
+Additionally, more in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set. +Additionally, data profiling may be conducted in the data warehouse development process after data has been loaded into staging, the data marts, etc. Conducting data profiling at these stages helps ensure that data cleaning and transformations have been done correctly and in compliance with requirements. + + +=== Benefits and examples === +Data profiling produces concise summaries of datasets that reveal structure and quality issues. Typical outputs include statistics and dependency findings that help analysts understand data, accelerate integration/migration, and improve data quality programs. + +Common profiling statistics and checks +Completeness (e.g., percentage of NULL/blank values). +Distinctness/uniqueness (number of distinct values; candidate keys). +Value distributions (frequencies, histograms, ranges, outliers). +Pattern/format frequencies (e.g., regex or mask patterns for codes, emails, addresses). +Structural dependencies such as keys, functional dependencies, and inclusion/foreign-key dependencies used for integration and schema matching. +How results are used +Profiling provides baselines and evidence for data quality dimensions (such as completeness, accuracy, consistency, timeliness) and supplies metrics for remediation; many data governance frameworks treat profiling as an input to ongoing data quality management. +Example. For a column ``customer_email``, a profile may report 98.2% non-NULL values, 97.9% values matching an email pattern, top domains by frequency, and no candidate key; for a shipment table, a functional dependency such as ``order_id → customer_id`` may be discovered to support integration. 
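The checks listed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the behaviour of any particular profiling tool: the function names, the deliberately loose email pattern, and the sample rows are hypothetical.

```python
import re
from collections import Counter

# Deliberately loose pattern, for illustration only (not a full email validator).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values, pattern=None):
    """Compute completeness, distinctness, candidate-key status and top values for one column."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    profile = {
        "completeness": len(non_null) / total if total else 0.0,
        "distinct": len(set(non_null)),
        # A candidate key has no NULLs and no repeated values.
        "is_candidate_key": total > 0 and len(non_null) == total and len(set(non_null)) == total,
        "top_values": Counter(non_null).most_common(3),
    }
    if pattern is not None:
        matches = sum(1 for v in non_null if pattern.match(v))
        profile["pattern_match_rate"] = matches / total if total else 0.0
    return profile


def holds_functional_dependency(rows, lhs, rhs):
    """Return True if every value of column lhs maps to exactly one value of column rhs."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data.
emails = ["a@example.com", "b@example.com", None, "not-an-email", "a@example.com"]
shipments = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C2"},
]

print(profile_column(emails, EMAIL_PATTERN))
print(holds_functional_dependency(shipments, "order_id", "customer_id"))
```

A single pass per column is enough for these statistics; discovering all functional dependencies across a table is considerably more expensive, which is one reason the text treats performance as an evaluation criterion for profiling tools.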
+ + +== See also == +Data quality +Data governance +Master data management +Database normalization +Data visualization +Analysis paralysis +Data analysis + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_science-0.md b/data/en.wikipedia.org/wiki/Data_science-0.md new file mode 100644 index 000000000..beeeb41fa --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_science-0.md @@ -0,0 +1,37 @@ +--- +title: "Data science" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Data_science" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:53.540719+00:00" +instance: "kb-cron" +--- + +Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms, and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data. +Data science plays a critical role in modern decision-making by enabling organizations to extract actionable insights from large and complex datasets. +Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession. +Data science is "a concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is distinct from computer science and information science. 
Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. +Data science is often described as a multidisciplinary and rapidly evolving field because it draws on techniques from diverse areas, such as computer science, statistics, information science, and other subject-specific disciplines. Some researchers argue that the combination of the different fields resembles how information science was decades ago (Mayernik, 2023). These similarities help clarify how data science gradually became its own field of study. +A data scientist is a professional who creates programming code and combines it with statistical knowledge to summarize data. + +== Foundations == +Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets and applying the knowledge from that data to solve problems in other application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, and summarizing these findings. As such, it incorporates skills from computer science, mathematics, data visualization, graphic design, communication, and business. +Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g., from images, text, sensors, transactions, customer information, etc.) and emphasizes prediction and action. Andrew Gelman of Columbia University has described statistics as a non-essential part of data science. Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. 
He describes data science as an applied field growing out of traditional statistics. + +== Etymology == + +=== Early usage === +In 1962, John Tukey described a field he called "data analysis", which resembles modern data science. In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing. +The term "data science" has been traced back to 1974, when Peter Naur, in his book Concise Survey of Computer Methods, proposed it as an alternative name for computer science to reflect the growing emphasis on data-driven methods. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis. + +=== Modern usage === +In 2012, technologists Thomas H. Davenport and DJ Patil declared "Data Scientist: The Sexiest Job of the 21st Century", a catchphrase that was picked up even by major-city newspapers like the New York Times and the Boston Globe. A decade later, they reaffirmed it, stating that "the job is more in demand than ever with employers". 
+The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science. +Over the last few years, many colleges have begun to create more structured undergraduate programs in data science. According to a report by the National Academies, strong programs typically include training in statistics, computing, ethics, and communication, as well as hands-on work in a specific field (National Academies of Sciences, Engineering, and Medicine, 2018). As schools try to prepare students for jobs that use data, these practices become more common. +The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in 2008. Though it was used by the National Science Board in their 2005 report "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", it referred broadly to any key role in managing a digital data collection. + +== Data science and data analysis == + +In data science, data analysis is the process of inspecting, cleaning, transforming, and modelling data to discover useful information, draw conclusions, and support decision-making. It includes exploratory data analysis (EDA), which uses graphics and descriptive statistics to explore patterns and generate hypotheses, and confirmatory data analysis, which applies statistical inference to test hypotheses and quantify uncertainty. 
+Typical activities comprise: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_science-1.md b/data/en.wikipedia.org/wiki/Data_science-1.md new file mode 100644 index 000000000..2ca855ab0 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_science-1.md @@ -0,0 +1,47 @@ +--- +title: "Data science" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Data_science" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:53.540719+00:00" +instance: "kb-cron" +--- + +data collection and integration; +data cleaning and preparation (handling missing values, outliers, encoding, normalisation); +feature engineering and selection; +visualisation and descriptive statistics; +fitting and evaluating statistical or machine-learning models; +communicating results and ensuring reproducibility (e.g., reports, notebooks, and dashboards). +Lifecycle frameworks such as CRISP-DM describe these steps from business understanding through deployment and monitoring. +Data science involves working with larger datasets that often require advanced computational and statistical methods to analyze. Data scientists often work with unstructured data such as text or images and use machine learning algorithms to build predictive models. Data science often uses statistical analysis, data preprocessing, and supervised learning. +Recent studies indicate that AI is moving towards data-centric approaches, focusing on the quality of datasets rather than just improving AI models. This trend focuses on improving system performance by cleaning, refining, and labeling data (Bhatt et al., 2024). As AI systems grow larger, the data-centric view has become increasingly important. + +== Cloud computing for data science == + +Cloud computing can offer access to large amounts of computational power and storage. 
In big data, where volumes of information are continually generated and processed, these platforms can be used to handle complex and resource-intensive analytical tasks. +Some distributed computing frameworks are designed to handle big data workloads. These frameworks can enable data scientists to process and analyze large datasets in parallel, which can reduce processing times. + +== Ethical consideration in data science == +Data science involves collecting, processing, and analyzing data which often includes personal and sensitive information. Ethical concerns include potential privacy violations, bias perpetuation, and negative societal impacts. +Ethics education in data science has grown to encompass both technical principles and more expansive philosophical questions. Research indicates that data science ethics courses are increasingly integrating human-centric topics, including fairness, accountability, and responsible decision-making, thereby connecting them to enduring discussions in moral and political philosophy (Colando & Hardin, 2024). The goal of this method is to help students understand how data-driven technologies affect society. +Machine learning models can amplify existing biases present in training data, leading to discriminatory or unfair outcomes. Another area of data science that is growing is the push for better ways to cite data. Citing datasets makes it easier for other researchers to understand what data was used and for studies to be repeated (Lafia et al., 2023). These practices give the people who collect and manage data the credit they deserve, which is becoming more important in modern research. 
+ +== See also == + +Python (programming language) +R (programming language) +Data engineering +Big data +Machine learning +Artificial intelligence +Bioinformatics +Astroinformatics +Topological data analysis +List of data science journals +List of data science software +List of open-source data science software +Data science notebook software + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Data_version_control-0.md b/data/en.wikipedia.org/wiki/Data_version_control-0.md new file mode 100644 index 000000000..aefc809a8 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Data_version_control-0.md @@ -0,0 +1,85 @@ +--- +title: "Data version control" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Data_version_control" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:54.761385+00:00" +instance: "kb-cron" +--- + +Data version control is a method of working with data sets. It is similar to the version control systems used in traditional software development, but is optimized to allow better processing of data and collaboration in the context of data analytics, research, and any other form of data analysis. Data version control may also include specific features and configurations designed to facilitate work with large data sets and data lakes. + + +== History == + + +=== Background === +As early as 1985, researchers recognized the need for defining timing attributes in database tables, which would be necessary for tracking changes to databases. This research continued into the 1990s, and the theory was formalized into practical methods for managing data in relational databases, providing some of the foundational concepts for what would later become data version control. +In the early 2010s the size of data sets was rapidly expanding, and relational databases were no longer sufficient to manage the amounts of data organizations were accumulating. 
With the rise of the Apache Hadoop ecosystem, HDFS as a storage layer, and later object storage, became dominant in big data operations. Research into data management tools and data version control systems increased sharply, along with demand for such tools from both academia and the private and public sectors. + + +=== Version controlled databases === +The first versioned database was proposed in 2012 for the SciDB database, and demonstrated that it was possible to create chains and trees of different versions of the database while decreasing both the overall storage size and access speeds associated with previous methods. In 2014, a proposal was made to generalize these principles into a platform that could be used for any application. +In 2016, a prototype for a data version control system was developed during a Kaggle competition. This software was later used internally at an AI firm, and eventually spun off as a startup. Since then, a number of data version control systems, both open and closed source, have been developed and offered commercially, with a subset dedicated specifically to machine learning. + + +== Use cases == + + +=== Reproducibility === +A wide range of scientific disciplines have adopted automated analysis of large quantities of data, including astrophysics, seismology, biology and medicine, social sciences and economics, and many other fields. The principle of reproducibility is an important aspect of formalizing findings in scientific disciplines, and in the context of data science presents a number of challenges. Most datasets are constantly changing, whether due to the addition of more data or changes in the structure and format of the data, and small changes can have significant effects on the outcome of experiments. Data version control allows for recording the exact state of data sets at a particular moment of time, making it easier to reproduce and understand experimental outcomes.
If data practitioners can only know the present state of the data, they may run into a number of challenges such as difficulties in problem debugging or complying with data audits. + + +=== Development and testing === +Data version control is sometimes used in testing and development of applications that interact with large quantities of data. Some data version control tools allow users to create replicas of their production environment for testing purposes. This approach allows them to test data integration processes such as extract, transform and load (ETL) and understand the changes made to data without having a negative impact on the consumers of the production data. + + +=== Machine learning and artificial intelligence === +In the context of machine learning, data version control can be used for optimizing the performance of models. It can allow automating the process of analyzing outcomes with different versions of a data set to continuously improve performance. It is possible that open source data version control software could eliminate the need for proprietary AI platforms by extending tools like Git and CI/CD for use by machine learning engineers. Many open-source solutions build on Git-like semantics to provide these capabilities, as Git itself was designed for small text files and doesn't support typical machine learning datasets, which are very large. + + +=== CI/CD for data === +CI/CD methodologies can be applied to datasets using data version control. Version control enables users to integrate with automation servers that allow establishing a CI/CD process for data. By adding testing platforms to the process, they can guarantee high quality of the data product. In this scenario, teams execute Continuous Integration (CI) tests on data and put checks in place to ensure the data is promoted to production only when all the set data quality and data governance criteria are met. 
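The promotion rule described in the CI/CD paragraph can be illustrated with a minimal sketch. Everything here is hypothetical: the check functions, the `id` and `amount` field names, and the `promote_if_clean` helper are invented for illustration and do not correspond to the API of any particular data version control tool.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


def no_missing_ids(rows: List[Row]) -> bool:
    """Every record must carry an identifier."""
    return all(row.get("id") is not None for row in rows)


def unique_ids(rows: List[Row]) -> bool:
    """Identifiers must not repeat across records."""
    ids = [row["id"] for row in rows if row.get("id") is not None]
    return len(ids) == len(set(ids))


def non_negative_amounts(rows: List[Row]) -> bool:
    """A sample business rule (assumed): amounts may not be negative."""
    return all(row.get("amount", 0) >= 0 for row in rows)


# Hypothetical set of quality criteria applied before promotion.
CHECKS: List[Check] = [no_missing_ids, unique_ids, non_negative_amounts]


def promote_if_clean(rows: List[Row], checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; allow promotion only when none of them fails."""
    failures = [check.__name__ for check in checks if not check(rows)]
    return (not failures, failures)
```

In a real pipeline the same gate would typically run against an isolated version of the data, with the merge into production performed only when the returned flag is true.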
+ + +=== Experimentation in isolated environments === +To experiment on a dataset without impacting production data, one can use data version control to create replicas of the production environment where tests can be carried out. Such replicas allow testing and understanding of changes safely applied to data. +Data version control tools allow such environments to be replicated without time- and resource-consuming maintenance. Instead, such tools allow objects to be shared using metadata. + + +=== Rollback === +Continuous changes in data sets can sometimes cause functionality issues or lead to undesired outcomes, especially when applications are using the data. Data version control tools allow for the possibility to roll back a data set to an earlier state. This can be used to restore or improve functionality of an application or to correct errors or bad data which has been mistakenly included. + + +== Examples == +Version controlled data sources: + +Kaggle +Quilt +Dolt +Kamu +Data version control for data lakes: + +LakeFS +Project Nessie +Git-LFS +ML-Ops systems that implement data version control: + +DVC +Pachyderm +Neptune +activeloop +graviti +dagshub +alectio +Galileo +Voxel51 +dstack +dvid + + +== See also == + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Dataset_shift-0.md b/data/en.wikipedia.org/wiki/Dataset_shift-0.md new file mode 100644 index 000000000..e4ec85dd5 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Dataset_shift-0.md @@ -0,0 +1,61 @@ +--- +title: "Dataset shift" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Dataset_shift" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:59.453907+00:00" +instance: "kb-cron" +--- + +Dataset shift is a phenomenon in machine learning and statistics in which the joint distribution of input variables and target labels is different in the training phase and the deployment or test phase (i.e., $P_{train}(X,Y)\neq P_{test}(X,Y)$). This happens when the statistical properties of data used to train a model are no longer representative of the data encountered in real-world use, often resulting in degraded predictive performance and diminished generalization ability.
+Dataset shift is a generic term for a number of particular types of distributional change. Covariate shift is when the distribution of the input features changes, but the conditional relationship between inputs and outputs remains constant. Prior probability shift (or label shift) happens when the distribution of target labels changes, but the conditional distribution of inputs given labels stays the same. Concept shift (also known as concept drift) is the change of the conditional relationship between inputs and outputs that renders previously learned patterns invalid over time. +A key challenge for deploying machine learning systems is dataset shift, in particular in dynamic environments where the data distributions change over time. Detecting and mitigating such shifts is an active area of research, including drift detection, domain adaptation, and continual learning. 
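As a rough illustration of drift detection for a single input feature, the sketch below computes the two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical distribution functions) between a training sample and a deployment sample. The choice of this statistic is an assumption made for the example; the text above does not prescribe a particular detector, and the sample values are synthetic.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(train: Sequence[float], test: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs.

    Both empirical CDFs are step functions that only change at sample
    values, so it is enough to evaluate the gap at every observed value.
    """
    train_sorted = sorted(train)
    test_sorted = sorted(test)
    gap = 0.0
    for value in train_sorted + test_sorted:
        cdf_train = bisect_right(train_sorted, value) / len(train_sorted)
        cdf_test = bisect_right(test_sorted, value) / len(test_sorted)
        gap = max(gap, abs(cdf_train - cdf_test))
    return gap


# Synthetic feature values: the deployment sample is the training
# sample moved upwards by 50 units, a simple covariate shift.
train_feature = list(range(100))
shifted_feature = [value + 50 for value in train_feature]

print(ks_statistic(train_feature, train_feature))    # identical samples give no gap
print(ks_statistic(train_feature, shifted_feature))  # shifted sample gives a large gap
```

A statistic near zero suggests the feature distribution is unchanged, while a large value flags a possible covariate shift. A change in one feature's marginal distribution says nothing by itself about prior probability shift or concept shift, which involve the labels.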
+ + +== See also == +Concept drift +Domain adaptation +Overfitting +Statistical classification + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Degree-Rips_bifiltration-0.md b/data/en.wikipedia.org/wiki/Degree-Rips_bifiltration-0.md new file mode 100644 index 000000000..b30a497f2 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Degree-Rips_bifiltration-0.md @@ -0,0 +1,405 @@ +--- +title: "Degree-Rips bifiltration" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Degree-Rips_bifiltration" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:00.644610+00:00" +instance: "kb-cron" +--- + +The degree-Rips bifiltration is a simplicial filtration used in topological data analysis for analyzing the shape of point cloud data. It is a multiparameter extension of the Vietoris–Rips filtration that possesses greater stability to data outliers than single-parameter filtrations, and which is more amenable to practical computation than other multiparameter constructions. Introduced in 2015 by Lesnick and Wright, the degree-Rips bifiltration is a parameter-free and density-sensitive vehicle for performing persistent homology computations on point cloud data. + + +== Definition == +It is standard practice in topological data analysis (TDA) to associate a sequence of nested simplicial complexes to a finite data set in order to detect the persistence of topological features over a range of scale parameters. One way to do this is by considering the sequence of Vietoris–Rips complexes of a finite set in a metric space indexed over all scale parameters. +If $X$ is a finite set in a metric space, then this construction is known as the Vietoris–Rips (or simply "Rips") filtration on $X$, commonly denoted $\text{Rips}(X)$ or $\mathcal{R}(X)$.
The Rips filtration can be expressed as a functor $\text{Rips}(X):\mathbb{R}\to\mathbf{Simp}$ from the real numbers (viewed as a poset category) to the category of simplicial complexes and simplicial maps, a subcategory of the category $\mathbf{Top}$ of topological spaces and continuous maps via the geometric realization functor. +The Rips filtration is indexed over a single parameter, but we can capture more information (e.g., density) about the underlying data set by considering multiparameter filtrations. A filtration indexed by the product of two totally-ordered sets is known as a bifiltration, first introduced by Gunnar Carlsson and Afra Zomorodian in 2009. +The degree-Rips bifiltration filters each simplicial complex in the Rips filtration by the degree of each vertex in the graph isomorphic to the 1-skeleton at each index. More formally, let $(a,b)$ be an element of $\mathbb{R}^{2}$ and define $G_{a,b}$ to be the subgraph of the 1-skeleton of $\text{Rips}(X)_{b}$ containing all vertices whose degree is at least $a$. Subsequently building the maximal simplicial complex possible on this 1-skeleton, we obtain a complex $\text{D-Rips}(X)_{a,b}$. By doing this for all possible vertex degrees, and across all scale parameters in the Rips filtration, we extend the Rips construction to a bifiltration $\{\text{D-Rips}(X)_{a,b}\}_{(a,b)\in\mathbb{R}^{2}}$. 
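The construction of $G_{a,b}$ and the complex built on it can be sketched for a small point cloud as follows. This is an illustrative sketch rather than a reference implementation: the function names are invented, the Rips 1-skeleton is taken to join points at distance at most $b$ (some conventions use $2b$), and simplices are only enumerated up to a chosen dimension by brute force.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Set, Tuple

Point = Sequence[float]
Edge = Tuple[int, int]


def degree_rips_graph(points: List[Point], a: float, b: float) -> Tuple[Set[int], Set[Edge]]:
    """Vertices and edges of G_{a,b} for a finite Euclidean point cloud."""
    n = len(points)
    # 1-skeleton of the Rips complex at scale b (assumed convention: distance <= b).
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= b}
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    # Keep only vertices of degree at least a, and the edges between them.
    kept = {i for i in range(n) if degree[i] >= a}
    kept_edges = {(i, j) for i, j in edges if i in kept and j in kept}
    return kept, kept_edges


def clique_complex(vertices: Set[int], edges: Set[Edge], max_dim: int = 2) -> List[Tuple[int, ...]]:
    """Maximal complex on the graph: every clique spans a simplex (up to max_dim)."""
    ordered = sorted(vertices)
    simplices: List[Tuple[int, ...]] = [(v,) for v in ordered]
    for size in range(2, max_dim + 2):
        for candidate in combinations(ordered, size):
            if all(pair in edges for pair in combinations(candidate, 2)):
                simplices.append(candidate)
    return simplices


# Three nearby points and one isolated outlier.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
vertices, edges = degree_rips_graph(cloud, a=1, b=1.5)
print(sorted(vertices), sorted(edges))
print(clique_complex(vertices, edges))
```

With the degree threshold set to 1, the isolated point is dropped while the three mutually close points still span a triangle, which is the sense in which the degree parameter filters out low-density regions.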
+Note that since the size of each complex will decrease as $a$ increases, we should identify the indexing set $\mathbb{R}^{2}$ with $\mathbb{R}^{\text{op}}\times\mathbb{R}$, where $\mathbb{R}^{\text{op}}$ is the opposite poset category of $\mathbb{R}$. Therefore the degree-Rips bifiltration can be viewed as a functor $\text{D-Rips}(X):\mathbb{R}^{\operatorname{op}}\times\mathbb{R}\to\mathbf{Simp}$. +The idea behind the degree-Rips bifiltration is that vertices of higher degree will correspond to higher density regions of the underlying data set. However, since degree-Rips does not depend on an arbitrary choice of a parameter (such as a pre-selected density parameter, which is a priori difficult to determine), it is a convenient tool for analyzing data. + + +== Applications to data analysis == +The degree-Rips bifiltration possesses several properties that make it a useful tool in data analysis. For example, each of its skeleta has polynomial size; the k-dimensional skeleton of $\text{D-Rips}(X)$ has $O(|X|^{k+2})$ simplices, where $O$ denotes an asymptotic upper bound. Moreover, it has been shown that the degree-Rips bifiltration possesses reasonably strong stability properties with respect to perturbations of the underlying data set. Further work has also been done examining the stable components and homotopy types of degree-Rips complexes. 
+The software RIVET was created in order to visualize several multiparameter invariants (i.e., data structures that attempt to capture underlying geometric information of the data) of 2-parameter persistence modules, including the persistent homology modules of the degree-Rips bifiltration. These invariants include the Hilbert function, rank invariant, and fibered barcode. +As a follow-up to the introduction of degree-Rips in their original 2015 paper, Lesnick and Wright showed in 2022 that a primary component of persistent homology computations (namely, computing minimal presentations and bigraded Betti numbers) can be achieved efficiently in a way that outperforms other persistent homology software. Methods of improving algorithmic efficiency of multiparameter persistent homology have also been explored that suggest the possibility of substantial speed increases for data analysis tools such as RIVET. +The degree-Rips bifiltration has been used for analyzing data clusters with respect to variations in density. There has been some preliminary experimental analysis of the performance of degree-Rips with respect to outliers in particular, but this is an ongoing area of research as of February 2023. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Directional-change_intrinsic_time-0.md b/data/en.wikipedia.org/wiki/Directional-change_intrinsic_time-0.md new file mode 100644 index 000000000..26b976bf5 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Directional-change_intrinsic_time-0.md @@ -0,0 +1,157 @@ +--- +title: "Directional-change intrinsic time" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Directional-change_intrinsic_time" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:03.000900+00:00" +instance: "kb-cron" +--- + +Directional-change intrinsic time is an event-based operator to dissect a data series into a sequence of alternating trends of defined size $\delta$. + +The directional-change intrinsic time operator was developed for the analysis of financial market data series. It is an alternative methodology to the concept of continuous time. Directional-change intrinsic time operator dissects a data series into a set of drawups and drawdowns or up and down trends that alternate with each other. An established trend comes to an end as soon as a trend reversal is observed. A price move that extends a trend is called overshoot and leads to new price extremes. +Figure 1 provides an example of a price curve dissected by the directional change intrinsic time operator. +The frequency of directional-change intrinsic events maps (1) the volatility of price changes conditional to (2) the selected threshold $\delta$. The stochastic nature of the underlying process is mirrored in the non-equal number of intrinsic events observed over equal periods of physical time. +Directional-change intrinsic time operator is a noise filtering technique.
It identifies regime shifts, when trend changes of a particular size occur, and hides price fluctuations that are smaller than the threshold $\delta$. + + +== Application == +The directional-change intrinsic time operator was used to analyze high frequency foreign exchange market data and has led to the discovery of a large set of scaling laws that have not been previously observed. The scaling laws identify properties of the underlying data series, such as the size of the expected price overshoot after an intrinsic time event or the number of expected directional-changes within a physical time interval or price threshold. For example, one scaling law relates the expected number of directional-changes $N(\delta)$ observed over the fixed period to the size of the threshold $\delta$: + +$$N(\delta)=\left(\frac{\delta}{C_{N,DC}}\right)^{E_{N,DC}},$$ + +where $C_{N,DC}$ and $E_{N,DC}$ are the scaling law coefficients. +Other applications of the directional-change intrinsic time in finance include: + +a trading strategy characterised by an annual Sharpe ratio of 3.04 +tools designed to monitor liquidity at multiple trend scales. +The methodology can also be used for applications beyond economics and finance. It can be applied to other scientific domains and opens a new avenue of research in the area of big data. + + +== References == + Text in this draft was copied from Petrov, Vladimir; Golub, Anton; Olsen, Richard (2019). "Instantaneous Volatility Seasonality of High-Frequency Markets in Directional-Change Intrinsic Time". Journal of Risk and Financial Management. 12 (2): 54. doi:10.3390/jrfm12020054.
hdl:10419/239003., which is available under a Creative Commons Attribution 4.0 International License. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Directional_component_analysis-0.md b/data/en.wikipedia.org/wiki/Directional_component_analysis-0.md new file mode 100644 index 000000000..79b828cc8 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Directional_component_analysis-0.md @@ -0,0 +1,147 @@ +--- +title: "Directional component analysis" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Directional_component_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:01.826255+00:00" +instance: "kb-cron" +--- + +Directional component analysis (DCA) is a statistical method used in climate science for identifying representative patterns of variability in space-time data-sets such as historical climate observations, weather prediction ensembles or climate ensembles. +The first DCA pattern is a pattern of weather or climate variability that is both likely to occur (measured using likelihood) and has a large impact (for a specified linear impact function, and given certain mathematical conditions: see below). +The first DCA pattern contrasts with the first PCA pattern, which is likely to occur, but may not have a large impact, and with a pattern derived from the gradient of the impact function, which has a large impact, but may not be likely to occur. +DCA differs from other pattern identification methods used in climate research, such as EOFs, rotated EOFs and extended EOFs in that it takes into account an external vector, the gradient of the impact. +DCA provides a way to reduce large ensembles from weather forecasts or climate models to just two patterns. +The first pattern is the ensemble mean, and the second pattern is the DCA pattern, which represents variability around the ensemble mean in a way that takes impact into account. 
+DCA contrasts with other methods that have been proposed for the reduction of ensembles in that it takes impact into account in addition to the structure of the ensemble. + +== Overview == + +=== Inputs === +DCA is calculated from two inputs: + +a multivariate dataset of weather or climate data, such as historical climate observations, or a weather or climate ensemble +a linear impact function. The linear impact function is a function which defines a level of impact for every spatial pattern in the weather or climate data as a weighted sum of the values at different locations in the spatial pattern. An example is the mean value across the spatial pattern. The linear impact function can be generated as the first term in the multivariate Taylor series of a non-linear impact function. + +=== Formula === +Consider a space-time data set $X$, containing individual spatial pattern vectors $x$, where the individual patterns are each considered as single samples from a multivariate normal distribution with mean zero and covariance matrix $C$. +We define a linear impact function of a spatial pattern as $r^{t}x$, where $r$ is a vector of spatial weights. +The first DCA pattern is given in terms of the covariance matrix $C$ and the weights $r$ by the proportional expression + +$$x\propto Cr.$$ + +The pattern can then be normalized to any length as required. 
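The formula above can be evaluated directly for a small case. The sketch below uses plain Python lists; the covariance matrix and weight vector are made-up values for two locations (not taken from any dataset), and the pattern is scaled to unit length, which is one of the normalisations the text allows.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(matrix: Matrix, vector: Vector) -> Vector:
    """Matrix-vector product using plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def first_dca_pattern(covariance: Matrix, weights: Vector) -> Vector:
    """First DCA pattern: proportional to C r, scaled here to unit length."""
    direction = mat_vec(covariance, weights)
    length = sqrt(sum(value * value for value in direction))
    return [value / length for value in direction]


# Illustrative inputs (assumed): anomalies at two locations that are
# negatively correlated, with impact defined as their sum.
C = [[2.0, -0.5],
     [-0.5, 1.0]]
r = [1.0, 1.0]

pattern = first_dca_pattern(C, r)
print(pattern)
```

For these inputs $Cr = (1.5, 0.5)$, so the unit-length pattern points three times as far along the first location as along the second.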
+ +=== Properties === +If the weather or climate data is elliptically distributed (e.g., is distributed as a multivariate normal distribution or a multivariate t-distribution) then the first DCA pattern (DCA1) is defined as the spatial pattern with the following mathematical properties: + +DCA1 maximises probability density for a given value of impact +DCA1 maximises impact for a given value of probability density +DCA1 maximises the product of impact and probability density +DCA1 is the conditional expectation, conditional on exceeding a certain level of impact +DCA1 is the impact-weighted ensemble mean +Any modification of DCA1 will lead to a pattern that is either less extreme, or has a lower probability density. + +=== Rainfall Example === +For instance, in a rainfall anomaly dataset, using an impact metric defined as the total rainfall anomaly, the first DCA pattern is the spatial pattern that has the highest probability density for a given total rainfall anomaly. If the given total rainfall anomaly is chosen to have a large value, then this pattern combines being extreme in terms of the metric (i.e., representing large amounts of total rainfall) with being likely in terms of the pattern, and so is well suited as a representative extreme pattern. 
+ +=== Comparison with PCA === +The main differences between Principal component analysis (PCA) and DCA are + +PCA is a function of just the covariance matrix, and the first PCA pattern is defined so as to maximise explained variance +DCA is a function of the covariance matrix and a vector direction (the gradient of the impact function), and the first DCA pattern is defined so as to maximise probability density for a given value of the impact metric +As a result, for unit vector spatial patterns: + +The first PCA spatial pattern always corresponds to a higher explained variance, but has a lower value of the impact metric (e.g., the total rainfall anomaly), except in degenerate cases +The first DCA spatial pattern always corresponds to a higher value of the impact metric, but has a lower value of the explained variance, except in degenerate cases +The degenerate cases occur when the PCA and DCA patterns are equal. +Also, given the first PCA pattern, the DCA pattern can be scaled so that: + +The scaled DCA pattern has the same probability density as the first PCA pattern, but higher impact, or +The scaled DCA pattern has the same impact as the first PCA pattern, but higher probability density. 
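The trade-off listed above can be checked numerically for a two-location case. In this sketch the covariance matrix and weights are invented for illustration, the first PCA pattern is approximated with a basic power iteration (an implementation choice for the example, not something the text specifies), and both patterns are unit vectors.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def unit(vector: Vector) -> Vector:
    length = sqrt(sum(v * v for v in vector))
    return [v / length for v in vector]


def first_pca_pattern(covariance: Matrix, steps: int = 200) -> Vector:
    """Leading eigenvector via power iteration, started from the first axis.

    The start vector must not be orthogonal to the leading eigenvector.
    """
    x = [1.0] + [0.0] * (len(covariance) - 1)
    for _ in range(steps):
        x = unit(mat_vec(covariance, x))
    return x


def first_dca_pattern(covariance: Matrix, weights: Vector) -> Vector:
    return unit(mat_vec(covariance, weights))


def explained_variance(covariance: Matrix, pattern: Vector) -> float:
    """x^t C x for a unit-vector pattern."""
    return sum(p * q for p, q in zip(pattern, mat_vec(covariance, pattern)))


def impact(weights: Vector, pattern: Vector) -> float:
    """Linear impact r^t x."""
    return sum(w * p for w, p in zip(weights, pattern))


C = [[2.0, -0.5],
     [-0.5, 1.0]]          # illustrative covariance (assumed)
r = [1.0, 1.0]             # impact = total anomaly

pca = first_pca_pattern(C)
if impact(r, pca) < 0:     # eigenvector sign is arbitrary; orient towards positive impact
    pca = [-v for v in pca]
dca = first_dca_pattern(C, r)

print(explained_variance(C, pca), impact(r, pca))
print(explained_variance(C, dca), impact(r, dca))
```

For this matrix the PCA pattern has the larger explained variance and the DCA pattern has the larger impact, matching the two bullet points for unit-vector patterns.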
+ +== Two Dimensional Example == +Source: + +Figure 1 gives an example, which can be understood as follows: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Directional_component_analysis-1.md b/data/en.wikipedia.org/wiki/Directional_component_analysis-1.md new file mode 100644 index 000000000..3dee015af --- /dev/null +++ b/data/en.wikipedia.org/wiki/Directional_component_analysis-1.md @@ -0,0 +1,276 @@ +--- +title: "Directional component analysis" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Directional_component_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:01.826255+00:00" +instance: "kb-cron" +--- + +The two axes represent anomalies of annual mean rainfall at two locations, with the highest total rainfall anomaly values towards the top right corner of the diagram +The joint variability of the rainfall anomalies at the two locations is assumed to follow a bivariate normal distribution +The ellipse shows a single contour of probability density from this bivariate normal, with higher values inside the ellipse +The red dot at the centre of the ellipse shows zero rainfall anomalies at both locations +The blue parallel-line arrow shows the principal axis of the ellipse, which is also the first PCA spatial pattern vector +In this case, the PCA pattern is scaled so that it touches the ellipse +The diagonal straight line shows a line of constant positive total rainfall anomaly, assumed to be at some fairly extreme level +The red dotted-line arrow shows the first DCA pattern, which points towards the point at which the diagonal line is tangent to the ellipse +In this case, the DCA pattern is scaled so that it touches the ellipse +From this diagram, the DCA pattern can be seen to possess the following properties: + +Of all the points on the diagonal line, it is the one with the highest probability density +Of all the points on the ellipse, it is the one with the highest total rainfall anomaly +It 
has the same probability density as the PCA pattern, but represents higher total rainfall (i.e., points further towards the top right hand corner of the diagram) +Any change of the DCA pattern will reduce either the probability density (if it moves out of the ellipse) or reduce the total rainfall anomaly (if it moves along or into the ellipse) +In this case the total rainfall anomaly of the PCA pattern is quite small, because of anticorrelations between the rainfall anomalies at the two locations. As a result, the first PCA pattern is not a good representative example of a pattern with large total rainfall anomaly, while the first DCA pattern is. +In $n$ dimensions the ellipse becomes an ellipsoid, the diagonal line becomes an $(n-1)$-dimensional plane, and the PCA and DCA patterns are vectors in $n$ dimensions. + +== Applications == + +=== Application to Climate Variability === +DCA has been applied to the CRU data-set of historical rainfall variability in order to understand the most likely patterns of rainfall extremes in the US and China. + +=== Application to Ensemble Weather Forecasts === +DCA has been applied to ECMWF medium-range weather forecast ensembles in order to identify the most likely patterns of extreme temperatures in the ensemble forecast. + +=== Application to Ensemble Climate Model Projections === +DCA has been applied to ensemble climate model projections in order to identify the most likely patterns of extreme future rainfall. + +== Derivation of the First DCA Pattern == +Source: +Consider a space-time data-set $X$, containing individual spatial pattern vectors $x$, where the individual patterns are each considered as single samples from a multivariate normal distribution with mean zero and covariance matrix $C$. 
+As a function of $x$, the log probability density is proportional to $-x^{t}C^{-1}x$. +We define a linear impact function of a spatial pattern as $r^{t}x$, where $r$ is a vector of spatial weights. +We then seek to find the spatial pattern that maximises the probability density for a given value of the linear impact function. This is equivalent to finding the spatial pattern that maximises the log probability density for a given value of the linear impact function, which is slightly easier to solve. +This is a constrained maximisation problem, and can be solved using the method of Lagrange multipliers. +The Lagrangian function is given by + +$$L(x,\lambda)=-x^{t}C^{-1}x-\lambda(r^{t}x-1)$$ + +Differentiating with respect to $x$ and setting to zero gives the solution + +$$x\propto Cr$$ + +Normalising so that $x$ is a unit vector gives + +$$x=Cr/(r^{t}CCr)^{1/2}$$ + +This is the first DCA pattern. +Subsequent patterns can be derived which are orthogonal to the first, to form an orthonormal set and a method for matrix factorisation. 
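The differentiation step in the derivation above can be written out explicitly. The working below is a sketch that assumes $C$ is symmetric and invertible (as a covariance matrix of a non-degenerate distribution is) and uses the impact constraint $r^{t}x=1$ from the Lagrangian:

```latex
\begin{aligned}
\frac{\partial L}{\partial x} &= -2\,C^{-1}x - \lambda r = 0
  &&\Longrightarrow\quad x = -\tfrac{\lambda}{2}\,C r, \\
r^{t}x &= 1
  &&\Longrightarrow\quad \lambda = -\frac{2}{r^{t}Cr},
  \qquad x = \frac{Cr}{r^{t}Cr}, \\
\lVert Cr\rVert^{2} &= (Cr)^{t}(Cr) = r^{t}CCr
  &&\Longrightarrow\quad \frac{x}{\lVert x\rVert} = \frac{Cr}{(r^{t}CCr)^{1/2}}.
\end{aligned}
```

The first line gives the proportionality $x\propto Cr$, the second fixes the scale for unit impact, and the third recovers the unit-vector form stated in the derivation.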
+ +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Document-oriented_database-0.md b/data/en.wikipedia.org/wiki/Document-oriented_database-0.md new file mode 100644 index 000000000..7da28ddff --- /dev/null +++ b/data/en.wikipedia.org/wiki/Document-oriented_database-0.md @@ -0,0 +1,57 @@ +--- +title: "Document-oriented database" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Document-oriented_database" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:05.454091+00:00" +instance: "kb-cron" +--- + +A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. +Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown alongside the adoption of NoSQL itself. XML databases are a subclass of document-oriented databases optimized for XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal. +Document-oriented databases are conceptually an extension of the key–value store, another type of NoSQL database. In key-value stores, data is treated as opaque by the database, whereas document-oriented systems exploit the internal structure of documents to extract metadata and optimize storage and queries. Although in practice the distinction can be minimal due to modern tooling, document stores are designed to provide a richer programming experience with modern programming techniques. +Document databases differ significantly from traditional relational databases (RDBs). Relational databases store data in predefined tables, often requiring an object to be split across multiple tables. 
In contrast, document databases store all information for a given object in a single document, with each document potentially having a unique structure. This design eliminates the need for object-relational mapping when loading data into the database. + +== Documents == +The central concept of a document-oriented database is the notion of a document. Although implementations vary in their specific definitions, document-oriented databases generally treat documents as self-contained units that encapsulate and encode data in a standardized format. Common encoding formats include XML, YAML, JSON, as well as binary representations such as BSON. +Documents in a document store are equivalent to the programming concept of an object. They are not required to adhere to a fixed schema, and documents within the same collection may contain different fields or structures. Fields may be optional, and documents of the same logical type may differ in composition. For example, the following illustrates a document encoded in JSON: + +A second document might be encoded in XML as: + +The two example documents share some structural elements but also contain unique fields. The structure, text, and other data within each document are collectively referred to as the document's content and can be accessed or modified using retrieval or editing operations. Unlike relational databases, in which each record contains the same fields and unused fields are left empty, document-oriented databases do not require uniform fields across documents. This design allows new information to be added to some documents without affecting the structure of others. +Document databases often support the storage of additional metadata alongside the document content. Such metadata may relate to organizational features, security, indexing, or other implementation-specific features. 
+ +=== CRUD operations === +The core operations supported by a document-oriented database for manipulating documents are similar to those in other databases. Although terminology is not perfectly standardized, these operations are generally recognized as Create, Read, Update, and Delete (CRUD). + +Creation (C): Adds a new document to the database. +Retrieval (R): Retrieves documents or fields based on queries. +Update (U): Modifies the contents of existing documents. +Deletion (D): Removes documents from the database. + +=== Keys === +Documents in a document-oriented database are addressed via a unique identifier. This identifier, often a string, URI, or path, can be used to retrieve the document from the database. Most document stores maintain an index on the key to optimize retrieval, and in some implementations the key is required when creating or inserting a new document. + +=== Retrieval === +In addition to key-based access, document-oriented databases typically provide an API or query language that enables retrieval based on document content or associated metadata. For example, a query may return all documents with a specific field matching a given value. The available query features, indexing options, and performance characteristics vary across implementations. +Document stores differ from key-value stores in that they exploit the internal structure and metadata of stored documents. In many key-value stores, values are treated as opaque or "black-box" data, meaning the database system does not interpret their internal structure. By contrast, document-oriented databases can classify and interpret document content. This enables queries that distinguish between types of data––for example, retrieving all phone numbers containing "555" without also matching a postal code such as "55555." + +=== Editing === +Document databases typically provide mechanisms for updating or editing the content or metadata of a document. 
Updates may involve replacing the entire document or modifying individual elements or fields within the document. + +=== Organization === +Document database implementations support a variety of methods for organizing documents, including: + +Collections: Groups of documents. Depending on the implementation, a document may be required to belong to a single collection or may be allowed in multiple collections. +Tags and non-visible metadata: Additional data stored outside the main document content. +Directory hierarchies: Documents organized in a tree-like structure, often based on path or URI. +These organizational structures may differ between logical and physical representations (e.g. on disk or in memory). + +== Relationship to other databases == + +=== Relationship to key-value stores === +A document-oriented database can be viewed as a specialized form of key-value store, which is itself a category of NoSQL database. In a basic key-value store, the stored value is typically treated as opaque by the database system. By contrast, a document-oriented database provides APIs or a query and update language that allows queries and modifications based on the internal structure of the document. For users who do not require advanced query, retrieval, or update capabilities, the distinction between document-oriented databases and key-value stores may be minimal. + +=== Relationship to search engines === +Some search engine and information retrieval systems, such as Apache Solr and Elasticsearch, provide document storage and support core document operations. As a result, they may meet certain functional definitions of a document-oriented database, although their primary design goals differ. 
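The contrast drawn above between opaque key-value storage and content-aware document storage can be sketched with a small in-memory example. The class names `KeyValueStore` and `DocumentStore`, the method names, and the sample contacts are all hypothetical; this is not the API of any real database, only an illustration of key-based access, CRUD operations, and retrieval by field content.

```python
class KeyValueStore:
    """Hypothetical store that treats values as opaque: access is by key only."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def read(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]


class DocumentStore(KeyValueStore):
    """Hypothetical store that also inspects the internal structure of documents."""

    def update_field(self, key, field, value):
        # Modify one field without replacing the whole document.
        self._data[key][field] = value

    def find(self, field, predicate):
        # Content-based retrieval: only the named field is examined.
        return sorted(
            key
            for key, doc in self._data.items()
            if field in doc and predicate(doc[field])
        )


store = DocumentStore()
store.create("contacts/1", {"name": "Ada", "phone": "555-0100", "postal_code": "10001"})
# The second document has an extra field; no shared schema is required.
store.create("contacts/2", {"name": "Bo", "phone": "212-0199",
                            "postal_code": "55555", "email": "bo@example.org"})

# Matches phone numbers containing "555" without matching the postal code "55555".
matches = store.find("phone", lambda value: "555" in value)
store.update_field("contacts/1", "country", "US")
print(matches)
```

Because `find` looks only at the named field, the second contact is not returned even though its postal code contains the same digits.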
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Document-oriented_database-1.md b/data/en.wikipedia.org/wiki/Document-oriented_database-1.md new file mode 100644 index 000000000..cfdf08525 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Document-oriented_database-1.md @@ -0,0 +1,48 @@ +--- +title: "Document-oriented database" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Document-oriented_database" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:05.454091+00:00" +instance: "kb-cron" +--- + +=== Relationship to relational databases === +In a relational database, data is organized into predefined types represented as tables. Each table contains rows (records) with a fixed set of columns (fields), so all records in a table share the same structure. Administrators typically define indexes on selected fields to improve query performance. A central principle of relational database design is database normalization, in which data that might otherwise be repeated is stored in separate tables and linked using keys. When records in different tables are related, a foreign key is used to associate them. +For example, an address book application may store a contact's name, image, phone numbers, mailing addresses, and email addresses. In a normalized relational design, separate tables might be created for contacts, phone numbers, and email addresses. The phone number table would include a foreign key referencing the associated contact. To reconstruct a complete contact record, the database retrieves related information from each table using the foreign keys and combines it into a single record. +In contrast, a document-oriented database stores all data related to an object within a single document, which is stored in the database as a single entry. In the address book example, the contact's name, image, and contact information may be stored together in one document.
The document is retrieved using a unique key, and all related information is returned together, without needing to look up multiple tables. +A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in a database, and documents can change in type and form over time. For example, a new field such as COUNTRY_FLAG can be added to new documents as they are inserted without affecting existing documents. To aid retrieval, document-oriented systems generally allow the administrator to provide hints to the database for locating certain types of information. These hints work in a similar fashion to indexes in relational databases. Many systems also allow additional metadata outside the content of the document itself, such as tagging entries as part of an address book, which enables retrieval of related information, for instance, all the address book entries. This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables). +In the traditional normalized relational model, objects in the database are represented as separate rows of data, with no inherent structure beyond what is defined in the tables. This can create difficulties when translating programming objects to and from their corresponding database rows, a challenge known as object-relational impedance mismatch. In contrast, document stores often map programming objects directly into the database, preserving much of their internal structure. Databases using this approach are frequently described as NoSQL systems. + +== Implementations == + +=== XML database implementations === + +Most XML databases are document-oriented databases. 
+ +== See also == +Database theory +Data hierarchy +Data analysis +Full-text search +In-memory database +Internet Message Access Protocol (IMAP) +Machine-readable document +Multi-model database +NoSQL +Object database +Online database +Real-time database +Relational database +Content management system + +== Notes == + +== References == + +== Further reading == +Arkin, Assaf. (September 20, 2007). "Read Consistency: Dumb Databases, Smart Services." Labnotes: Don't Let the Bubble Go to Your Head! Archived March 27, 2008 at the Wayback Machine. + +== External links == +DB-Engines Ranking of Document Stores by popularity, updated monthly \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Double_mass_analysis-0.md b/data/en.wikipedia.org/wiki/Double_mass_analysis-0.md new file mode 100644 index 000000000..d87fdcfd4 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Double_mass_analysis-0.md @@ -0,0 +1,162 @@ +--- +title: "Double mass analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Double_mass_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:07.787564+00:00" +instance: "kb-cron" +--- + +Double mass analysis is a simple graphical method to evaluate the consistency of hydrological data. The double mass (DM) approach plots the cumulative data of one variable against the cumulative data of a second variable. A break in the slope of a linear function fit to the data is thought to represent a change in the relation between the variables. This approach provides a robust yet simple graphical means of detecting a change in the behavior of precipitation and recharge. It is a commonly used data analysis approach for investigating the behaviour of records made of hydrological or meteorological data at a number of locations. It is used to determine whether there is a need for corrections to the data to account for changes in data collection procedures or other local conditions.
Such changes may result from a variety of things including changes in instrumentation, changes in observation procedures, or changes in gauge location or surrounding conditions. Double mass analysis is considered an essential check on the consistency of a hydrological or meteorological record before the record is used for analysis. This method is based on the hypothesis that each item of the recorded data of a population is consistent. +An example of a double mass analysis is a "double mass plot", or "double mass curve". For this, points and/or a joining line are plotted where the x- and y-coordinates are determined by the running totals of the values observed at two stations. If both stations are affected to the same extent by the same trends then a double mass curve should follow a straight line. A break in the slope of the curve would indicate that conditions have changed at one location but not at another. Breaks in the double-mass curve of such variables are caused by changes in the relation between the variables. These changes may be due to changes in the method of data collection or to physical changes that affect the relation. This technique is based on the principle that when all recorded data come from the same parent population, they are consistent. + + +== Procedure for Data Correction == +Let $(x_{i},y_{i})$ be the data points; then the procedure for double mass analysis is as follows: + +Divide the data into $n_{i}$ distinct categories of equal slope ($S_{i}$).
+Obtain the correction factor for category $n_{i+1}$ as $c_{i}={\frac {S_{i}}{S_{i+1}}}$ +Multiply category $n_{i+1}$ by $c_{i}$ to get the corrected data. +After correction, repeat this process until all data points have the same slope. + + +== See also == +Statistics + + +== Notes == + + +== Further reading == +Dubreuil P. (1974) Initiation à l'analyse hydrologique Masson & Cie et ORSTOM, Paris. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Error_level_analysis-0.md b/data/en.wikipedia.org/wiki/Error_level_analysis-0.md new file mode 100644 index 000000000..54dd2a9a9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Error_level_analysis-0.md @@ -0,0 +1,46 @@ +--- +title: "Error level analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Error_level_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:08.944694+00:00" +instance: "kb-cron" +--- + +Error level analysis (ELA) is the analysis of compression artifacts in digital data with lossy compression such as JPEG. + + +== Principles == +When used, lossy compression is normally applied uniformly to a set of data, such as an image, resulting in a uniform level of compression artifacts. +Alternatively, the data may consist of parts with different levels of compression artifacts. This difference may arise from the different parts having been repeatedly subjected to the same lossy compression a different number of times, or the different parts having been subjected to different kinds of lossy compression. A difference in the level of compression artifacts in different parts of the data may therefore indicate that the data has been edited.
+In the case of JPEG, even a composite with parts subjected to matching compressions will have a difference in the compression artifacts. +In order to make the typically faint compression artifacts more readily visible, the data to be analyzed is subjected to an additional round of lossy compression, this time at a known, uniform level, and the result is subtracted from the original data under investigation. The resulting difference image is then inspected manually for any variation in the level of compression artifacts. In 2007, N. Krawetz denoted this method "error level analysis". +Additionally, digital data formats such as JPEG sometimes include metadata describing the specific lossy compression used. If in such data the observed compression artifacts differ from those expected from the given metadata description, then the metadata may not describe the actual compressed data, which may indicate that the data have been edited. + + +== Limitations == +By its nature, data without lossy compression, such as a PNG image, cannot be subjected to error level analysis. Consequently, since editing could have been performed on lossless data, with lossy compression then applied uniformly to the edited composite, the presence of a uniform level of compression artifacts does not rule out editing of the data. +Additionally, any non-uniform compression artifacts in a composite may be removed by subjecting the composite to repeated, uniform lossy compression. Also, if the image color space is reduced to 256 colors or fewer, for example, by conversion to GIF, then error level analysis will generate useless results. +More significantly, the actual interpretation of the level of compression artifacts in a given segment of the data is subjective, and the determination of whether editing has occurred is therefore not robust.
+ + +== Controversy == +In May 2013, Dr Neal Krawetz used error level analysis on the 2012 World Press Photo of the Year and concluded on his Hacker Factor blog that it was "a composite" with modifications that "fail to adhere to the acceptable journalism standards used by Reuters, Associated Press, Getty Images, National Press Photographer's Association, and other media outlets". The World Press Photo organizers responded by letting two independent experts analyze the image files of the winning photographer and subsequently confirmed the integrity of the files. One of the experts, Hany Farid, said about error level analysis that "It incorrectly labels altered images as original and incorrectly labels original images as altered with the same likelihood". Krawetz responded by clarifying that "It is up to the user to interpret the results. Any errors in identification rest solely on the viewer". +In May 2015, the citizen journalism team Bellingcat wrote that error level analysis revealed that the Russian Ministry of Defense had edited satellite images related to the Malaysia Airlines Flight 17 disaster. In a reaction to this, image forensics expert Jens Kriese said about error level analysis: "The method is subjective and not based entirely on science", and that it is "a method used by hobbyists". On his Hacker Factor Blog, Neal Krawetz, the inventor of error level analysis, criticized both Bellingcat's use of error level analysis as "misinterpreting the results" and, on several points, Jens Kriese's "ignorance" regarding error level analysis. + + +== See also == +Image analysis + + +== References == + +"Tutorial: Error Level Analysis". fotoforensic.com. Retrieved 2015-07-23.
+ + +== External links == +Image Forensics : Error Level Analysis +FotoForensics +Fake Image Detector using Error Level Analysis \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Facet_theory-0.md b/data/en.wikipedia.org/wiki/Facet_theory-0.md new file mode 100644 index 000000000..5d6b3ed4e --- /dev/null +++ b/data/en.wikipedia.org/wiki/Facet_theory-0.md @@ -0,0 +1,31 @@ +--- +title: "Facet theory" +chunk: 1/8 +source: "https://en.wikipedia.org/wiki/Facet_theory" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:10.375224+00:00" +instance: "kb-cron" +--- + +Facet theory is a metatheory for the multivariate behavioral sciences that posits that scientific theories and measurements can be advanced by discovering relationships between conceptual classifications of research variables and empirical partitions of data-representation spaces. For this purpose, facet theory proposes procedures for (1) Constructing or selecting variables for observation, using the mapping sentence technique (a formal definitional framework for a system of observations), and (2) Analyzing multivariate data, using data representation spaces, notably those depicting similarity measures (e.g., correlations), or partially ordered sets, derived from the data. +Facet theory is characterized by its direct concern with the entire content-universe under study, containing many, possibly infinitely many, variables. Observed variables are regarded just as a sample of statistical units from the multitude of variables that make up the investigated attribute (the content-universe). Hence, Facet theory proposes techniques for sampling variables for observation from the entire content universe; and for making inferences from the sample of observed variables to the entire content universe. 
The sampling of variables is done with the aid of the mapping sentence technique (see Section 1); and inferences from the sample of observed variables to the entire content universe are made with respect to correspondences between conceptual classifications (of attribute-variables or of population-members) and partitions of empirical geometric representation spaces obtained in data analysis (see Sections 2 & 3). +Of the many types of representation spaces that have been proposed, two stand out as especially fruitful: Faceted-SSA (Faceted Smallest Space Analysis) for structuring the investigated attribute (see Section 2); and POSAC (Partial Order Scalogram Analysis by base Coordinates) for multiple scaling measurements of the investigated attribute (see Section 3). +Inasmuch as observed variables in a behavioral study form in fact but a sample from the content-universe of interest, facet theory's procedures and principles serve to avoid errors that may ensue from incidental sampling of observed variables, thus meeting the challenge of the replication crisis in psychological research and in behavioral research in general. +Facet Theory was initiated by Louis Guttman and has been further developed and applied in a variety of disciplines of the behavioral sciences including psychology, sociology, and business administration. + +== The mapping sentence == + +=== Definition and properties of the mapping sentence === +Definition (Guttman). A mapping sentence is a verbal statement of the domain and of the range of a mapping including connectives between facets as in ordinary language. +In the context of behavioral research, a mapping sentence is essentially a function whose domain consists of the respondents and of the stimuli as arguments, and whose image consists of the cartesian product of the ranges of responses to the stimuli, where each response-range is similarly ordered from high to low with respect to a concept common to all stimuli. 
When stimuli are classified a priori by one or more content criteria, the mapping sentence facilitates stratified sampling of the content-universe. A classification of the stimuli by their content is called a content facet; and the pre-specified set of responses to a stimulus (classifying respondents by their response to that stimulus) is called a range facet. +The mapping sentence defines the system of observations to be performed. As such, the mapping sentence provides also the essential concepts in terms of which research hypotheses may be formulated. + +=== An example from intelligence research === +Suppose members pi of a population P are observed with respect to their success in a written verbal intelligence test. Such observations may be described as a mapping from the observed population to the set of possible scores, say, R = {1,...,10}: Pq1 → R, where q1 is the sense in which a specific score is assigned to every individual in the observed population P, i.e., q1 is "verbal intelligence" in this example. Now, one may be interested in observing also the mathematical or, more specifically, the numerical intelligence of the investigated population; and possibly also their spatial intelligence. Each of these kinds of intelligence is a "sense" in which population members pi may be mapped into a range of scores R = {1,...,10}. Thus, 'intelligence' is now differentiated into three types of materials: verbal (q1), numerical (q2) and spatial (q3). Together, P, the population, and Q = {q1, q2, q3}, the set of types of intelligence, form a cartesian product which constitutes the mapping domain. The mapping is from the set of pairs (pi, qj) to the common range of test-scores R = {1,...,10}: P × Q → R. +A facet is a set that serves as a component-set of a cartesian product. Thus, P is called the population facet, Q is called a content facet, and the set of scores obtainable for each test is a range facet. 
The range facets of the various items (variables) need not be identical in size: they may have any finite number of scores, or categories, greater or equal to 2. + +=== The Common Meaning Range (CMR) === +The ranges of the items pertaining to an investigated content-universe – intelligence in this example – should all have a Common Meaning Range (CMR); that is, they must be ordered from high to low with respect to a common meaning. Following Guttman, the common meaning proposed for the ranges of intelligence-items is "correctness with respect to an objective rule". +The concept of CMR is central in facet theory: It serves to define the content-universe being studied by specifying the universe of items pertaining to that content-universe. Thus, the mapping-definition of intelligence, advanced by facet theory is: +"An item belongs to the universe of intelligence items if and only if its domain requires performance of a cognitive task concerning an objective rule and its range is ordered from high correctness to low correctness with respect to that rule." \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Facet_theory-1.md b/data/en.wikipedia.org/wiki/Facet_theory-1.md new file mode 100644 index 000000000..f690b906f --- /dev/null +++ b/data/en.wikipedia.org/wiki/Facet_theory-1.md @@ -0,0 +1,34 @@ +--- +title: "Facet theory" +chunk: 2/8 +source: "https://en.wikipedia.org/wiki/Facet_theory" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:10.375224+00:00" +instance: "kb-cron" +--- + +An initial framework for observing intelligence could be Mapping Sentence 1. +The mapping sentence serves as a unified semantic device for specifying the system of intelligence test items, according to the present conceptualization. Its content facet, the material facet, may now serve as a classification of intelligence test items to be considered. 
Thus, in designing observations, a stratified sampling of items is afforded by ensuring an appropriate selection of items from each of the material facet elements; that is, from each class of items: the verbal, the numerical and the spatial. + +=== Enriching the mapping sentence === +The research design can be enriched by introducing to the mapping sentence an additional, independent classification of the observations in the form of an additional content-facet, thereby facilitating systematic differentiations of the observations. For example, intelligence items may be classified also according to the cognitive operation required in order to respond correctly to an item: whether rule-recall (memory), rule-application, or rule-inference. Instead of the three sub-content-universes of intelligence defined by the material facet alone, we now have nine sub-content-universes defined by the cartesian multiplication of the material and the mental-operation facets. See mapping sentence 2. + +Another way of enriching a mapping sentence (and the scope of the research) is by adding an element (a class) to an existing content facet; for example, by adding Interpersonal material as a new element to the extant material facet. See Mapping Sentence 3. + +=== Content profiles === +A selection of one element from each of the two content facets defines a content profile which represents a sub-content-universe of intelligence. For example, the content profile (c2, q2) represents the application of rules for performing mathematical computations, such as performing long division. The 3x4=12 sub-content-universes constitute twelve classes of intelligence items. In designing observations, the researcher would strive to include a number of varied items from each of these 12 classes so that the sample of observed items would be representative of the entire intelligence universe. 
Of course, this stratified sampling of items depends on the researchers' conception of the studied domain, reflected in their choice of content-facets. But, in the larger cycle of the scientific investigation (which includes Faceted SSA of empirical data, see next section), this conception may undergo adjustments and remolding, converging to improved choices of content-facets and observations, and ultimately to robust theories in research domain. In general, mapping sentences may attain high levels of complexity, size and abstraction through various logical operations such as recursion, twist, decomposition and completion. + +=== Cartesian decomposition and completion: an example === +In drafting a mapping sentence, an effort is made to include the most salient content-facets, according to the researcher's existing conception of the investigated domain. And for each content facet, attempt is made to specify its elements (classes) so that they be exhaustive (complete) and exclusive (non-overlapping) of each other. Thus, the element 'interpersonal' has been added to the incumbent 3-element material facet of intelligence by a two-step facet-analytic procedure. Step 1, cartesian decomposition of the 3-element material facet into two binary elementary facets: The Environment Facet, whose elements are 'physical environment' and 'human-environment'; and the Symbolization Facet whose elements are 'symbolic' (or high symbolization), and 'concrete' (or low symbolization). Step 2, cartesian completion of the material facet is then sought by attempting to infer the missing material classifiable as 'human environment' and 'concrete'. + +In facet theory, this 2×2 classification of intelligence-testing material may now be formulated as an hypothesis to be tested empirically, using Faceted Smallest Space Analysis (SSA). 
+ +=== Complementary topics concerning the mapping sentence === +Despite its seemingly rigid appearance, the mapping sentence format can accommodate complex semantic structures such as twists and recursions, while retaining its essential cartesian structure. +In addition to guiding the collection of data, mapping sentences have been used to content-analyze varieties of conceptualizations and texts, such as organizational quality, legal documents and even dream stories. + +== Concepts as spaces: faceted SSA == + +=== Description of Faceted Smallest Space Analysis (Faceted SSA) === +Facet theory conceives of a multivariate attribute as a content-universe defined by the set of all its items, as specified by the attribute mapping-definition, illustrated above. In facet-theoretical data analysis, the attribute (e.g., intelligence) is likened to a geometric space of suitable dimensionality, whose points represent all possible items. Observed items are processed by Faceted SSA, a version of Multidimensional Scaling (MDS) which involves the following steps: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Facet_theory-2.md b/data/en.wikipedia.org/wiki/Facet_theory-2.md new file mode 100644 index 000000000..8570bdcd9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Facet_theory-2.md @@ -0,0 +1,38 @@ +--- +title: "Facet theory" +chunk: 3/8 +source: "https://en.wikipedia.org/wiki/Facet_theory" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:10.375224+00:00" +instance: "kb-cron" +--- + +Receiving as input (or computing from input data) a matrix of similarity coefficients, specifying, for each pair of items, how similar they are. A common example is the computation of a correlation-coefficient matrix from input data, where the size of a correlation coefficient between two variables reflects the degree of similarity between them.
+Mapping the items (variables) as points in a geometric space of a given dimensionality while preserving as well as possible the condition: If rij>rkl then dij aj, if and only if aik ≥ ajk for k = 1, ..., n, and aik′ > ajk′ for some k′. Two different profiles are incomparable, denoted by ai $ aj, if neither ai > aj nor aj > ai. A, and therefore its subset A′, form a partially ordered set. +Facet Theoretical measurement consists in mapping points a(pi) of A′ into a coordinate space X of the lowest dimensionality while preserving observed order relations, including incomparability: +Definition. The p.o. dimensionality of scalogram A′ is the smallest m (m ≤ n) for which there exist m facets X1 ... Xm (each Xi is ordered) and there exists a 1-1 mapping Q: X′ → A′ from X′ (X′ ⊆ X = X1 × ⋯ × Xm) to A′ such that a > a′ if and only if x > x′ whenever Q maps points x, x′ in X′ to points a, a′ ∈ A. +The coordinate scales, Xi (i = 1, ..., m), represent underlying fundamental variables whose meanings must be inferred in any specific application. The well-known Guttman scale [24] (example: 1111, 1121, 1131, 2131, 2231, 2232) is simply a 1-d scalogram, i.e. one all of whose profiles are comparable. +The procedure of identifying and interpreting the coordinate scales X1...Xm is called multiple scaling. Multiple scaling is facilitated by partial order scalogram analysis by base coordinates (POSAC), for which algorithms and computer programs have been devised. In practice, a particular dimensionality is attempted and a solution that best accommodates the order-preserving condition is sought.
The POSAC/LSA program finds an optimal solution in 2-d coordinate space, then goes on to analyze by Lattice Space Analysis (LSA) the role played by each of the variables in structuring the POSAC 2-space, thereby facilitating interpretation of the derived coordinate scales, X1, X2. Recent developments include the algorithms for computerized partitioning of the POSAC space by the range facet of each variable, which induces meaningful intervals on the coordinate scales, X, Y. + +=== Example 3. TV watching patterns: analysis of simplified survey data === +Source: +Members of a particular population were asked four questions: whether they watched TV the night before for an hour at 7 PM (hour 1), at 8 PM (hour 2), at 9 PM (hour 3) and at 10 PM (hour 4). A positive answer to a question was recorded as 1, and a negative answer, as 0. Thus, for example, the profile 1010 represents a person who watched TV at 7 PM and at 9 PM but not at 8 PM and at 10 PM. Suppose that out of the 16 combinatorially possible profiles, only the following eleven profiles were observed empirically: 0000, 1000, 0100, 0010, 0001, 1100, 0110, 0011, 1110, 0111, 1111. Figure 3 is an order-preserving mapping of these profiles into a 2-dimensional coordinate space. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Facet_theory-5.md b/data/en.wikipedia.org/wiki/Facet_theory-5.md new file mode 100644 index 000000000..1bd5a67da --- /dev/null +++ b/data/en.wikipedia.org/wiki/Facet_theory-5.md @@ -0,0 +1,41 @@ +--- +title: "Facet theory" +chunk: 6/8 +source: "https://en.wikipedia.org/wiki/Facet_theory" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:10.375224+00:00" +instance: "kb-cron" +--- + +Given this POSAC solution, an attempt is made to interpret the two coordinates, X1 and X2, as two fundamental scales of the investigated phenomenon of evening TV watching by the investigated population. 
This is done by, first, interpreting the intervals (equivalence classes) within each coordinate, and then trying to conceptualize the derived meanings of the ordered intervals in terms of a meaningful notion that may be attributed to the coordinate. +In the present simplified example, this is easy: inspecting the map, we attempt to identify the feature that distinguishes all profiles with a given score in X1. Thus, we find that profiles with X1 = 4, and only they, represent TV watching in the fourth hour. Profiles with X1 = 3 all have 1 in the third watching hour but 0 at the fourth hour, i.e., the third hour is the latest watching hour. X1 = 2 is assigned to, and only to, profiles whose latest watching hour is the second hour. And, finally, X1 = 1 is for the profile 1000, which represents the fact that the first hour is the only – and therefore the latest – watching hour (ignoring the profile 0000 of those who didn't watch TV at the specified hours, which could be assigned (0,0) in this coordinate-space). Hence, it may be concluded that intervals of coordinate X1 represent j, the latest hour (among the four hours observed) in which TV was watched (j = 1, ..., 4). Similarly, it is found that intervals of coordinate X2 represent 5 − k, where k (k = 1, ..., 4) is the earliest hour of TV watching. +Indeed, for profiles of the observed set, each of which represents a single sequence of continuous TV watching, specifying the earliest and latest watching hours provides a full description of the watching hours. +Example 3 illustrates key features of Multiple Scaling by POSAC that render this procedure a theory-based multivariate measurement: + +The two scores assigned by Multiple Scaling to every observed profile (and hence to every person in the observed sample) replace the more numerous scores (four, in the present example) of the observed variables, while retaining all observed order relations, including incomparability.
The new scores assess observed persons on the two coordinate-scales, taken to constitute Nature's fundamental variables. +The two coordinate-scales have intrinsic meanings that probe into a deeper significance than the observed variables considered severally. In the present example, the earliest and the latest hour indeed exhaust the essential aspects of the pattern of TV watching, given the particular set of observed profiles. +The concepts derived for the fundamental, unobserved coordinate-scales retain the CMR—the essential meaning common to all observed variables. In the present example, the CMR is more (vs. less) TV watching. For, considering the observed variables, each of them records high (1) vs. low (0) TV watching in a given hour. And the derived coordinate-scales, too, record high (4) vs. low (1) TV watching, since ceteris paribus, the later is the latest watching hour, the more TV one watches (X1); and the earlier is the earliest watching hour, the more TV one watches ( X2). +These features are present also in applications that are less obvious, to produce scales with novel meanings. + +=== Example 4. 
Measuring distributive justice attitudes === +In the systemic theory of distributive justice (DJ), alternative allocations of a given amount of an educational resource (100 supplementary teaching hours) between gifted and disadvantaged pupils may be classified into one of four types, the preference for each reflecting one's DJ attitude: +Equality, where the gifted and the disadvantaged pupils get the same amount of the supplementary resource; +Fairness, where the disadvantaged pupils get more of the resource than the gifted, in proportion to their weakness relative to the gifted; +Utility, where the gifted get more of the resource than the disadvantaged pupils (so as to promote future contribution to the general good); +Corrective Action, where the disadvantaged pupils get more of the resource than the gifted over and above the proportion of their weakness relative to the gifted pupils (so as to compensate them for past accumulated disadvantage). +Following the Faceted SSA validation of the four DJ modes of Equality, Fairness, Utility, and Corrective Action, profiles based on eight dichotomized DJ attitude variables, observed on a sample of 191 respondents, were created. Of the 256 combinatorially possible profiles, 35 were observed and analyzed by POSAC to obtain the measurement space shown in Figure 4. For each of the variables, an optimal partition-line was computed that separates a high from a low score in that variable. (Logically, partition-lines must look like non-increasing step functions.)
Then, for each of the four attitude types, the characteristic partition-line was identified as follows: + +Fairness: a straight vertical line; +Utility: a straight horizontal line; +Equality: an L-shaped line; +Corrective Action: an inverted-L-shaped line. +The content significance of the intervals induced by these partition-lines on the X coordinate and on the Y coordinate of the POSAC space is now identified, thereby defining the contents of the X and Y Coordinate Scales of DJ attitudes. +The X-coordinate Scale, interpreted as Enhanced Fairness Attitude Scale: + +Interval 1. Low Fairness & Low Equality DJ Attitude +Interval 2. Low Fairness & High Equality DJ Attitude +Interval 3. High Fairness & Low Corrective Action DJ Attitude +Interval 4. High Fairness & High Corrective Action DJ Attitude +That is, Enhanced Fairness Attitude, even if low (intervals 1 and 2), is somewhat present when Equality is favored (interval 2). And if Enhanced Fairness Attitude is high (intervals 3 and 4), it reaches the extreme level (interval 4) when Corrective Action is favored. +The Y-coordinate Scale, interpreted as Enhanced Utility Attitude Scale: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Facet_theory-6.md b/data/en.wikipedia.org/wiki/Facet_theory-6.md new file mode 100644 index 000000000..2a6a8e4c2 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Facet_theory-6.md @@ -0,0 +1,21 @@ +--- +title: "Facet theory" +chunk: 7/8 +source: "https://en.wikipedia.org/wiki/Facet_theory" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:10.375224+00:00" +instance: "kb-cron" +--- + +Interval 1. Low Utility & Low Equality DJ Attitude +Interval 2. Low Utility & High Equality DJ Attitude +Interval 3. High Utility & Low Corrective Action DJ Attitude +Interval 4. High Utility & High Corrective Action DJ Attitude +That is, Enhanced Utility Attitude, even if low (intervals 1 and 2), is somewhat present when Equality is favored (interval 2).
If Enhanced Utility Attitude is high (intervals 3 and 4), it reaches the extreme level (interval 4) when Corrective Action is favored. (This may well reflect the sentiment that, in the long run, the advancement of disadvantaged pupils serves the common good.) +The meanings of the fundamental variables, X and Y, while relying on the concepts of fairness and of utility, respectively, suggest new notions that modify them. The new notions were christened Enhanced (or Extended) Fairness and Enhanced (or Extended) Utility. + +=== Complementary topics in partial order spaces === +Higher order partition lines. The above simple measurement space illustrates partition-lines that are straight or have one bend. More complex measurement spaces result with items whose partition-lines have two or more bends. +While partial order spaces are used mainly for analyzing score profiles (based on range facets), under certain conditions, they may be applied to the analysis of content profiles; i.e., those based on content facets. +Relating POSAC Measurement Space to the SSA Concept Space. Based on the same data matrix, POSAC measurement space and Faceted SSA concept space are mathematically related. Proved relationships rely on the introduction of a new kind of coefficient, E*, the coefficient of structural similarity. While E* assesses pairwise similarity between variables, it does depend on variations in the remaining n-2 variables processed. That is, in the spirit of Facet Theory, E* depends on the sampled contents as well as on the sampled population. LSA1 procedure, within 2-dimensional POSAC/LSA program, is a special version of SSA with E* as the similarity coefficient, and with lattice ("city block") as the distance function. Under specified conditions, LSA1 may be readily derived from the boundary scales of the POSAC configuration, thereby highlighting concept/measurement space duality. 
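
Returning to Example 3, the claim that the two coordinates retain all observed order relations, including incomparability, can be checked mechanically. The following is a minimal sketch and not the POSAC/LSA program: the function names are hypothetical, and the coordinates are not derived by POSAC but computed directly from the interpretation given in the text (X1 = latest watching hour, X2 = 5 minus the earliest watching hour, with profile 0000 placed at (0, 0)).

```python
from itertools import combinations

# The eleven profiles observed in Example 3 (hours 1..4, 1 = watched TV).
PROFILES = ["0000", "1000", "0100", "0010", "0001", "1100",
            "0110", "0011", "1110", "0111", "1111"]


def coordinates(profile):
    """Map a profile to (X1, X2) = (latest hour, 5 - earliest hour)."""
    hours = [i + 1 for i, c in enumerate(profile) if c == "1"]
    if not hours:
        return (0, 0)  # the text suggests (0,0) for those who did not watch
    return (max(hours), 5 - min(hours))


def dominates(a, b):
    """True if a > b in the partial order: a >= b in every component and a != b."""
    return a != b and all(x >= y for x, y in zip(a, b))


def is_order_preserving(profiles):
    """Check that the mapping is 1-1 and that a > b iff x(a) > x(b) for all pairs."""
    coords = {p: coordinates(p) for p in profiles}
    if len(set(coords.values())) != len(profiles):
        return False
    for p, q in combinations(profiles, 2):
        a = tuple(int(c) for c in p)
        b = tuple(int(c) for c in q)
        if dominates(a, b) != dominates(coords[p], coords[q]):
            return False
        if dominates(b, a) != dominates(coords[q], coords[p]):
            return False
    return True


for p in PROFILES:
    print(p, coordinates(p))
print("order-preserving:", is_order_preserving(PROFILES))
```

Because the check compares both directions for every pair, two incomparable profiles (for example 1000 and 0100) must also map to incomparable points.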
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Facet_theory-7.md b/data/en.wikipedia.org/wiki/Facet_theory-7.md new file mode 100644 index 000000000..c382f8904 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Facet_theory-7.md @@ -0,0 +1,30 @@ +--- +title: "Facet theory" +chunk: 8/8 +source: "https://en.wikipedia.org/wiki/Facet_theory" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:10.375224+00:00" +instance: "kb-cron" +--- + +== Facet theory: comparisons and comments == +Concerned with the entire cycle of multivariate research – concept definition, observational design, and data analysis for concept-structure and measurement, Facet Theory constitutes a novel paradigm for the behavioral sciences. Hence, only limited aspects of it can be compared with specific statistical methods. +A distinctive feature of Facet Theory is its explicit concern with the entire +set of variables included in the investigated content-universe, regarding the subset of observed variables as but a sample from which inferences can be made. Hence, clusters of variables, if observed, are of no significance. They are simply unimportant artifacts of the procedure for sampling of the variables. This is in contrast with cluster analysis or factor analysis where recorded clustering patterns determine research results and interpretations. There have been various attempts to describe technical differences between Factor Analysis and Facet Theory. Briefly, it may be said that while Factor Analysis aims to structure the set of variables selected for observation, Facet Theory aims to structure the entire content universe of all variables, observed as well as unobserved, relying on the continuity principle and using regional hypotheses as an inferential procedure. 
+Guttman's SSA, as well as Multidimensional Scaling (MDS) in general, have often been described as a procedure for visualizing similarities (e.g., correlations) between analyzed units (e.g., variables) in which the researcher has specific interest. (See, for example, Wikipedia, October 2020: "Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset"). Modern Facet Theory, however, concerned with theory construction in the behavioral sciences, assigns SSA/MDS space a different role. The analyzed units are regarded as a sample of statistical units representing all units that pertain to the content-universe, and their dispersion in the SSA/MDS space is used to infer the structure of the content-universe, namely, to infer space partitionings that define components of the content-universe and their spatial interrelationships. The inferred structure, if replicated, may suggest a theory in the investigated domain and provide a basis for theory-based measurements. +Misgivings and responses +One reservation that has been voiced concerns the usefulness of a successful SSA map (one whose partition-pattern matches a content-classification of the mapped variables). What are the consequences of an SSA map? Does such a map qualify as a theory? +In response, it may be pointed out that (a) consistently replicated empirical partition-patterns in a domain of research constitute a scientific lawfulness which, as such, is of interest to science; (b) a partition-pattern often leads to insights that explain behavior and may have potential applications. For example, the Radex Theory of Intelligence implies that inferential abilities are less differentiated by kinds of material than memory (or rule-recall, see Example 1 above); (c) Faceted SSA is a useful preliminary procedure for performing meaningful, non-arbitrary measurements by Multiple Scaling (POSAC). See Example 4.
+A common doubt about SSA was voiced by a sympathetic but mystified user of SSA: "Smallest Space Analysis seems to come up with provocative pictures that an imaginative observer can usually make some sense of –– in fact, I have often referred to SSA as the sociologist's Rorschach test for imagination". Indeed, missing in Facet Theory are statistical significance tests that would indicate the stability of discovered or hypothesized partition patterns across population samples. For example, it is not clear how to compute the probability of obtaining a hypothesized partition pattern, assuming that in fact the variables are randomly dispersed over the SSA map. +In response, facet theorists claim that in Facet Theory the stability of research results is established by replications, as is the common practice in the natural sciences. Thus, if the same partition-pattern is observed across many population samples (and if no unexplained counterexamples are recorded), confidence in the research outcome would increase. Moreover, Facet Theory adds a stringent requirement for establishing scientific lawfulness, namely that the hypothesized partition-pattern would hold also across different selections of variables, sampled from the same mapping sentence. +Facet Theory is regarded as a promising metatheory for the behavioral sciences by Clyde Coombs, an eminent psychometrician and pioneer of mathematical psychology, who commented: “It is not uncommon for a behavioral theory to be somewhat ambiguous about its domain. The result is that an experiment usually can be performed which will support it and another experiment will disconfirm it. … The problem of how to define the boundaries of a domain, especially in social and behavioral science, is subtle and complex. Guttman’s facet theory (see Shye, 1978) is, I believe, the only substantial attempt to provide a general theory for characterizing domains; in this sense, it is a metatheory. 
As behavioral science advances so will the need for such theory.” + +== References == + +== Further reading == +Guttman, R. & Greenbaum, C. W. (1998). "Facet Theory: Its Development and Current Status." European Psychologist, Vol. 3, No. 1, March 1998, pp. 13–36. +Levy, S. (Ed.) (1994). Louis Guttman on Theory and Methodology: Selected Writings. Aldershot: Dartmouth. +Canter (Ed.) (1985). Facet Theory: Approaches to Social Research. New York: Springer. +Guttman, R. (1994). Radex Theory. In Robert J. Sternberg (Ed.), Encyclopedia of Human Intelligence. New York, NY: Macmillan Publishing, 907–912. +Hackett, P.M.W. (2021) Facet Theory and the Mapping Sentence: Evolving Philosophy, Use and Declarative Applications, (second, revised and enlarged edition), Basingstoke: Palgrave McMillan Publishers. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Feature_engineering-0.md b/data/en.wikipedia.org/wiki/Feature_engineering-0.md new file mode 100644 index 000000000..371200dc3 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Feature_engineering-0.md @@ -0,0 +1,91 @@ +--- +title: "Feature engineering" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Feature_engineering" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:11.527727+00:00" +instance: "kb-cron" +--- + +Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability. +Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. 
For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics. + + +== Clustering == +One of the applications of feature engineering has been clustering of feature-objects or sample-objects in a dataset. In particular, feature engineering based on matrix decomposition has been extensively used for data clustering under non-negativity constraints on the feature coefficients. These methods include Non-Negative Matrix Factorization (NMF), Non-Negative Matrix-Tri Factorization (NMTF), Non-Negative Tensor Decomposition/Factorization (NTF/NTD), etc. The non-negativity constraints on coefficients of the feature vectors mined by the above-stated algorithms yield a part-based representation, and different factor matrices exhibit natural clustering properties. Several extensions of the above-stated feature engineering methods have been reported in the literature, including orthogonality-constrained factorization for hard clustering, and manifold learning to overcome inherent issues with these algorithms. +Other classes of feature engineering algorithms leverage a common hidden structure across multiple inter-related datasets to obtain a consensus (common) clustering scheme. An example is Multi-view Classification based on Consensus Matrix Decomposition (MCMD), which mines a common clustering scheme across multiple datasets. MCMD is designed to output two types of class labels (scale-variant and scale-invariant clustering), and: + +is computationally robust to missing information, +can obtain shape- and scale-based outliers, +and can handle high-dimensional data effectively. +Coupled matrix and tensor decompositions are popular in multi-view feature engineering.
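
As a rough illustration of how a non-negative factorization can yield cluster assignments, the sketch below factors a small matrix V into non-negative factors W and H and labels each row by its largest coefficient in W. It rests on several assumptions that are not in the text above: it uses the classic multiplicative-update rule for the squared-error objective, which is one common way to compute NMF and not necessarily what any of the cited methods use; the toy matrix, the function names, and the parameter values are all hypothetical; and it is written in plain Python for readability, not speed.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation keeps all updates non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy data: rows are samples, columns are non-negative features.
V = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
W, H = nmf(V, rank=2)
# Assign each sample to the component with the largest coefficient in its row of W.
labels = [max(range(2), key=lambda k: row[k]) for row in W]
print(labels, squared_error(V, W, H))
```

Reading cluster labels off the largest entry in each row of W is the simplest convention; the orthogonality-constrained variants mentioned above are aimed at making such hard assignments better defined.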
+ + +== Predictive modelling == +Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key components include feature creation from existing data, transforming and imputing missing or invalid features, reducing data dimensionality through methods like Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA), and selecting the most relevant features for model training based on importance scores and correlation matrices. +Features vary in significance. Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting). +Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization. Common causes include: + +Feature templates - implementing feature templates instead of coding new features +Feature combinations - combinations that cannot be represented by a linear system +Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection. + + +== Automation == +Automation of feature engineering is a research topic that dates back to the 1990s. Machine learning software that incorporates automated feature engineering has been commercially available since 2016. Related academic literature can be roughly separated into two types: + +Multi-relational Decision Tree Learning (MRDTL) uses a supervised algorithm that is similar to a decision tree. +Deep Feature Synthesis uses simpler methods. + + +=== Multi-relational Decision Tree Learning (MRDTL) === +Multi-relational Decision Tree Learning (MRDTL) extends traditional decision tree methods to relational databases, handling complex data relationships across tables. 
It uses selection graphs as decision nodes, which are refined systematically until a specific termination criterion is reached. +Most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation. + + +=== Open-source implementations === +There are a number of open-source libraries and tools that automate feature engineering on relational data and time series: + +featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning. +MCMD is an open-source feature engineering algorithm for joint clustering of multiple datasets. +OneBM or One-Button Machine combines feature transformations on relational data with feature selection techniques. OneBM helps data scientists reduce data exploration time, allowing them to try out many ideas in a short time. It also enables non-experts, who are not familiar with data science, to quickly extract value from their data with little effort, time, and cost. +getML community is an open source tool for automated feature engineering on time series and relational data. It is implemented in C/C++ with a Python interface. It has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats. +tsfresh is a Python library for feature extraction on time series data. It evaluates the quality of the features using hypothesis testing. +tsflex is an open source Python library for extracting features from time series data. Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel. +seglearn is an extension for multivariate, sequential time series data to the scikit-learn Python library. +tsfel is a Python package for feature extraction on time series data. +kats is a Python toolkit for analyzing time series data.
+ + +=== Deep feature synthesis === +The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition. + + +== Feature stores == +The feature store is where the features are stored and organized for the explicit purpose of being used to either train models (by data scientists) or make predictions (by applications that have a trained model). It is a central location where you can either create or update groups of features created from multiple different data sources, or create and update new datasets from those feature groups for training models or for use in applications that do not want to compute the features but just retrieve them when it needs them to make predictions. +A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used. +Feature stores can be standalone software tools or built into machine learning platforms. + + +== Alternatives == +Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error. Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering. However, deep learning algorithms still require careful preprocessing and cleaning of the input data. In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process. 
+ + +== See also == +Covariate +Data transformation +Feature extraction +Feature learning +Hashing trick +Instrumental variables estimation +Kernel method +List of datasets for machine learning research +Scale co-occurrence matrix +Space mapping + + +== References == + + +== Further reading == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Funding_by_lottery-0.md b/data/en.wikipedia.org/wiki/Funding_by_lottery-0.md new file mode 100644 index 000000000..53e71b2f9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Funding_by_lottery-0.md @@ -0,0 +1,40 @@ +--- +title: "Funding by lottery" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Funding_by_lottery" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:55.911338+00:00" +instance: "kb-cron" +--- + +Funding by lottery is the use of randomization in the allocation of science funding by research funding organizations. As of early 2025, about a dozen funders worldwide have implemented some form of funding by lottery in their funding calls. +There is no scholarly consensus on the benefits and drawbacks of funding by lottery. Proponents argue that it can help mitigate biases in funding allocation and minimize the high costs associated with grant writing and peer review. Critics, however, express concerns that diminishing the role of peer review panels in funding decisions could lead to a decline in the quality of funded projects and undermine public trust in funders, in scientific peer review, and ultimately in science at large. + +== Overview == +Science funding is generally distributed through competitive funding calls organized by research funding agencies (commonly referred to as "funders"). Eligible researchers submit research grant proposals in response to these calls, which are then evaluated by expert peer review panels. Funding is typically awarded to the proposals that receive the most positive evaluations from these panels. 
+Compared to this traditional, peer-reviewed science funding, funding by lottery fully or partly replaces peer review panel evaluations with random selection. In a full lottery, all eligible research grant proposals have an equal chance of winning the grant, with no input from a peer review panel. In a partial lottery, only eligible proposals that are positively evaluated by the peer review panel are entered into the lottery and have a chance to win the grant. +While some researchers advocate for the funding of science by means of full lotteries, at least for specific funding calls, all known examples of funding lotteries implemented by funders are partial lotteries. + +== History == +The selection mechanism behind funding by lottery has historical roots in the scrutiny and lot, a sortition system established in 14th-century Florence. Similar to contemporary funding by lottery, candidates in scrutiny and lot were first screened for eligibility, and then eligible individuals were randomly selected for political office. The motivation for using randomization is also comparable: in peer review, randomization is proposed as a way to reduce biases and prevent the concentration of funding in a few prestigious labs or research institutions. In Florentine sortition, randomization was designed to curb the concentration of political power among affluent families. +In the context of science funding, funding by lottery was first proposed by Daniel S. Greenberg in 1998. Greenberg argued that winning a grant in a competitive funding call is largely a matter of luck – an intuition that was later corroborated by empirical research, showing that peer review panels often struggle to distinguish among proposals around the funding line. Because the evaluations by peer review panels are so unreliable, Greenberg argued, a lottery would be an equally valid but more cost-effective method to allocate funding. 
+The idea remained theoretical until the 2010s, when a few funders began piloting the implementation of partial lotteries. Early adopters included the Health Research Council of New Zealand, the New Zealand Science for Technological Innovation, and the Volkswagen Foundation. These early experiments with partial lotteries were later followed by other Western funders, particularly for calls aimed at supporting transformative, innovative, blue skies, or high-risk, high-reward research. + +== Types == +Funding by lottery can be implemented in different ways. In practice, all funding lotteries used or in use by funders are partial lotteries, meaning that a peer review panel is in charge of evaluating proposals for merit and suitability, and randomization is introduced to assist in the final funding decision among positively evaluated proposals. Types of partial lotteries currently in use are: tie-breaking lotteries, partial lotteries "with bypass", and fundable-pool partial lotteries. + +=== Tie-breaking partial lotteries === +In tie-breaking partial lotteries, research grant proposals are first ranked by a peer review panel, and funding is awarded to the best-rated proposals. However, when multiple proposals receive the same evaluation at the funding threshold – meaning that the panel cannot distinguish among them – random selection is used to break the tie. +Funders using tie-breaking partial lotteries in all or some of their calls include the Swiss National Science Foundation, Natural Environment Research Council (UK), Research Council of Norway, and Science Foundation Ireland. + +=== Partial lotteries "with bypass" === +Partial lotteries "with bypass" rely on peer review panels to evaluate research grant proposals and categorize them as not fundable, fundable or outstanding. Not fundable proposals are declined. Fundable proposals enter the lottery pool. Outstanding proposals bypass the lottery and receive direct funding. 
+Funders using this implementation include the Volkswagen Foundation, Austrian Science Fund, and Novo Nordisk Foundation. + +=== Fundable-pool partial lotteries === +In fundable-pool partial lotteries, peer review panels evaluate research grant proposals and categorize them as either fundable or not fundable. Proposals deemed not fundable are declined, whereas those classified as fundable enter a lottery pool for random selection. +Adopters of fundable-pool partial lotteries include the Health Research Council of New Zealand, New Zealand Science for Technological Innovation, Canadian New Frontiers in Research Fund, and The British Academy. + +== Scholarly debate == +Surveys conducted in New Zealand and Germany suggest that researchers have mixed opinions on funding by lottery. Respondents seem to be more supportive of partial lotteries than full lotteries. In peer-reviewed articles and opinion pieces, scholars have outlined key arguments in favor and against the use of funding by lottery. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Funding_by_lottery-1.md b/data/en.wikipedia.org/wiki/Funding_by_lottery-1.md new file mode 100644 index 000000000..2a7b4c14a --- /dev/null +++ b/data/en.wikipedia.org/wiki/Funding_by_lottery-1.md @@ -0,0 +1,26 @@ +--- +title: "Funding by lottery" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Funding_by_lottery" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:55.911338+00:00" +instance: "kb-cron" +--- + +=== Arguments in favor === +The main argument in favor of funding by lottery is that randomization can help reduce the costs of grant writing and peer review while promoting fairer and more diverse funding outcomes. More specifically, funding by lottery may curb or eliminate biases in research evaluation, such as reviewer conservatism – i.e. the tendency to favor conventional ideas over more innovative or riskier ones – as well as bias against interdisciplinary research. 
+Furthermore, funding by lottery is seen as a way to counteract the growing concentration of research funding in the hands of a few renowned labs and research-performing institutions, a phenomenon known as the "Matthew effect" in science funding. +Lastly, some have argued that funding by lottery could remove some incentives for research misconduct. Because grant acquisition is considered a marker of academic success, researchers may feel pressured to obtain as much funding as possible. In extreme cases, this can lead to unethical practices, like submitting virtually identical research grant proposals to multiple funding calls – a problem known as "double-dipping". By decoupling grant acquisition from academic prestige, funding by lottery could help mitigate such incentives for misconduct. + +=== Arguments against === +Some scholars highlight the lack of empirical evidence supporting the claimed benefits of funding by lottery, particularly its potential to reduce costs and bias. Others argue that less drastic reforms to peer review could address its shortcomings more effectively. +Furthermore, critics of funding by lottery identify additional potential drawbacks. First, eliminating peer review panels, or reducing their role, would deprive applicants of valuable feedback that helps improve and refine their ideas and study designs. Second, randomization could weaken quality-based selection, creating incentives to submit lower-quality proposals, and ultimately leading to a decline in quality standards. +In addition, adopting funding by lottery could undermine the legitimacy of funders and their peer review panels, potentially damaging public trust in peer review and the scientific enterprise as a whole. 
+ +== See also == +Funding of science +Peer review +Science policy + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Geographic_analytics-0.md b/data/en.wikipedia.org/wiki/Geographic_analytics-0.md new file mode 100644 index 000000000..d9af43c6c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Geographic_analytics-0.md @@ -0,0 +1,72 @@ +--- +title: "Geographic analytics" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Geographic_analytics" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:12.686727+00:00" +instance: "kb-cron" +--- + +Geographic analytics is an analytical approach to strategic management and data analytics to make geographic decisions efficiently. Examples of such decisions are choosing the location for a warehouse or planning the regions for a marketing campaign. Data, information and framing conditions are visualized on maps to derive recommendations for action. +In comparison to geographic information systems (GIS), which primarily aim at the representation of information on maps (descriptive analytics), Geographic analytics additionally focuses on making business decisions based on the data visualization on the map (prescriptive analytics). + + +== Background == +Purely mathematically based approaches that are used in data analytics in order to support management decisions often have the disadvantage that they require a large amount of data. This often results in considerable efforts for gathering, cleaning and understanding the data. Furthermore, there can be “intangible” framing conditions that are disruptive to any purely data-driven optimization solution. +Example logistics: For the task to find the optimal location for a warehouse in a logistics network, so-called Center of Gravity models are being used. To minimize the cost, these models use transport volume data, customer locations, cost data, etc., in order to determine the optimal location. 
However, there are often framing conditions – for example, traffic infrastructure, borders, and regulatory or even physical hurdles – which are difficult to describe and model mathematically. +In practice, such framing conditions are often only recognized at the end of the data-driven analysis. They then have to be included in the model, which has to be recalculated. In the worst case, the problem has to be approached again from scratch. This delays the decision-making process and often entails significant additional effort. + + +== Application and objective == +Geographic analytics starts with the visualization of basic data on a map. With the involvement of experts from the field, the visualization is then used to determine the framing conditions and focal points of the business problem. +As a result, the solution space, i.e. the number of possible solutions of the data analysis, is reduced. In addition, framing conditions as well as data errors are recognized at this early stage of the analysis. Only then do traditional data analytics methods come into play to find the optimal solution to the problem. +With this approach, less data is required for the overall analysis, and the time and effort needed for the analysis are significantly reduced. Impracticable and flawed solutions are identified and excluded upfront. + + +== Areas of application == +Geographic analytics is used in connection with data analyses to support management decisions that contain a geographical component, such as location decisions, marketing campaigns, and service center placements. + + +=== Example industries === +Logistics / Supply chain management +Example: Planning the location of a distribution center to minimize logistics costs when distributing products to customers. 
+Retail trade +Example: Opening of a new shop at a strategically favorable location for commuters +Oil and gas industry +Example: Planning the locations of test wells in order to develop a new oil field at the lowest possible cost. + + +=== Example business areas === +Marketing +Example: Development of a regional marketing campaign for a new product +Infrastructure +Example: Traffic planning / planning of traffic expansions in a large city to reduce congestion times +Services +Example: Planning the placement of ATMs in order to achieve the highest possible degree of coverage in a region +Agriculture +Banking +Community Development & Planning +Disaster Management +Mapping & Navigation +Military +Public Health +Risk Assessment +Telecommunications Services + + +== Etymology == +The term and the methodology of geographic analytics were first described in 2013 by Jozo Acksteiner and Claudia Trautmann in the Supply Chain Management Review. + + +== Source == + This article incorporates text from this source, which is in the public domain: https://mgiss.co.uk/geographic-information-system-different-applications-of-gis/ + + +== References == + +Claudia Trautmann, Jozo Acksteiner: Geographic Analytics. How HP Visualizes its Supply Chain. In: Supply Chain Management Review January/February, 2013 (Geographic Analytics, How HP Visualizes its Supply Chain) +Travis Parker, Jozo Acksteiner: HP Embraces a New Level of Global Analytics. In: Supply Chain Brain, October, 2015 (HP Embraces a New Level of Global Analytics) +Dan Gilmore: Supply Chain News. Trip Report. CSCMP Conference Part 2. In: SupplyChainDigest, October, 2015 (Supply Chain News. Trip Report) +Jörn Kohlhammer et al.: Solving Problems with Visual Analytics. In: Procedia Computer Science, December 2011 (Solving Problems with Visual Analytics) +Osthessenzeitung: Logistics Day. Digitalization and Management. 
In: Osthessenzeitung, April, 2018 (Logistics Day Digitalization and Management) \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Health_care_analytics-0.md b/data/en.wikipedia.org/wiki/Health_care_analytics-0.md new file mode 100644 index 000000000..347e3bfba --- /dev/null +++ b/data/en.wikipedia.org/wiki/Health_care_analytics-0.md @@ -0,0 +1,34 @@ +--- +title: "Health care analytics" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Health_care_analytics" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:13.909352+00:00" +instance: "kb-cron" +--- + +Health care analytics comprises the analysis activities that can be undertaken using data collected from four areas within healthcare: (1) claims and cost data, (2) pharmaceutical and research and development (R&D) data, (3) clinical data (such as that collected from electronic health records (EHRs)), and (4) patient behaviors and preferences data (e.g. patient satisfaction or retail purchases, such as data captured in stores selling personal health products). Health care analytics is a growing industry in many countries including the United States, where it is expected to grow to more than $31 billion by 2022. It is also increasingly important to governments and public health agencies to support health policy and meet public expectations for transparency, as accelerated by the COVID-19 pandemic. +Health care analytics allows for the examination of patterns in various healthcare data in order to determine how clinical care can be improved for patients and provider teams, while limiting excessive spending and improving the health of populations. Areas of the industry focus on clinical analysis, financial analysis, supply chain analysis, as well as marketing, fraud and HR analysis. 
There is increasing demand in many countries to incorporate social indicators of patients and providers within health care analytics, to inform improvements for health equity, for example by addressing racism in healthcare or the health of Indigenous peoples. + +== Balancing Interests in Healthcare Analytics: innovation, privacy, and patient safety == +Healthcare analytics requires access to comprehensive data, but its usefulness depends on a balance between broad data collection and limits that guard against risks to patient rights, erroneous conclusions or statistical predictions, and misuse of results. Appropriate policies could support gains in process improvements, cost reductions, personalized medicine, and population health. Additionally, providing incentives to encourage appropriate use may address some concerns but could also inadvertently incentivize the misuse of data. Lastly, creating standards for IT infrastructure may encourage data sharing and use, but those standards would need to be reevaluated on a regular, ongoing basis as the fast pace of technological innovation causes standards and best practices to become quickly outdated. +Several areas to improve healthcare analytics through national, regional and local collaborations and legislation have been identified. + +=== Limiting data collection === +The needs of healthcare providers, government agencies, health plans, and researchers for quality data must be met to ensure adequate medical care and to make improvements to the healthcare system, while still ensuring patients' right to privacy. Data collection should be limited to what is necessary for medical care and, beyond that care, by patient preference. Such limits would protect patient privacy while minimizing infrastructure costs to house data. When possible, patients should be informed about what data is collected prior to engaging in medical services. 
For instance, in Canada, data collection among Indigenous populations is governed by principles of First Nations ownership, control, access, and possession. + +=== Limiting data use === +Expanding availability of big data increases the risk of statistical errors, erroneous conclusions and predictions, and misuse of results. Evidence supports use of data for process improvements, cost reductions, personalized medicine, and public health. Innovative uses for individual health can harm underserved populations. In the United States, limiting the use of data for denial and exclusion prevents it from being used to determine eligibility for benefits or care, and is harmonized with other federal laws such as the Fair Credit Reporting Act and with anti-discrimination laws like the Civil Rights Act and the Genetic Information Nondiscrimination Act. + +=== Providing incentives to encourage appropriate use === +Increasing vertical integration in both public and private sector providers has created massive databases of electronic health records. In the United States, the ACA has provided Medicare and Medicaid incentives to providers to adopt EHRs. Large healthcare institutions also have internal motivation to apply healthcare analytics, largely for reducing costs by providing preventative care. Policy could increase data use by incentivizing insurers and providers to increase population tracking, which improves outcomes. + +=== Creating standards for the IT infrastructure === +Inappropriate IT infrastructure likely limits healthcare analytics findings and their impact on clinical practice. Establishing standards ensures IT infrastructure capable of housing big data balanced with addressing accessibility, ownership, and privacy. New possibilities could be explored such as private clouds and “a virtual sandbox” consisting of filtered data authorized to the researchers accessing the sandbox. 
Standards make it easier for different medical and research organizations to coordinate and share information, which significantly improves patient care by improving communication between providers and reducing duplication and costs. +Minimum standards are necessary to balance privacy and accessibility. Standardization helps improve patient care by facilitating research collaboration and easier communication between medical providers. The research can yield preventive care concepts that can reduce patient caseload and avoid long-term medical costs. + +== Healthcare analytics in UAE == +The Dubai Pharmacy College (DPCG) is a pioneer in healthcare data analytics education in the GCC region. DPC offers a postgraduate certificate course in "Healthcare business data analytics" for healthcare professionals, to encourage them to explore the concept of healthcare data analytics and apply innovations in healthcare computing technologies. The aim of the certification program is to provide a platform for interprofessional researchers to utilize the fundamental technology, including software applications for intelligent data acquisition, processing, and analysis of healthcare data. + +== Healthcare analytics in the United States == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Health_care_analytics-1.md b/data/en.wikipedia.org/wiki/Health_care_analytics-1.md new file mode 100644 index 000000000..26c87bf86 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Health_care_analytics-1.md @@ -0,0 +1,50 @@ +--- +title: "Health care analytics" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Health_care_analytics" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:13.909352+00:00" +instance: "kb-cron" +--- + +=== Federal government role in health IT === +In the United States, multiple federal entities are heavily involved in health analytics infrastructure. 
Within the executive branch, the administration itself, Centers for Medicare and Medicaid Services (CMS), and Office of the National Coordinator for Health Information Technology (ONC) each have strategic plans and are involved in determining regulation. Within the legislative branch, multiple committees within the House of Representatives and Senate hold hearings and have opinions on using data and technology to reduce costs and improve outcomes in healthcare. +The ONC issued the Federal Health IT Strategic Plan 2015-2020. The plan outlines the steps federal agencies will take to achieve widespread use of health information technology (health IT) and electronic health information to enhance the health IT infrastructure, to advance person-centered and self-managed health, to transform health care delivery and community health, and to foster research, scientific knowledge and innovation. The plan is intended “to provide clarity in federal policies, programs, and actions and includes strategies to align program requirements, harmonize and simplify regulations, and aims to help health IT users to advance the learning health system to achieve better health.” +The Strategic Plan includes several key initiatives employing multiple strategies to meet its goals. These include: (1) finalizing and implementing an interoperability roadmap; (2) protecting the privacy and security of health information; (3) identifying, prioritizing and advancing technical standards; (4) increasing user and market confidence in the safety and safe use of health IT; (5) advancing a national communication infrastructure; and (6) collaborating among all stakeholders. 
+ +=== Challenges to address === + +==== Creating an interoperability roadmap ==== +Dr.Aryan Chavan challenges to be addressed: (1) variation in how standards are tested and implemented; (2) variation in how health IT stakeholders interpret and implement policies and legal requirements; and (3) reluctance of health IT stakeholders to share and collaborate in ways that might foster consumer engagement. +The ONC is working to develop a policy advisory for health information exchange by 2017 that will define and outline basic expectations for trading partners around health information exchange, interoperability and the exchange of information. Current federal and state law only prohibits certain kinds of information blocking in limited and narrow circumstances, for example, under the Health Insurance Portability and Accountability Act (HIPAA) or the Anti-Kickback statute. + +==== Protecting privacy and security ==== +In addition to HIPAA, many states have their own privacy laws protecting an individual’s health information. State laws that are contrary to HIPAA are generally preempted by the federal requirements unless a specific exception applies. For example, if the state law relates to identifiable health information and provides greater privacy protections, then it is not preempted by HIPAA. Since privacy laws may vary from state-to-state, it may create confusion among health IT stakeholders and make it difficult to ensure privacy compliance. + +==== Establishing common technical standards ==== +Use of common technical standards is necessary to move electronic health information seamlessly and securely. While some clinical record content, such as laboratory results and clinical measurements are easily standardized other content, such as provider notes may be more difficult to standardize. Methods need to be identified that allow for the standardization of provider notes and other traditionally “free form text” data. 
+The ONC HIT Certification Program certifies that a system meets the technological capability, functionality and security requirements adopted by HHS. ONC will assess the program on an ongoing basis “to ensure it can address and reinforce health IT applications and requirements that support federal value-based and alternative payment models.” + +==== Increasing confidence in safety and safe use of health IT ==== +Health care consumers, providers and organizations need to feel confident that the health IT products, systems or services they are using are not only secure, safe and useful but that they can switch between products, systems or services without loss of valuable information or undue financial burden. Implementation of the Federal Health IT Strategic Plan 2015-2020, along with the 2013 HHS Health IT Patient Safety Action and Surveillance Plan and 2012 Food and Drug Administration Safety and Innovation Act will attempt to address these concerns. + +==== Developing national communications structure ==== +A national communications infrastructure is necessary to enable the sharing of electronic health information between stakeholders, including providers, individuals and national emergency first responders. It is also necessary for delivering telehealth services or using mobile health applications. 
“Expanded, secure, and affordable high-speed wireless and broadband services, choice, and spectrum availability will support electronic health information sharing and use, support the communication required for care delivery, and support the continuity of health care and public health services during disasters and public health emergencies.” + +==== Stakeholder collaboration ==== +The federal government in its role as contributor, beneficiary and collaborator “aims to encourage private-sector innovators and entrepreneurs, as well as researchers, to use government and government-funded data to create useful applications, products, services, and features that help improve health and health care.” HHS receives funds from the Patient-Centered Outcomes Research Trust Fund to build data capacity for patient-centered outcomes research. It is estimated that HHS will receive over $140 million for the period between 2011 and 2019. These funds will be used “to enable a comprehensive, interoperable, and sustainable data network infrastructure to collect, link, and analyze data from multiple sources to facilitate patient-centered outcomes research.” + +==== Legislation ==== +Meaningful Use, the Patient Protection and Affordable Care Act (ACA) and the declining cost of data storage result in health data being stored, shared, and used by multiple providers, insurance companies, and research institutions. Concerns exist about how organizations gather, store, share, and use personal information, including privacy and confidentiality concerns, as well as concerns over the quality and accuracy of data collected. Expansion of existing regulation can ensure patient privacy and guard patient safety to balance access to data and the ethical impact of exposing that data. + +== See also == +health information management +health informatics +public health informatics +human resources for health information systems (HRHIS) + +== References == + +== Further reading == +Adam Tanner (2017). 
Our Bodies, Our Data: How Companies Make Billions Selling Our Medical Records. Beacon Press. ISBN 978-0807033340. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Henry_Oldenburg-0.md b/data/en.wikipedia.org/wiki/Henry_Oldenburg-0.md new file mode 100644 index 000000000..d36c2936f --- /dev/null +++ b/data/en.wikipedia.org/wiki/Henry_Oldenburg-0.md @@ -0,0 +1,70 @@ +--- +title: "Henry Oldenburg" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Henry_Oldenburg" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:01.743243+00:00" +instance: "kb-cron" +--- + +Henry Oldenburg (also Henry Oldenbourg) (c. 1618 as Heinrich Oldenburg – 5 September 1677) was a German theologian, diplomat, and natural philosopher, known as one of the creators of modern scientific peer review. He was one of the foremost intelligencers of 17th-century Europe, with a network of correspondents to rival those of Fabri de Peiresc, Marin Mersenne, and Ismaël Boulliau. At the foundation of the Royal Society in London, he took on the task of foreign correspondence, as the first Secretary. + + +== Early life == +Born in Bremen, Germany, he was trained in theology and received his degree from the local Gymnasium illustre on 2 November 1639. From early on, he had a very firm grasp of the German, Latin, and Greek languages. His movements during the 1640s are unclear, but he is thought to have worked as a tutor in England for much of the decade. In 1648 he left England and spent some time in Leiden and Utrecht in the Dutch Republic, where he became conversant in the Dutch language. After a short stay back in Bremen in the spring, he arrived back in London in July 1653 as a diplomatic envoy of the city of Bremen's senate to the Lord Protector Oliver Cromwell for matters concerning the ongoing First Anglo-Dutch War. 
+Settling then in England of the Interregnum, he forged a strong relationship with his lifelong patron Robert Boyle, and with John Milton, who wrote of him approvingly that he had "learnt to speak our language more accurately and fluently than any other foreigner I have ever known" (Correspondence, 1.34). Oldenburg eventually became the tutor to Boyle's nephew, the politician Richard Jones, and travelled with him through France from 1657 to 1660. Here Oldenburg also added to his intellectual baggage the French language, the last European language in which he was completely conversant. +Oldenburg married his second wife, Dora Katherina Dury (1654–77), the daughter of Dorothy and John Dury in London on 13 August 1668. Either through Milton, whom he had met earlier in his diplomatic mission, or through Lady Ranelagh, sister to Boyle and the mother of Richard Jones, Oldenburg gained entry to an important intellectual circle, including his fellow German native, Samuel Hartlib, whose extensive web of correspondents Oldenburg was to take over, John Dury who became his father-in-law, and others such as the economist William Petty. Among Oldenburg's correspondents at this time was Baruch Spinoza, whom he was introduced to on a trip to the Netherlands, and to whom he presented a volume of writings on scientific topics by Boyle. + + +== Secretary of the Royal Society == +After the Restoration he became an early member (original fellow) of the Royal Society (founded in 1660), and served as its first secretary along with John Wilkins, maintaining an extensive network of scientific contacts through Europe. He also became the founding editor of the Philosophical Transactions of the Royal Society. Oldenburg began the practice of sending submitted manuscripts to experts who could judge their quality before publication. This was the beginning of both the modern scientific journal and the practice of peer review. 
Philosophical Transactions of the Royal Society continues today and is the longest-running scientific journal in the world. +He was briefly imprisoned in the Tower of London as a suspected spy in 1667, during the Second Anglo-Dutch War. Oldenburg's correspondence was linked to support from the politician Sir Joseph Williamson; in part Oldenburg supplied Williamson with intelligence information. +Oldenburg enjoyed good health in his lifetime, but he fell seriously ill on 3 September 1677, and he died two days thereafter at his home in Pall Mall, London. He was interred on 7 September at St Mary the Virgin, Bexley. His widow died ten days later. + + +== Foreign correspondents == + + +=== Denmark === +Rasmus Bartholin. + + +=== Flanders === +René François Walter de Sluse. + + +=== France === +Adrien Auzout, Henri Justel, Pierre Petit, Ismaël Bullialdus. + + +=== Germany === +Sir William Curtius, Johann Hevelius, Gottfried Leibniz, Philipp Jacob Sachs von Lewenheimb, Johann Daniel Major, Ehrenfried Walther von Tschirnhaus, Martin Vogel. + + +=== Italy === +Paolo Boccone, Giovanni Domenico Cassini, Marcello Malpighi, Manfredo Settala. + + +=== Netherlands === +Reinier de Graaf, Christiaan Huygens and his father, Constantijn Huygens, Antoni van Leeuwenhoek, Willem Ten Rhijne, Benedictus/Baruch Spinoza, Peter Serrarius, Jan Swammerdam, Isaac Vossius. + + +== See also == +Peer review + + +== References == + + +== Further reading == +Jean-Pierre Vittu, "Henry Oldenburg 'Grand intermédiaire'", in "Les grands intermédiaires culturels de la République des Lettres", pub. by Christiane Berkvens-Stevelinck, Hans Bots and Jens Häseler, Paris, Honoré Champion, 2005, pp. 184–209 +McKie, Douglas. "The arrest and imprisonment of Henry Oldenburg". Notes and Records of the Royal Society of London. 6 (1948/49): 28–47. +Thomas Elsmann: Im Schatten des Kaufmanns – Bremische Gelehrte 1600–1900, Bremen: Schünemann 2012, S. 
80–99 + + +== External links == + Media related to Henry Oldenburg at Wikimedia Commons +Works by Henry Oldenburg at Project Gutenberg +Works by or about Henry Oldenburg at the Internet Archive +The Correspondence of Henry Oldenburg in EMLO \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Iconography_of_correlations-0.md b/data/en.wikipedia.org/wiki/Iconography_of_correlations-0.md new file mode 100644 index 000000000..7af7917f1 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Iconography_of_correlations-0.md @@ -0,0 +1,29 @@ +--- +title: "Iconography of correlations" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Iconography_of_correlations" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:15.084817+00:00" +instance: "kb-cron" +--- + +In exploratory data analysis, the iconography of correlations, or representation of correlations, is a data visualization technique which replaces a numeric correlation matrix by its graphical projection onto a diagram. Correlations between variables are represented by solid lines (positive correlations) or dotted lines (negative correlations) plotted on the diagram, where shorter lengths or thicker lines (or both) represent greater projection components and thus greater degree of correlation. + + +== History == +The idea of using graphical models to represent correlations has a notable application in genomic mapping, though the iconography of correlations is more general for not assuming the probability distribution, since it only relies on representing the correlation coefficients geometrically. +The iconography of correlations first dates to 1975, applied to marine geochemistry in a 1981 thesis, and later in a 1982 data analysis article. Afterwards, the method was applied widely in the aerospace industry but for about fifteen years manufacturers kept it fairly confidential; generally, they preferred not to broadcast useful techniques to their competitors. 
+In 1997 the first company was incorporated to distribute iconography of correlations software. Since then the topic of iconography of correlations has been incorporated into university courses, and typical topical articles' citation lists have rapidly and greatly expanded, particularly in the fields of medicine and mass spectrometry. + + +== See also == +Bayesian network – Probabilistic graphical representation of causal relationships + + +== References == + + +== External links == +Lesty, Michel. "Une autre façon de présenter l'iconographie des corrélations" [Another way to present the iconography of correlations]. La représentation multidimensionnelle [Multidimensional representation]. coryent.com (commercial site). Tutorials (in French). Versailles, FR: CORYENT Conseil / Logiciel CORICO. +Girondot, Marc (17 January 2018). "Multivariable analysis and correlation of iconography". biostatsr.blogspot.com (academic's blog). Saclay, FR. ... about biostatistics ... by Professor Marc Girondot, University Paris Saclay. — Brief introduction to correlation of iconography by a biostatistician. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Ingelfinger_rule-0.md b/data/en.wikipedia.org/wiki/Ingelfinger_rule-0.md new file mode 100644 index 000000000..799ff3641 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Ingelfinger_rule-0.md @@ -0,0 +1,32 @@ +--- +title: "Ingelfinger rule" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Ingelfinger_rule" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:57.068716+00:00" +instance: "kb-cron" +--- + +The Ingelfinger rule is an eponymous rule named after Franz J. Ingelfinger, The New England Journal of Medicine (NEJM) editor-in-chief who enunciated it in 1969. The rule, as originally articulated in the editorial "Definition of 'Sole Contribution'", stated that NEJM would not publish findings that had been published elsewhere. 
Though originally meant only for NEJM, the guideline was subsequently adopted by several other scientific journals, and it has shaped scientific publishing ever since. Historically it has also helped to ensure that the journal's content is fresh and does not duplicate content previously reported elsewhere, and it seeks to protect the scientific embargo system. +A similar policy had been earlier expressed in 1960 by Samuel Goudsmit, editor of the Physical Review Letters, but it did not become as well known. +The Ingelfinger rule has been seen as having the aim of preventing authors from performing duplicate publications which would unduly inflate their publication record. On the other hand, it has also been stated that the real reason for the Ingelfinger rule is to protect the journals' revenue stream, and with the increase in popularity of preprint servers such as arXiv, bioRxiv, and HAL many journals have loosened their requirements concerning the Ingelfinger rule. In a defense of the policy, the journal said in an editorial that the practice discouraged scientists from talking to the media before their work was peer reviewed. + + +== See also == +List of academic journals by preprint policy +News embargo +Network effect + + +== References == + + +== Further reading == +Relman, AS (1981). "The Ingelfinger Rule". The New England Journal of Medicine. 305 (14): 824–6. doi:10.1056/NEJM198110013051408. PMID 7266634. +Spain, A (26 February 2011). "Casting a critical eye on the embargo system: one year of Embargo Watch". Association of British Science Writers. Archived from the original on 2017-03-25. Retrieved 2017-03-24. +Altman, LK (1996). "The Ingelfinger rule, embargoes, and journal peer review–Part 1". The Lancet. 347 (9012): 1382–6. doi:10.1016/S0140-6736(96)91016-8. PMID 8637347. S2CID 44524038. +Toy, J (2002). "The Ingelfinger Rule: Franz Ingelfinger at the New England Journal of Medicine 1967–77" (PDF). Science Editor. 25 (6): 195–198. +Harnad, S (2000). 
"Ingelfinger Over-Ruled: The Role of the Web in the Future of Refereed Medical Journal Publishing". The Lancet Perspectives. 356: s16. doi:10.1016/S0140-6736(00)92002-6. PMID 11191471. +White, E (2014). "Why the Ecology Letters editorial board should reconsider its No vote on preprints". Jabberwocky Ecology. +Desjardins-Proulx, P; White, EP; Adamson, JJ; Ram, K; Poisot, T; Gravel, D (2013). "The Case for Open Preprints in Biology". PLOS Biology. 11 (5) e1001563. doi:10.1371/journal.pbio.1001563. PMC 3653830. PMID 23690752. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Interdisciplinary_peer_review-0.md b/data/en.wikipedia.org/wiki/Interdisciplinary_peer_review-0.md index 90e00a02b..a3ae62ba2 100644 --- a/data/en.wikipedia.org/wiki/Interdisciplinary_peer_review-0.md +++ b/data/en.wikipedia.org/wiki/Interdisciplinary_peer_review-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Interdisciplinary_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:27:44.415051+00:00" +date_saved: "2026-05-05T09:52:58.269045+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Interleaving_distance-0.md b/data/en.wikipedia.org/wiki/Interleaving_distance-0.md new file mode 100644 index 000000000..3f030c12f --- /dev/null +++ b/data/en.wikipedia.org/wiki/Interleaving_distance-0.md @@ -0,0 +1,625 @@ +--- +title: "Interleaving distance" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Interleaving_distance" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:16.289352+00:00" +instance: "kb-cron" +--- + +In topological data analysis, the interleaving distance is a measure of similarity between persistence modules, a common object of study in topological data analysis and persistent homology. The interleaving distance was first introduced by Frédéric Chazal et al. in 2009. 
Since then, it and its generalizations have been a central consideration in the study of applied algebraic topology and topological data analysis. + + +== Definition == +A persistence module $\mathbb{V}$ is a collection $(V_{t} \mid t \in \mathbb{R})$ of vector spaces indexed over the real line, along with a collection $(v_{t}^{s} : V_{s} \to V_{t} \mid s \leq t)$ of linear maps such that $v_{t}^{t}$ is always an isomorphism, and the relation $v_{t}^{s} \circ v_{s}^{r} = v_{t}^{r}$ is satisfied for every $r \leq s \leq t$. The case of $\mathbb{R}$ indexing is presented here for simplicity, though the interleaving distance can be readily adapted to more general settings, including multi-dimensional persistence modules. +Let $\mathbb{U}$ and $\mathbb{V}$ be persistence modules. Then for any $\delta \in \mathbb{R}$, a $\delta$-shift is a collection $(\phi_{t} : U_{t} \to V_{t+\delta} \mid t \in \mathbb{R})$ of linear maps between the persistence modules that commute with the internal maps of $\mathbb{U}$ and $\mathbb{V}$.
+The persistence modules $\mathbb{U}$ and $\mathbb{V}$ are said to be $\delta$-interleaved if there are $\delta$-shifts $\phi_{t} : U_{t} \to V_{t+\delta}$ and $\psi_{t} : V_{t} \to U_{t+\delta}$ such that the following diagrams commute for all $s \leq t$. + +It follows from the definition that if $\mathbb{U}$ and $\mathbb{V}$ are $\delta$-interleaved for some $\delta$, then they are also $(\delta + \varepsilon)$-interleaved for any positive $\varepsilon$. Therefore, in order to find the closest interleaving between the two modules, we must take the infimum across all possible interleavings. +The interleaving distance between two persistence modules $\mathbb{U}$ and $\mathbb{V}$ is defined as $d_{I}(\mathbb{U}, \mathbb{V}) = \inf\{\delta \mid \mathbb{U} \text{ and } \mathbb{V} \text{ are } \delta\text{-interleaved}\}$. + + +== Properties == + + +=== Metric properties === +It can be shown that the interleaving distance satisfies the triangle inequality.
Namely, given three persistence modules $\mathbb{U}$, $\mathbb{V}$, and $\mathbb{W}$, the inequality $d_{I}(\mathbb{U}, \mathbb{W}) \leq d_{I}(\mathbb{U}, \mathbb{V}) + d_{I}(\mathbb{V}, \mathbb{W})$ is satisfied. +On the other hand, there are examples of persistence modules that are not isomorphic but that have interleaving distance zero. Furthermore, if no suitable $\delta$ exists then two persistence modules are said to have infinite interleaving distance. These two properties make the interleaving distance an extended pseudometric, which means non-identical objects are allowed to have distance zero, and objects are allowed to have infinite distance, but the other properties of a proper metric are satisfied. +Further metric properties of the interleaving distance and its variants were investigated by Luis Scoccola in 2020. + + +=== Computational complexity === +Computing the interleaving distance between two single-parameter persistence modules can be accomplished in polynomial time. On the other hand, it was shown in 2018 that computing the interleaving distance between two multi-dimensional persistence modules is NP-hard.
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Item_tree_analysis-0.md b/data/en.wikipedia.org/wiki/Item_tree_analysis-0.md new file mode 100644 index 000000000..e89ccc871 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Item_tree_analysis-0.md @@ -0,0 +1,162 @@ +--- +title: "Item tree analysis" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Item_tree_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:17.467755+00:00" +instance: "kb-cron" +--- + +Item tree analysis (ITA) is a data analytical method which allows constructing a +hierarchical structure on the items of a questionnaire or test from observed response +patterns. Assume that we have a questionnaire with m items and that subjects can +answer positive (1) or negative (0) to each of these items, i.e. the items are +dichotomous. If n subjects answer the items this results in a binary data matrix D +with m columns and n rows. +Typical examples of this data format are test items which can be solved (1) or failed +(0) by subjects. Other typical examples are questionnaires where the items are +statements to which subjects can agree (1) or disagree (0). +Depending on the content of the items it is possible that the response of a subject to an +item j determines her or his responses to other items. It is, for example, possible that +each subject who agrees to item j will also agree to item i. In this case we say that +item j implies item i (short $i \rightarrow j$). The goal of an ITA is to uncover such +deterministic implications from the data set D. + +== Algorithms for ITA == +ITA was originally developed by Van Leeuwe in 1974. The result of his algorithm, +which we refer to in the following as Classical ITA, is a logically consistent set of +implications $i \rightarrow j$.
Logically consistent means that if i implies j and j implies k then i implies k for each triple i, j, k of items. Thus the outcome of an ITA is a reflexive and transitive relation on the item set, i.e. a quasi-order on the items. +A different algorithm to perform an ITA was suggested in Schrepp (1999). This algorithm is called Inductive ITA. +Classical ITA and inductive ITA both construct a quasi-order on the item set by explorative data analysis, but the two methods use different algorithms to construct this quasi-order. For a given data set the resulting quasi-orders from classical and inductive ITA will usually differ. +A detailed description of the algorithms used in classical and inductive ITA can be found in Schrepp (2003) or Schrepp (2006)[1]. In a recent paper (Sargin & Ünlü, 2009) some modifications to the algorithm of inductive ITA are proposed, which improve the ability of this method to detect the correct implications from data (especially in the case of higher random response error rates). + +== Relation to other methods == +ITA belongs to a group of data analysis methods called Boolean analysis of questionnaires. +Boolean analysis was introduced by Flament in 1976. The goal of a Boolean analysis is to +detect deterministic dependencies (formulas from Boolean logic connecting the items, like for example $i \rightarrow j$, $i \wedge j \rightarrow k$, and $i \vee j \rightarrow k$) between the items of a questionnaire or test. +Since the basic work of Flament (1976) a number of different methods for boolean analysis +have been developed. See, for example, Van Buggenhaut and Degreef (1987), Duquenne (1987) or Theuns (1994). +These methods share the goal to derive deterministic dependencies between the items of a +questionnaire from data, but differ in the algorithms to reach this goal.
A comparison of ITA +to other methods of boolean data analysis can be found in Schrepp (2003). + +== Applications == +Several research papers describe concrete applications of item tree analysis. +Held and Korossy (1998) analyze implications on a set of algebra problems with classical ITA. Item tree analysis is also used in a number of social science studies to get insight into the structure of dichotomous data. In Bart and Krus (1973), for example, a predecessor of ITA is used to establish a hierarchical order on items that describe socially unaccepted behavior. In Janssens (1999) a method of Boolean analysis is used to investigate the +integration process of minorities into the value system of the dominant culture. Schrepp describes several applications of inductive ITA in the analysis of dependencies between items of social science questionnaires. + +== Example of an application == +To show the possibilities of an analysis of a data set by ITA we analyse the statements of question 4 of the International Social Survey Programme (ISSP) for the year 1995 by inductive and classical ITA. +The ISSP is a continuing annual program of cross-national collaboration on surveys covering important topics for social science research. The program conducts each year one survey with comparable questions in each of the participating nations. The theme of the 1995 survey was national identity. We analyze the results for question 4 for the data set of Western Germany. +The statement for question 4 was: +Some people say the following things are important for being truly German. Others say they are not important. How important do you think each of the following is: +1. to have been born in Germany +2. to have German citizenship +3. to have lived in Germany for most of one’s life +4. to be able to speak German +5. to be a Christian +6. to respect Germany’s political institutions +7.
to feel German +The subjects had the response possibilities Very important, Important, Not very important, Not important at all, and Can’t choose to answer the statements. +To apply ITA to this data set we changed the answer categories. Very important and Important are coded as 1. Not very important and Not important at all are coded as 0. Can’t choose was handled as missing data. +The following figure shows the resulting quasi-orders $\leq_{IITA}$ from inductive ITA and $\leq_{CITA}$ from classical ITA. + +== Available software == +The program ITA 2.0 implements both classical and inductive ITA. The program is available on [2]. A short documentation of the program is available in [3]. + +== See also == +Item response theory \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Item_tree_analysis-1.md b/data/en.wikipedia.org/wiki/Item_tree_analysis-1.md new file mode 100644 index 000000000..9acc96b02 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Item_tree_analysis-1.md @@ -0,0 +1,29 @@ +--- +title: "Item tree analysis" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Item_tree_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:17.467755+00:00" +instance: "kb-cron" +--- + +== Notes == + +== References == +Bart, W. M., & Krus, D. J. (1973). An ordering-theoretic method to determine hierarchies among items. Educational and psychological measurement, 33, 291–300. +Duquenne V (1987). Conceptual Implications Between Attributes and some Representation Properties for Finite Lattices. In B Ganter, R Wille, K Wolfe (eds.), Beiträge zur Begriffsanalyse: Vorträge der Arbeitstagung Begriffsanalyse, Darmstadt 1986, pp. 313–339. Wissenschafts-Verlag, Mannheim. +Flament C (1976). L’Analyse Booléenne de Questionnaire. Mouton, Paris. +Held, T., & Korossy, K. (1998).
Data-analysis as heuristic for establishing theoretically founded item structures. Zeitschrift für Psychologie, 206, 169–188. +Janssens, R. (1999). A Boolean approach to the measurement of group processes and attitudes. The concept of integration as an example. Mathematical Social Sciences, 38, 275–293. +Schrepp M (1999). On the Empirical Construction of Implications on Bi-valued Test Items. Mathematical Social Sciences, 38(3), 361–375. +Schrepp, M (2002). Explorative analysis of empirical data by boolean analysis of questionnaires. Zeitschrift für Psychologie, 210/2, S. 99-109. +Schrepp, M. (2003). A method for the analysis of hierarchical dependencies between items of a questionnaire. Methods of Psychological Research, 19, 43–79. +Schrepp, M. (2006). ITA 2.0: A program for Classical and Inductive Item Tree Analysis. Journal of Statistical Software, Vol. 16, Issue 10. +Schrepp, M. (2006). Properties of the correlational agreement coefficient: A comment to Ünlü & Albert (2004). Mathematical Social Science, Vol. 51, Issue 1, 117–123. +Schrepp, M. (2007). On the evaluation of fit measures for quasi-orders. Mathematical Social Sciences Vol. 53, Issue 2, 196–208. +Theuns P (1994). A Dichotomization Method for Boolean Analysis of Quantifiable Cooccurence Data. In G Fischer, D Laming (eds.), Contributions to Mathematical Psychology, Psychometrics and Methodology, Scientific Psychology Series, pp. 173–194. Springer-Verlag, New York. +Ünlü, A., & Albert, D. (2004). The Correlational Agreement Coefficient CA - a mathematical analysis of a descriptive goodness-of-fit measure. Mathematical Social Sciences, 48, 281–314. +Van Buggenhaut J, Degreef E (1987). On Dichotomization Methods in Boolean Analysis of Questionnaires. In E Roskam, R Suck (eds.), Mathematical Psychology in Progress, Elsevier Science Publishers B.V., North Holland. +Van Leeuwe, J.F.J. (1974). Item tree analysis. Nederlands Tijdschrift voor de Psychologie, 29, 475–484. +Sargin, A., & Ünlü, A. (2009). 
Inductive item tree analysis: Corrections, improvements, and comparisons. Mathematical Social Sciences, 58, 376–392. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/JournalReview.org-0.md b/data/en.wikipedia.org/wiki/JournalReview.org-0.md new file mode 100644 index 000000000..d2bdf5d47 --- /dev/null +++ b/data/en.wikipedia.org/wiki/JournalReview.org-0.md @@ -0,0 +1,30 @@ +--- +title: "JournalReview.org" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/JournalReview.org" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:59.408510+00:00" +instance: "kb-cron" +--- + +JournalReview.org was an online interdisciplinary journal club. It hosted an international community of doctors and existed for the purpose of facilitating critical discussion and post-publication peer review of all medical literature indexed by the National Library of Medicine in PubMed. It was part of the movement towards open peer review, telemedicine, and self-archiving/open access. + + +== Contents == +It was web-based and consisted of commentary and discussion related to published medical literature. +Members could participate individually, or by creating their own "journal clubs" – collections of members who were actively participating in discussion of mutual interest. Some residency training programs adopted this resource to meet ACGME requirements. Members utilized this resource to suggest future research, and to identify bias or limitations in published work. Educators utilized this resource to help stimulate and document critical discussion of assigned journal articles. Authors utilized this resource to answer questions about, and provide supplements to, their published work. This resource was also utilized by the public as a vehicle to better understand published trade literature. +Members represented all the major specialties as well as areas of basic science.
Approximately 7,000 doctors, health care professionals, researchers, residency training programs, and students from all over the world participated. It was listed as a recommended resource by the New York Public Library, the Dalhousie University College of Pharmacy, the State University of New York Downstate Medical Library, as well as several other related websites. + + +== See also == +Peerage of Science +Publons +PubPeer + + +== References == + + +== External links == +JournalReview.org Website \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Key–value_database-0.md b/data/en.wikipedia.org/wiki/Key–value_database-0.md new file mode 100644 index 000000000..30efd7d81 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Key–value_database-0.md @@ -0,0 +1,33 @@ +--- +title: "Key–value database" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Key–value_database" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:18.590764+00:00" +instance: "kb-cron" +--- + +A key-value database, or key-value store, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary. Dictionaries contain a collection of objects, or records, which in turn have many different fields within them. These records are stored and retrieved using a key that uniquely identifies the record, and is used to find the data within the database. + +Key-value databases differ from the better known relational databases (RDB). RDBs pre-define the data structure in the database as a series of tables containing fields with well-defined data types. Exposing the data types to the database program allows it to apply various optimizations. In contrast, key-value systems treat the value as opaque to the database itself, and typically support only simple operations such as storing, retrieving, updating, and deleting a value by its key.
This offers considerable flexibility and makes such systems well suited to low-latency, high-throughput workloads dominated by direct key lookups, but less suitable for applications that require complex queries or explicit relationships among records. +A lack of standardization, limited transaction support, and relatively simple query interfaces long restricted many key-value systems to specialized uses, but the rapid move to cloud computing after 2010 helped drive renewed interest in them as part of the broader NoSQL movement. Some graph databases, such as ArangoDB, are also key–value databases internally, adding the concept of relationships (pointers) between records as a first-class data type. + + +== Types and examples == +Key–value systems span a wide consistency spectrum, from eventually consistent designs to strongly consistent or serializable ones, and some allow the consistency level to be configured as part of the trade-off against latency and availability. Renewed interest in key–value and other NoSQL systems was driven in part by the demands of big data, distributed, and cloud applications. Their scalability and availability made them attractive for cloud data management, although limited transaction support, low-level query interfaces, and the lack of standardization remained obstacles to wider adoption. Some maintain data in memory (RAM), while others employ solid-state drives or rotating disks. +Some key–value systems add additional structure to their keys. For example, Oracle NoSQL Database organizes records using composite keys with "major" and "minor" components, an arrangement that Oracle compares to a directory-path structure in a file system. More generally, however, key–value stores are defined by their use of unique keys associated with opaque values and by their emphasis on simple key-based operations. 
+Unix included dbm (database manager), a minimal database library written by Ken Thompson for managing associative arrays with a single key and hash-based access. Later implementations and related libraries included sdbm, GNU dbm (gdbm), and Berkeley DB. +A more recent example is RocksDB, a persistent key–value storage engine developed at Facebook and designed for large-scale applications. Other examples include in-memory systems such as Memcached and Redis, and persistent systems such as Berkeley DB, Riak, and Voldemort. + + +== See also == +Data analysis +Document-oriented database +Multi-model database +Tuple space +Ordered key–value store +Name–value pair + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/L1-norm_principal_component_analysis-0.md b/data/en.wikipedia.org/wiki/L1-norm_principal_component_analysis-0.md new file mode 100644 index 000000000..895ee0fda --- /dev/null +++ b/data/en.wikipedia.org/wiki/L1-norm_principal_component_analysis-0.md @@ -0,0 +1,682 @@ +--- +title: "L1-norm principal component analysis" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/L1-norm_principal_component_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:19.890827+00:00" +instance: "kb-cron" +--- + +L1-norm principal component analysis (L1-PCA) is a general method for multivariate data analysis. +L1-PCA is often preferred over standard L2-norm principal component analysis (PCA) when the analyzed data may contain outliers (faulty values or corruptions), as it is believed to be robust. +Both L1-PCA and standard PCA seek a collection of orthogonal directions (principal components) that define a subspace wherein data representation is maximized according to the selected criterion. 
+Standard PCA quantifies data representation as the aggregate of the L2-norm of the data point projections into the subspace, or equivalently the aggregate Euclidean distance of the original points from their subspace-projected representations. +L1-PCA uses instead the aggregate of the L1-norm of the data point projections into the subspace. In PCA and L1-PCA, the number of principal components (PCs) is lower than the rank of the analyzed matrix, which coincides with the dimensionality of the space defined by the original data points. +Therefore, PCA or L1-PCA are commonly employed for dimensionality reduction for the purpose of data denoising or compression. +Among the advantages of standard PCA that contributed to its high popularity are low-cost computational implementation by means of singular-value decomposition (SVD) and statistical optimality when the data set is generated by a true multivariate normal data source. +However, in modern big data sets, data often include corrupted, faulty points, commonly referred to as outliers. +Standard PCA is known to be sensitive to outliers, even when they appear as a small fraction of the processed data. +The reason is that the L2-norm formulation of L2-PCA places squared emphasis on the magnitude of each coordinate of each data point, ultimately overemphasizing peripheral points, such as outliers. +On the other hand, following an L1-norm formulation, L1-PCA places linear emphasis on the coordinates of each data point, effectively restraining outliers. + +== Formulation == +Consider any matrix $\mathbf{X} = [\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}] \in \mathbb{R}^{D \times N}$ consisting of $N$ $D$-dimensional data points.
Define $r = \operatorname{rank}(\mathbf{X})$. For integer $K$ such that $1 \leq K < r$ […] $K > 1$ components. Another approximate efficient solver was proposed by McCoy and Tropp by means of semi-definite programming (SDP). Most recently, L1-PCA (and BNM in (5)) were solved efficiently by means of bit-flipping iterations (L1-BF algorithm). + +==== L1-BF algorithm ====
+function L1BF($\mathbf{X}$, $K$):
+    Initialize $\mathbf{B}^{(0)} \in \{\pm 1\}^{N \times K}$ and $\mathcal{L} \leftarrow \{1, 2, \ldots, NK\}$
+    Set $t \leftarrow 0$ and $\omega \leftarrow \|\mathbf{X}\mathbf{B}^{(0)}\|_{*}$
+    Until termination (or $T$ iterations)
+        $\mathbf{B} \leftarrow \mathbf{B}^{(t)}$, $t' \leftarrow t$
+        For $x \in \mathcal{L}$
+            $k \leftarrow \lceil \frac{x}{N} \rceil$, $n \leftarrow x - N(k-1)$
+            $[\mathbf{B}]_{n,k} \leftarrow -[\mathbf{B}]_{n,k}$ // flip bit
+            $a(n,k) \leftarrow \|\mathbf{X}\mathbf{B}\|_{*}$ // calculated by SVD or faster (see)
+            if $a(n,k) > \omega$
+                $\mathbf{B}^{(t)} \leftarrow \mathbf{B}$, $t' \leftarrow t+1$
+                $\omega \leftarrow a(n,k)$
+            end
+        if $t' = t$ // no bit was flipped
+            if $\mathcal{L} = \{1, 2, \ldots, NK\}$
+                terminate
+            else
+                $\mathcal{L} \leftarrow \{1, 2, \ldots, NK\}$
+The computational cost of L1-BF is $\mathcal{O}(ND\min\{N,D\} + N^{2}K^{2}(K^{2}+r))$. + +== Complex data == +L1-PCA has also been generalized to process complex data. For complex L1-PCA, two efficient algorithms were proposed in 2018. + +== Tensor data == +L1-PCA has also been extended for the analysis of tensor data, in the form of L1-Tucker, the L1-norm robust analogue of standard Tucker decomposition. Two algorithms for the solution of L1-Tucker are L1-HOSVD and L1-HOOI. + +== Code == +MATLAB code for L1-PCA is available at MathWorks.
+ +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Lehman_Review-0.md b/data/en.wikipedia.org/wiki/Lehman_Review-0.md new file mode 100644 index 000000000..7b1e55725 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Lehman_Review-0.md @@ -0,0 +1,41 @@ +--- +title: "Lehman Review" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Lehman_Review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:00.564964+00:00" +instance: "kb-cron" +--- + +A Lehman Review is an independent peer review and evaluation of the status of a major construction project in the United States Department of Energy's (DOE's) Office of Science. Lehman Reviews evaluate all aspects of a construction project's current status, including technical aspects, cost, schedule and management, and they are usually held twice a year. Lehman Reviews are widely known in DOE, other agencies and abroad. The reviews are named after Daniel Lehman, Director of the DOE Office of Science's Office of Project Assessment since 1991. + + +== History == +The Office of Project Assessment within the DOE's Office of Science provides independent advice relating to those activities essential to the construction and operation of major research facilities, as well as professional management and staff support regarding these functions. Within this mission, the office conducts independent technical, cost, schedule and management peer reviews of major construction projects, which include civil construction projects and the construction of large experimental equipment. While the Office of Science is exempt from DOE Order 413.3B, Program and Project Management for the Acquisition of Capital Assets for independent project and peer review, a Lehman Review will typically meet the requirements of the order. During a project's proposal and construction phase, Lehman Reviews are generally held twice a year, although a review may be held by request. 
+Lehman Reviews typically include dozens of independent technical experts. These experts may be divided into six or more subpanels to address all aspects of a project. A Lehman Review can potentially result in modifications to a project's cost, scope or schedule, and could also prompt work stoppage and management changes. +Some major projects that have been subject to Lehman Reviews within the DOE complex include the ongoing construction of the Facility for Rare Isotope Beams; construction of the Linac Coherent Light Source (LCLS) and subsequent construction of its upgrade at the SLAC National Accelerator Laboratory; the 12 GeV Upgrade of the Continuous Electron Beam Accelerator Facility at the Thomas Jefferson National Accelerator Facility; and construction of the User Support Building at the Advanced Light Source at Lawrence Berkeley National Laboratory. Other projects that are not, or not exclusively, DOE projects and that have conducted reviews in the style of a Lehman Review include the construction project for the Fermi Gamma-ray Space Telescope, ITER, and InterAction Collaboration Peer Reviews (conducted for CERN, TRIUMF and the United Kingdom's Science & Technology Facilities Council). +Before Lehman Reviews, peer reviews of major construction projects within DOE's Office of Science were called "Temple Reviews," named after Ed Temple, Director of the Construction Management Support Division in the Office of Energy Research (January 1980 - December 1990). + + +== See also == +Peer review +Independent review +Project management +United States Department of Energy +Office of Science + + +== References == + + +== Further reading == +Congressman Foster honors Daniel Lehman +Moving toward 20 petaflops at ORNL Frank Munger's Atomic City Underground on Knoxnews.com +Successful Project Management Practices in Office of Science PMWG EFCOG Meeting presentation, December 2013, Daniel R. 
Lehman +DOE National Laboratories Improvement Council’s White Paper on Management of Major Research Facility Construction Projects + + +== External links == +Project Management Working Group +DOE's Office of Science Office of Project Assessment \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Lesion_network_mapping-0.md b/data/en.wikipedia.org/wiki/Lesion_network_mapping-0.md new file mode 100644 index 000000000..6e54b1fb5 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Lesion_network_mapping-0.md @@ -0,0 +1,17 @@ +--- +title: "Lesion network mapping" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Lesion_network_mapping" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:21.110794+00:00" +instance: "kb-cron" +--- + +Lesion network mapping is a neuroimaging technique that analyzes the connectivity pattern of brain lesions to identify neuroanatomic correlates of symptoms. The scientific validity of these methods has been widely disputed. The technique was developed by Aaron Boes to understand the network anatomy of lesion-induced neurologic and psychiatric symptoms that cannot be explained by focal anatomic localization. Lesion network mapping applies a network-based approach to identify connected brain networks, rather than focal brain regions, that correlate with a specific symptom. +In focal neuroanatomic localization, developed by Paul Broca and others, specific symptoms that occur due to brain lesions can be understood by identifying a specific brain region that is injured by lesions to establish brain-symptom relationships. However, a number of neurologic symptoms, such as peduncular hallucinosis, are not amenable to this approach since the lesions associated with the symptom do not map to one focal brain location. 
Lesion network mapping helps to explain these lesion-induced syndromes by showing that lesion locations associated with a given symptom all map to a shared brain network even if they do not all map to a focal brain region. The technique maps the location of lesions associated with a specific symptom and analyzes the connectivity pattern of the lesions compared to large, standardized human brain atlases. While initially developed using resting-state fMRI datasets such as the Human Connectome Project, the technique has been expanded to include large structural network atlases and multimodal-connectome datasets. Software tools that facilitate lesion network mapping exist within the Lead-DBS framework, which is also used for a related technique, DBS network mapping. +Lesion network mapping has helped map the network anatomy of numerous rare neurologic syndromes (peduncular hallucinosis, delusional misidentification, reduplicative paramnesia, akinetic mutism, blindsight, visual anosognosia), common neurologic syndromes (seizures, aphasia, amnesia, parkinsonism, topographical disorientation), psychiatric syndromes (depression, mania), as well as complex human behaviors (spirituality, religious fundamentalism, consciousness, free will, criminality, addiction). The technique has been successfully applied to a broad range of diseases and lesion types including lesions due to stroke, traumatic brain injury, tuberous sclerosis and multiple sclerosis. The technique has been broadened to map the connectivity of locations from transcranial magnetic stimulation and deep brain stimulation sites to understand treatment responsiveness. +Research findings based on lesion network mapping have been reported in the New York Times, Scientific American and USA Today and the term has been included in the New England Journal of Medicine's general medical glossary. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/List_of_metascience_research_centers_and_organisations-0.md b/data/en.wikipedia.org/wiki/List_of_metascience_research_centers_and_organisations-0.md new file mode 100644 index 000000000..f71654f71 --- /dev/null +++ b/data/en.wikipedia.org/wiki/List_of_metascience_research_centers_and_organisations-0.md @@ -0,0 +1,26 @@ +--- +title: "List of metascience research centers and organisations" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/List_of_metascience_research_centers_and_organisations" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:39.032836+00:00" +instance: "kb-cron" +--- + +Association for Interdisciplinary Meta-Research & Open Science (AIMOS) +Brazilian Metascience Research Group (BMRG) +Centre for Journalology +Center for Open Science +Center for Scientific Integrity +Cochrane Collaboration +European Network for Innovation and Knowledge +Evidence-Based Research Network (EBRNetwork) +Interdisciplinary Meta-Research Group at The University of Melbourne, Australia +Meta-Research Center at Tilburg University +Meta-Research Innovation Center at Stanford (METRICS) +NYU Meta-Research Collaborative +Projet MiRoR Methods in Research on Research +QUEST Center for Responsible Research of the Berlin Institute of Health +UCMeta at the University of Canterbury +Severance Underwood Meta-Research Center at Yonsei University \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Mathematical_Biosciences_Institute-0.md b/data/en.wikipedia.org/wiki/Mathematical_Biosciences_Institute-0.md new file mode 100644 index 000000000..cdf82e1b5 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Mathematical_Biosciences_Institute-0.md @@ -0,0 +1,53 @@ +--- +title: "Mathematical Biosciences Institute" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Mathematical_Biosciences_Institute" +category: "reference" +tags: "science, encyclopedia" 
+date_saved: "2026-05-05T09:54:22.360776+00:00" +instance: "kb-cron" +--- + +The Mathematical Biosciences Institute (MBI) was an institution of higher learning affiliated with the Ohio State University in Columbus, Ohio. MBI received major funding from the National Science Foundation. + + +== History == +Under the leadership of founding director Avner Friedman, MBI opened in September 2002, holding its first workshop, hosting its first visiting researchers, and starting its first cohort of postdocs in that month. MBI holds 10–12 scientific workshops each year, and hosts about 25 postdoctoral and visiting researchers in residence at any given time. Through its collective events and programs, MBI draws over 1000 visits by researchers in the broadly defined area of mathematical biology throughout the year. MBI's long term planning is overseen by its Directorate and its Board of Trustees, while its scientific activities are overseen by its Directorate and its Scientific Advisory Committee. + + +== MBI programs == + + +=== Workshops === +MBI organizes Emphasis Semesters consisting of three or four week-long workshops. Emphasis Semesters are organized around selected themes which have included Mathematical Neuroscience, Cancer and Its Environment, and Analysis of Complex Data in Biological Systems. Outside of Emphasis Semesters, Current Topic Workshops focus on emerging topics in the mathematical biosciences. Most of the institute's programs are conducted on The Ohio State University campus, but MBI also sponsors conferences and workshops at its academic Institute Partners. MBI accepts proposals for future programs. + + +=== Postdoctoral fellows === +MBI postdoctoral fellows engage in an integrated program of tutorials, working seminars, workshops, and interactions with their mathematical and bioscience mentors. 
These activities are geared toward providing the tools to pursue an independent research program with an emphasis on collaborative research in the mathematical biosciences. + + +=== Long-term visitors === +MBI has a program of support for visitors to spend an extended period of time in residence. During their time at the institute, which can range from a few weeks to many months, visitors can focus on their research while benefiting from participation in MBI workshops and seminars and collaborating with others in the MBI community. Visitors also participate in the Visiting Lecturer Program through which they can share their research. + + +=== Early Career Awards === +Early Career Awards are aimed at non-tenured scientists who have continuing employment and who hold a doctorate in any of the mathematical, statistical and computational sciences, or in any of the biological, medical, and related sciences. Award winners are supported to spend a period of time in residence at MBI. + + +=== Education programs === +The MBI Summer Undergraduate Research Program aims to give outstanding undergraduate students the opportunity to conduct meaningful research in the mathematical biosciences. MBI works with partner institutions to facilitate an eight-week Research Experience for Undergraduates (REU), supported by the National Science Foundation. Students also participate in a mathematical biology bootcamp at MBI and present their completed research at the Undergraduate Capstone Conference. +MBI co-sponsors a rotating annual summer school with the National Institute for Mathematical and Biological Synthesis (NIMBioS) and Centre for Applied Mathematics in Bioscience and Medicine (CAMBAM). The school brings together graduate students in mathematics, biology, and related fields to engage in a focused course of study on a current topic in mathematical biology. 
+ + +=== Outreach === +Past programs include the workshop for Women Advancing Mathematical Biology, Workshop for Young Researchers in Mathematical Biology, Blackwell Tapia Conference, and the Science Sundays lecture series. + + +=== Institute Partners === +The MBI Institute Partner (IP) program promotes the involvement of the international math biosciences community in MBI programs. Institute Partners receive direct benefits and opportunities enabling them to support, guide and participate in MBI research and education programs. + + +== List of directors == + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Meta-Research_Center_at_Tilburg_University-0.md b/data/en.wikipedia.org/wiki/Meta-Research_Center_at_Tilburg_University-0.md index bff0e93c2..01afa61ae 100644 --- a/data/en.wikipedia.org/wiki/Meta-Research_Center_at_Tilburg_University-0.md +++ b/data/en.wikipedia.org/wiki/Meta-Research_Center_at_Tilburg_University-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Meta-Research_Center_at_Tilburg_University" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:27:57.938032+00:00" +date_saved: "2026-05-05T09:52:42.734220+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Meta-Research_Innovation_Center_at_Stanford-0.md b/data/en.wikipedia.org/wiki/Meta-Research_Innovation_Center_at_Stanford-0.md new file mode 100644 index 000000000..fde688503 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Meta-Research_Innovation_Center_at_Stanford-0.md @@ -0,0 +1,25 @@ +--- +title: "Meta-Research Innovation Center at Stanford" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Meta-Research_Innovation_Center_at_Stanford" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:43.947789+00:00" +instance: "kb-cron" +--- + +The Meta-Research Innovation Center at Stanford (METRICS) is a research center within the Stanford School of Medicine that aims 
to improve reproducibility by studying how science is practiced and published and developing better ways for the scientific community to operate. It is headed by John Ioannidis and Steven Goodman. Laura and John Arnold Foundation provided the initial founding of the center, which launched in 2014. Ioannidis' past work that led to the creation of METRICS has been covered twice in The New York Times. + + +== See also == +Cochrane Collaboration +List of metascience research centers and organisations +Meta-research +Scientometrics + + +== References == + + +== External links == +Official website \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Meter_data_management-0.md b/data/en.wikipedia.org/wiki/Meter_data_management-0.md new file mode 100644 index 000000000..b47900a7f --- /dev/null +++ b/data/en.wikipedia.org/wiki/Meter_data_management-0.md @@ -0,0 +1,45 @@ +--- +title: "Meter data management" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Meter_data_management" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:23.521041+00:00" +instance: "kb-cron" +--- + +Meter data management (MDM) refers to software that performs long-term data storage and management for the vast quantities of data delivered by smart metering systems. This data consists primarily of usage data and events that are imported from the head-end servers managing the data collection in advanced metering infrastructure (AMI) or automatic meter reading (AMR) systems. MDM is a component in the smart grid infrastructure promoted by utility companies. This may also incorporate meter data analytics, the analysis of data emitted by electric smart meters that record consumption of electric energy. + + +== MDM systems == +An MDM system will typically import the data, then validate, cleanse and process it before making it available for billing and analysis. 
+Products for meter data include: + +Smart meter deployment planning and management; +Meter and network asset monitoring and management; +Automated smart meter provisioning (i.e. addition, deletion and updating of meter information at utility and AMR side) and billing cutover; +Meter-to-cash system, workforce management system, asset management and other systems. +Furthermore, an MDM may provide reporting capabilities for load and demand forecasting, management reports, and customer service metrics. +An MDM provides application programming interfaces (APIs) between the MDM and the multiple destinations that rely on meter data. This is the first step to ensure that consistent processes and 'understanding' get applied to the data. Besides this common functionality, an advanced MDM may provide facilities for remote connect/disconnect of meters, power status verification/power restoration verification and on-demand reads of remote meters. + + +== Data analysis == +Smart meters send usage data to the central head end systems as often as every minute from each meter, whether installed at a residential, commercial, or industrial customer. Utility companies sometimes analyze this voluminous data as well as collect it. Some of the reasons for analysis are + +making efficient energy buying decisions based on usage patterns, +launching energy efficiency or energy rebate programs, +detecting energy theft, +comparing and correcting metering service provider performance, and +detecting and reducing unbilled energy. +This data not only helps utility companies make their businesses more efficient, but also helps consumers save money by using less energy at peak times. So, it is both economical and green. Smart meter infrastructure is fairly new to the utilities industry. As utility companies collect more and more data over the years, they may uncover further uses for this detailed smart meter data. Similar analysis can be applied to water and gas as well as electric usage. 
+According to a 2012 web posting, data that is required for complete meter data analytics may not reside in the same database. Instead, it might reside in disparate databases among various departments of utility companies. + + +== See also == +Automatic meter reading +Advanced metering infrastructure +Smart grid +Smart meter + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Multi-model_database-0.md b/data/en.wikipedia.org/wiki/Multi-model_database-0.md new file mode 100644 index 000000000..0131907c0 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Multi-model_database-0.md @@ -0,0 +1,62 @@ +--- +title: "Multi-model database" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Multi-model_database" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:24.674447+00:00" +instance: "kb-cron" +--- + +In the field of database design, a multi-model database is a database management system designed to support multiple data models against a single, integrated backend. In contrast, most database management systems are organized around a single data model that determines how data can be organized, stored, and manipulated. Document, graph, relational, and key–value models are examples of data models that may be supported by a multi-model database. + + +== Background == +The relational data model became popular after its publication by Edgar F. Codd in 1970. Due to increasing requirements for horizontal scalability and fault tolerance, NoSQL databases became prominent after 2009. NoSQL databases use a variety of data models, with document, graph, and key–value models being popular. +A multi-model database is a database that can store, index and query data in more than one model. For some time, databases have primarily supported only one model, such as: relational database, document-oriented database, graph database or triplestore. A database that combines many of these is multi-model. 
This should not be confused with multimodal database systems such as Pixeltable or ApertureDB, which focus on unified management of different media types (images, video, audio, text) rather than different data models. +For some time, it was all but forgotten (or considered irrelevant) that there were any other database models besides relational. The relational model and notion of third normal form were the default standard for all data storage. However, prior to the dominance of relational data modeling, from about 1980 to 2005, the hierarchical database model was commonly used. Since 2000 or 2010, many non-relational NoSQL models, including documents, triples, key–value stores and graphs, have become popular. Arguably, geospatial data, temporal data, and text data are also separate models, though indexed, queryable text data is generally termed a "search engine" rather than a database. +The term "multi-model" was first associated with databases on May 30, 2012, in Cologne, Germany, during Luca Garulli's keynote "NoSQL Adoption – What’s the Next Step?". Luca Garulli envisioned the evolution of the 1st generation NoSQL products into new products with more features able to be used by multiple use cases. +The idea of multi-model databases can be traced back to Object–Relational Data Management Systems (ORDBMS) in the early 1990s and, in a broader scope, even to federated and integrated DBMSs in the early 1980s. An ORDBMS system manages different types of data such as relational, object, text and spatial by plugging domain-specific data types, functions and index implementations into the DBMS kernels. A multi-model database is most directly a response to the "polyglot persistence" approach of knitting together multiple database products, each handling a different model, to achieve a multi-model capability as described by Martin Fowler. 
This strategy has two major disadvantages: it leads to a significant increase in operational complexity, and there is no support for maintaining data consistency across the separate data stores. Multi-model databases have begun to fill this gap. +Multi-model databases are intended to offer the data modeling advantages of polyglot persistence, without its disadvantages. Operational complexity, in particular, is reduced through the use of a single data store. + + +== Benchmarking multi-model databases == +As more platforms are proposed to deal with multi-model data, a few works have addressed benchmarking multi-model databases. For instance, Pluciennik, Oliveira, and UniBench reviewed existing multi-model databases and evaluated them against other SQL and NoSQL databases. They pointed out that the advantages of multi-model databases over single-model databases are as follows: + + +== Architecture == +The main difference between the available multi-model databases is related to their architectures. Multi-model databases can support different models either within the engine or via different layers on top of the engine. Some products may provide an engine which supports documents and graphs while others provide layers on top of a key-key store. With a layered architecture, each data model is provided via its own component. + + +== User-defined data models == +In addition to offering multiple data models in a single data store, some databases allow developers to easily define custom data models. This capability is enabled by ACID transactions with high performance and scalability. In order for a custom data model to support concurrent updates, the database must be able to synchronize updates across multiple keys. ACID transactions, if they are sufficiently performant, allow such synchronization. 
JSON documents, graphs, and relational tables can all be implemented in a manner that inherits the horizontal scalability and fault-tolerance of the underlying data store. + + +== Theoretical Foundation for Multi-Model Databases == +The traditional theory of relations is not enough to accurately describe multi-model database systems. Recent research is focused on developing a new theoretical foundation for these systems. Category theory can provide a unified, rigorous language for modeling, integrating, and transforming different data models. By representing multi-model data as sets and their relationships as functions or relations within the Set category, we can create a formal framework to describe, manipulate, and understand various data models and how they interact. + + +== See also == +ACID +Big data +NoSQL +Comparison of structured storage software +Database transaction +Data analysis +Distributed database +Distributed SQL +Distributed transaction +Document-oriented database +Graph database +Relational model + + +== References == + + +== External links == +Polyglot Persistence +The 451 Group, "Neither Fish Nor Fowl: The Rise of Multi-Model Databases" +ODBMS, "On Multi-Model Databases. Interview with Martin Schönert and Frank Celler." +ODBMS, "Polyglot Persistence or Multiple Data Models?" 
+Infoworld, "The Rise of the Multi-Model Database" \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Multivariate_logistic_regression-0.md b/data/en.wikipedia.org/wiki/Multivariate_logistic_regression-0.md new file mode 100644 index 000000000..f4f9228d3 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Multivariate_logistic_regression-0.md @@ -0,0 +1,106 @@ +--- +title: "Multivariate logistic regression" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Multivariate_logistic_regression" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:25.883559+00:00" +instance: "kb-cron" +--- + +Multivariate logistic regression is a type of data analysis that predicts any number of outcomes based on multiple independent variables. It is based on the assumption that the natural logarithm of the odds has a linear relationship with independent variables. + + +== Procedure == +First, the baseline odds of a specific outcome compared to not having that outcome are calculated, giving a constant (intercept). Next, the independent variables are incorporated into the model, giving a regression coefficient (beta) and a "P" value for each independent variable. The "P" value determines how significantly the independent variable impacts the odds of having the outcome or not. +It is desirable to use as few variables as necessary, and to have at least 10–20 times as many observations as independent variables. + + +=== Formula === +Multivariate logistic regression uses a formula similar to univariate logistic regression, but with multiple independent variables. + +where v is the number of independent variables. The following formula shows that multivariate logistic regression is simply a standard linear regression model: + + +== Types == +The two main types of multivariate logistic regression are linear regression and logistic regression. 
+ + +=== Linear regression === +Linear regression produces results that show a linear relationship with a single independent variable (IV) and can be plotted on a graph as a straight line. + + +=== Logistic regression === +In contrast, logistic regression produces results that show a nonlinear relationship. As a result, plotting the data on a graph produces a curved line called a sigmoid. Unlike linear regression, logistic regression produces results based on two or more independent variables. +The odds ratio associated with a single independent variable can change when other independent variables are accounted for as well. The changes are usually insignificant, but they can indicate errors. + + +==== Assumptions ==== +Multivariate logistic regression assumes that the different observations are independent. It also assumes that the natural logarithm of the odds and the independent variables show a linear relationship. However, it does not assume a normal distribution of the dependent variables. + + +===== Null hypothesis ===== +A null hypothesis is an assumption that the independent variables do not have any impact on the dependent variable. + + +==== Dependent variables ==== +There are three main types of logistic regression dependent variables (DVs): Binary, multi-class, and ordinal. + + +===== Binary ===== +A binary dependent variable is a variable with only two outcomes, and the possible values must be opposites of each other. + + +===== Multi-class ===== +A multi-class dependent variable is a variable with at least three qualitative (non-numerical) outcomes, usually with a constant numerical stand-in. + + +===== Ordinal ===== +An ordinal dependent variable is a variable with at least three possible outcomes, which are numerically different. + + +== Models == +Multivariate logistic regression produces the following models: + + +=== Logit models === +Logit models distinguish between independent and dependent variables. 
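The relationship a logit model relies on, namely that the natural logarithm of the odds is an intercept plus a weighted sum of the independent variables, can be sketched in a few lines of Python. The function names `predicted_probability` and `odds_ratio` and the sample coefficients are illustrative assumptions, not values from any fitted model.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Probability of the outcome for one observation under a logit model."""
    # Log-odds: intercept plus one coefficient (beta) per independent variable.
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    # The logistic (sigmoid) function maps log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a variable."""
    return math.exp(coefficient)


# Two independent variables whose effects cancel out: the log-odds are 0,
# so the predicted probability is exactly one half.
p = predicted_probability(0.0, [1.0, -1.0], [2.0, 2.0])
```

Exponentiating a coefficient gives the odds ratio for that variable, which is why a coefficient of zero corresponds to an odds ratio of one, that is, no effect on the odds.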
+ + +=== Log-linear models === +Unlike logit models, log-linear models do not distinguish between categories of variables. + + +=== Probit models === +Probit models function similarly to logit models due to the similarities of normal and logistic distributions. However, since the independent variables are interpreted as standard deviations instead of odds ratios, these models are also more similar to linear models than logit models. + + +== Uses == + + +=== Scientists === +When scientists use logistic regression, they usually include as many independent variables as necessary. + + +=== Doctors and physicians === +Multivariate logistic regression is used by physicians to: + +associate certain characteristics with certain outcomes +determine the effects of certain techniques +give people with certain conditions appropriate treatments +develop appropriate models + + +=== Market === +Multivariate logistic regression is also used to analyze customer preferences for products. + + +== Artificial intelligence == +Multivariate logistic regressions are also used in machine learning. + + +== In comparison to multivariable logistic regression == +While both multivariate logistic regression and multivariable logistic regression correlate multiple independent variables to outcomes, multivariate logistic regression correlates independent variables to multiple outcomes, while multivariable logistic regression correlates independent variables to a single outcome. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Multiway_data_analysis-0.md b/data/en.wikipedia.org/wiki/Multiway_data_analysis-0.md new file mode 100644 index 000000000..72621bd00 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Multiway_data_analysis-0.md @@ -0,0 +1,300 @@ +--- +title: "Multiway data analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Multiway_data_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:27.058609+00:00" +instance: "kb-cron" +--- + +Multiway data analysis is a method of analyzing large data sets by representing a collection of observations as a multiway array, {\displaystyle {\mathcal {A}}\in {\mathbb {C} }^{I_{0}\times I_{1}\times \dots I_{c}\times \dots I_{C}}}. The proper choice of data organization into a (C+1)-way array, and of analysis techniques, can reveal patterns in the underlying data undetected by other methods. + + +== History == +The study of multiway data analysis was first formalized as the result of a conference held in 1988. The result of this conference was the first text specifically addressed to this field, Coppi and Bolasco's Multiway Data Analysis. At that time, the application areas for multiway analysis included statistics, econometrics and psychometrics. In recent years, applications have expanded to include chemometrics, agriculture, social network analysis and the food industry. + + +== Composition of multiway data analysis == + + +=== Multiway data === +Multiway data analysts use the term way to refer to the number of sources of data variation while reserving the word mode for the methods or models used to analyze the data. 
+In this sense, the various ways of data can be defined as follows: + +One-way data: A data point with {\displaystyle I_{0}} dimensions, {\displaystyle {\bf {a}}\in {\mathbb {C} }^{I_{0}}}, is a vector that is stored in a one-way array data structure. +Two-way data: A collection of {\displaystyle I_{1}} data points {\displaystyle {\bf {a}}\in {\mathbb {C} }^{I_{0}}} is stored in a two-way array, {\displaystyle {\bf {A}}\in {\mathbb {C} }^{I_{0}\times I_{1}}}. A spreadsheet can be used to visualize such data in the case of discrete dimensions. +Three-way data: A collection of data {\displaystyle {\bf {a}}\in {\mathbb {C} }^{I_{0}}} that has two modes of variation is stored in a three-way array, {\displaystyle {\bf {A}}\in {\mathbb {C} }^{I_{0}\times I_{1}\times I_{2}}}. Such data might represent the temperature at different locations (two-way data) sampled over different times (leading to three-way data). +Four-way data, using the same spreadsheet analogy, can be represented as a file folder full of separate workbooks. +Five-way data and six-way data can be represented by similarly higher levels of data aggregation. +In general, multiway data are stored in a multiway array and may be measured at different times, or in different places, using different methodologies, and may contain inconsistencies such as missing data or discrepancies in data representation. 
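The progression from one-way to three-way data can be illustrated with nested Python lists. This is a minimal sketch: the helper `shape`, the grid sizes, and the temperature values are made up for illustration, and the helper assumes a non-empty, rectangular nested list.

```python
def shape(array):
    """Return the extent of each way of a non-empty rectangular nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]  # descend into the next way
    return tuple(dims)


# Three-way data: temperatures on a 2 x 3 grid of locations at 4 time steps.
three_way = [[[20.0 + i + j + t for t in range(4)]
              for j in range(3)]
             for i in range(2)]

# Two-way data: the same grid at the first time step only.
two_way = [[cell[0] for cell in row] for row in three_way]

# One-way data: a single row of locations from that grid.
one_way = two_way[0]
```

Fixing an index along one way of an array yields an array with one fewer way, which is how the two-way and one-way examples are obtained from the three-way one here.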
+ + +=== Multiway model === + + +=== Multiway application === +Multiway data analysis can be employed in various multiway applications to address the problem of finding hidden multilinear structure in multiway datasets. The following are examples of applications in different fields: + +Computer vision - TensorFaces and Human motion signatures analyze facial images and human joint angle data organized in a multiway array. Multiway data analysis is employed to compute a set of causal factor representations. +Electroanalytical chemistry +Neuroscience +Process analysis +Social network analysis/web-mining + + +=== Multiway processing === +Multiway processing is the execution of designed and determined multiway model(s), transforming multiway data to the desired level to address the specific need of a particular multiway application. A typical example of data generated with a potentiometric electronic tongue illustrates relevant multiway processing. + + +== See also == +Multilinear subspace learning + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Narrative_intelligence-0.md b/data/en.wikipedia.org/wiki/Narrative_intelligence-0.md new file mode 100644 index 000000000..22d03e4bc --- /dev/null +++ b/data/en.wikipedia.org/wiki/Narrative_intelligence-0.md @@ -0,0 +1,83 @@ +--- +title: "Narrative intelligence" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Narrative_intelligence" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:28.284976+00:00" +instance: "kb-cron" +--- + +Narrative Intelligence refers to the capability of identifying, understanding and responding to narrative attacks that cause financial, reputational, operational or physical harm. It is studied at the intersection of artificial intelligence (AI), cybersecurity, linguistics, psychology, and cognitive science.
The term is used to describe how computational systems and humans structure, interpret, and apply stories in communication, reasoning, and decision-making. + + +== Definition and scope == +Narrative intelligence encompasses three primary areas: story identification, story understanding, and narrative response. Story identification refers to the capacity to identify harmful narratives by their structural components, including characters, events, settings, goals, and causal relationships. In computational research, this involves extracting structured representations from natural language texts in order to model the underlying narrative structure. Story understanding concerns knowing the context of coherent and contextually consistent narratives by computational systems. Approaches to story understanding include rule-based systems, planning-based models, and AI-based learning techniques. These systems aim to produce narratives that maintain logical consistency across events and adhere to temporal and causal constraints. Narrative response involves drawing inferences from narrative content and then making decisions on how to respond to them. This includes modeling causal and temporal relationships, predicting possible outcomes, and supporting decision-making processes based on narrative structures. Research on narrative intelligence is connected to work in natural language processing, knowledge representation, cognitive modeling, and human–computer interaction. + + +== Historical development == +Research on computational narrative understanding began in the 1970s. In 1975, Roger Schank and Robert P. Abelson published Scripts, Plans, Goals, and Understanding, introducing script theory as a model for how humans understand stereotyped event sequences. In 1977, Gerald Prince published A Grammar of Stories, contributing to formal narrative structure analysis, though outside AI. 
+In 1981, Robert Wilensky published Story Understanding as Problem Solving, presenting computational approaches to narrative comprehension. Throughout the 1980s and 1990s, AI research explored case-based and story-based reasoning systems, linking narrative to memory and problem-solving. +The term "Narrative Intelligence" gained visibility in 1999 with the Association for the Advancement of Artificial Intelligence Fall Symposium titled Narrative Intelligence, edited by Michael Mateas and Phoebe Sengers. The symposium proceedings, Narrative Intelligence: Papers from the 1999 AAAI Fall Symposium, brought together researchers from AI and cognitive science to formalize the area. +In 2003, Michael Mateas and Andrew Stern published "Toward Narrative Intelligence," outlining computational models for interactive drama and story-based systems. +During the 2000s, research expanded to computational narratology and narrative planning. In 2010, Mark Riedl and Vadim Bulitko published "Narrative Planning: Balancing Plot and Character," proposing planning-based models that integrate character goals with plot structure. In 2012, Mark Finlayson published Computational Models of Narrative, summarizing formal approaches to narrative representation. +From 2018 onward, neural network–based approaches to story generation became prominent. In 2020, OpenAI released GPT-3, demonstrating large-scale language models capable of generating extended narrative text. In the same period, work such as "Neural Story Generation" by Hannah Rashkin and collaborators applied deep learning to long-form narrative generation. + + +== Key components == + + +=== Narrative understanding === +Narrative understanding systems attempt to parse narrative texts and represent their structure. This includes identifying entities, temporal order, causal links, and character intentions. Early systems were based on symbolic AI and script theory. Later approaches incorporated statistical NLP and neural models. 
+Academic research in story understanding has been published in venues such as the Journal of Artificial Intelligence Research. Research groups at institutions, including the University of Southern California, have worked on narrative understanding and interactive storytelling systems. +Story-based reasoning (SBR), discussed in the context of the Association for the Advancement of Artificial Intelligence, treats stories as a form of knowledge representation for reasoning and learning. + + +=== Story generation === +Story generation research dates to early rule-based systems in the 1970s and 1980s. Planning-based models in the 2000s formalized narrative as a goal-driven process. By the 2010s, machine learning approaches began to dominate. +Projects such as ROBOTSTORY at Carnegie Mellon University and story generation research at the MIT Media Lab explored computational creativity and automated narrative production. +With the release of GPT-3 in 2020, large language models demonstrated the ability to generate coherent short stories and narrative continuations without task-specific programming, influencing subsequent research in narrative AI. + + +=== Narrative reasoning === +Narrative reasoning involves modeling causal and temporal relationships within stories. It supports inference, prediction, and explanation. Planning-based approaches treat stories as sequences of actions with preconditions and effects. +Research funded by the Defense Advanced Research Projects Agency (DARPA) under programs such as Narrative Networks examined how narratives shape belief formation and decision-making. Computational narrative reasoning has also been discussed in relation to explainable AI and human-centered AI systems. + + +== Applications == + + +=== Entertainment and media === +Narrative Intelligence is applied in interactive storytelling and video games through procedural content generation. 
Research on procedural content generation in games expanded in the 2000s and 2010s, integrating narrative planning with gameplay mechanics. AI-generated scripts and dialogue systems have been incorporated into digital media platforms. + + +=== Education === +Educational technologies use storytelling frameworks to structure lessons and simulations. Adaptive learning systems integrate narrative elements to present personalized scenarios. Research on AI and personalized learning in the 2010s examined how narrative framing affects engagement and comprehension. + + +=== Marketing and advertising === +In the 2010s, data-driven marketing adopted narrative modeling to generate brand stories and targeted content. AI-based content generation systems have been used to produce advertising copy and customer narratives, linking narrative analysis with audience data. + + +=== Healthcare === +Narrative approaches in healthcare draw from narrative medicine, introduced in the early 2000s, and computational systems that analyze patient histories. AI systems use narrative data from electronic health records to support diagnosis and treatment planning. + + +== Challenges and ethical consideration == +Narrative Intelligence raises issues related to bias, fairness, and transparency. Algorithmic bias detection became a major topic in AI ethics discussions in the late 2010s. Narrative systems trained on large datasets may reproduce stereotypes or misleading narratives. +Ethical storytelling in AI includes questions of authorship, misinformation, and accountability. Transparency in narrative generation systems is linked to broader debates on explainable AI and responsible AI governance. + + +== Future directions == +Ongoing research focuses on integrating symbolic reasoning with neural language models to improve coherence and long-term structure in generated stories. 
Advances in large language models after 2020 have increased interest in controllable narrative generation, factual grounding, and multimodal storytelling systems. + + +== Publications == +Black, J; Wilensky, R (1979). "An evaluation of story grammars". Cognitive Science. 3 (3): 213–229. doi:10.1016/S0364-0213(79)80007-5. Retrieved 2026-04-27. +Forschungen zur Rechtsarchäologie und rechtlichen Volkskunde. Bd. 14 (in German). Zürich: Schulthess Polygraph. Verl. 1992. ISBN 978-3-7255-3061-8. +Narrative intelligence (in German). Amsterdam ; Philadelphia, PA: J. Benjamins Pub. 2003. ISBN 978-90-272-5171-8. +Riedl, M. O.; Young, R. M. (2010-09-29). "Narrative Planning: Balancing Plot and Character". Journal of Artificial Intelligence Research. 39: 217–268. doi:10.1613/jair.2989. ISSN 1076-9757. Retrieved 2026-04-27. +Schank, Roger C.; Abelson, Robert P. (2013). Scripts, Plans, Goals, and Understanding: An Inquiry Into Human Knowledge Structures. Hoboken: Taylor and Francis. ISBN 978-1-134-91966-6. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Necessary_condition_analysis-0.md b/data/en.wikipedia.org/wiki/Necessary_condition_analysis-0.md new file mode 100644 index 000000000..cfa5625ab --- /dev/null +++ b/data/en.wikipedia.org/wiki/Necessary_condition_analysis-0.md @@ -0,0 +1,45 @@ +--- +title: "Necessary condition analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Necessary_condition_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:29.420634+00:00" +instance: "kb-cron" +--- + +Necessary condition analysis (NCA) is a research approach and tool employed to discern "necessary conditions" within datasets. These indispensable conditions stand as pivotal determinants of particular outcomes, wherein the absence of such conditions ensures the absence of the intended result. For example, the admission of a student into a Ph.D. 
program necessitates a prior degree; the progression of AIDS necessitates the presence of HIV; and organizational change necessitates communication. +The absence of these conditions guarantees that the outcome cannot occur, and no other condition can compensate for the lack of a necessary condition. Further, necessary conditions are not always sufficient. For example, AIDS necessitates HIV, but HIV does not always cause AIDS. In such instances, the condition demonstrates its necessity but lacks sufficiency. NCA seeks to use statistical methods to test for such conditions. + + +== Overview == +Traditional statistical methods often emphasize the identification of factors that are sufficient to produce an outcome. In contrast, NCA aims to uncover conditions that must be present for a specific outcome to occur. While researchers will sometimes use NCA as a stand-alone analysis, they often use it to add depth to existing analyses of data. For example, NCA acts as a stand-alone method or as a complement to other analytical techniques such as regression-based analysis, structural equation modelling, or qualitative comparative analysis, and derivative methods such as PLS-SEM and fsQCA. Thus, scholars using NCA seek to reveal the necessary boundary conditions of causal conditions indicated by these other analytical techniques. + + +== Methodology == +NCA allows researchers to analyze how predictor variables constrain the outcome variable by revealing which predictor variables are considered to be necessary, and to what degree they constrain the outcome variable. This is done by evaluating the effect size d of each necessary condition, examining the statistical significance of the necessary condition (permutation test), and having theoretical justification for this type of relationship. +Necessary condition analysis follows a step-by-step approach to identify necessary conditions.
The key steps involved in conducting NCA are as follows: + +Formulation of a necessity hypothesis: The first step in NCA is to clearly define the theoretical expectation specifying the condition(s) that may be necessary for the outcome of interest. The outcome could be a specific event, achievement, or other result that researchers want to understand better. +Data collection: Relevant data about the conditions and the outcome are collected as the input to NCA. These data could be obtained through surveys, experiments, observations, or existing datasets, depending on the nature of the research. +Identification of necessary conditions: NCA employs specific techniques to identify necessary conditions. These techniques include (i) selecting ceiling line(s) in an XY plot and evaluating the effect size d; (ii) performing a resampling procedure (permutation test) to examine the statistical significance of the necessary condition; and (iii) examining the bottleneck table to specify the levels of the condition(s) that are necessary for particular levels of the outcome. +Interpretation and validation: Once the necessary conditions are identified, researchers interpret the findings and validate them against existing theories or expert knowledge. This step helps ensure the robustness and reliability of the results. + + +== Applications == +Necessary condition analysis has found applications in a wide range of research areas. Some notable applications include: + +Business and management: NCA is used to identify the essential factors that are necessary for the success of a business, such as effective leadership, customer satisfaction, or employee engagement. +Social sciences: In social sciences, NCA helps researchers understand the crucial conditions for various social phenomena, such as educational attainment, poverty reduction, or political stability.
+Engineering and manufacturing: NCA is employed to identify the minimum requirements for optimal performance or quality in engineering and manufacturing processes. It aids in determining the critical factors that must be met to achieve desired outcomes. + + +== Limitations == +Necessary Condition Analysis (NCA) offers a nuanced perspective on data analysis by identifying conditions that must be present for a desired outcome to occur. However, its utility is bounded by several limitations that users must consider. Primarily, NCA's insights are limited by the quality and scope of the data used. If the data does not capture all relevant variables or is biased, the conclusions drawn about necessary conditions may be incomplete or misleading. +Moreover, NCA does not assert sufficiency; a condition deemed necessary might not be enough on its own to guarantee an outcome, necessitating a combination of conditions or further analysis to understand the full causal landscape. This characteristic means that NCA should be employed as part of a broader analytical strategy rather than a standalone method. It is most effective when used to complement other statistical techniques that explore sufficiency or when a clear hypothesis about necessity exists. +NCA's reliance on statistical significance also means it inherits the general limitations of statistical inference, including potential issues with sample size and the risk of overfitting. Consequently, results need to be interpreted with caution and, where possible, validated through additional empirical work or theoretical justification. +In contexts where identifying the bare minimum conditions for an outcome is critical — such as determining the essential factors for business success, key drivers of social phenomena, or minimum requirements in engineering processes — NCA can be valuable. 
However, its application is less suited to scenarios where the relationships between variables are predominantly sufficiency-based or where the causal dynamics are highly complex and interdependent. +As with other methods, the researcher needs to understand the meaning of the data and draw on their assumptions about why things work the way they do in order to formulate relevant hypotheses and meaningful interpretations. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Negative_testing-0.md b/data/en.wikipedia.org/wiki/Negative_testing-0.md new file mode 100644 index 000000000..da4e59acc --- /dev/null +++ b/data/en.wikipedia.org/wiki/Negative_testing-0.md @@ -0,0 +1,43 @@ +--- +title: "Negative testing" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Negative_testing" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:30.657459+00:00" +instance: "kb-cron" +--- + +Negative testing is a method of testing an application or system to improve the likelihood that an application works as intended/specified and can handle unexpected input and user behavior. Invalid data is inserted to compare the output against the given input. Negative testing is also known as failure testing or error path testing. When performing negative testing, exceptions are expected. This shows that the application is able to handle improper user behavior. Users input values that do not work in the system to test its ability to handle incorrect values or system failure. + + +== Purpose == +The purpose of negative testing is to prevent the application from crashing; it also helps improve the quality of an application by detecting defects. +Negative testing helps improve the test coverage of the application. +Negative testing makes the application more stable and reliable. +Negative testing together with positive testing allows users to test the application with any valid (or invalid) input data.
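As a minimal sketch of the idea described above (invalid input is supplied and an error is the expected result), the following pairs positive and negative checks for one function. The validator `parse_id`, its accepted range of 0 to 255, and the helper `rejects` are hypothetical names chosen for this illustration, not part of any particular testing framework.

```python
def parse_id(text: str) -> int:
    """Hypothetical validator: accept only integer IDs from 0 to 255."""
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"not a number: {text!r}") from None
    if not 0 <= value <= 255:
        raise ValueError(f"out of range: {value}")
    return value


def rejects(text: str) -> bool:
    """Negative check: True if parse_id raises ValueError for this input."""
    try:
        parse_id(text)
    except ValueError:
        return True
    return False


# Positive tests: valid input is accepted.
assert parse_id("0") == 0
assert parse_id("255") == 255

# Negative tests: invalid input must be rejected with the expected error,
# rather than crashing in some other way or being accepted silently.
for bad in ["abc", "", "-1", "256", "12.5"]:
    assert rejects(bad), bad
```

In a real test suite the same checks would more likely be written with a test framework's exception assertions; plain `assert` statements are used here only to keep the sketch self-contained.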
+ + +== Benefits of negative testing == +Negative testing is done to check that the product deals properly with circumstances for which it is not programmed. The fundamental aim of this testing is to check how the system handles bad data and whether appropriate errors are shown to the client when bad data is entered. Both positive and negative testing play an important role. Positive testing ensures that the application does what it is intended to do and performs each function as expected. Negative testing is the opposite of positive testing: it looks for diverse ways to make the application crash and checks that such failures are handled gracefully. +Example + +If there is a text box that can only take numeric values but the user tries to type a letter, the correct behavior would be to display a message such as "(Incorrect data) Please enter a number". +Suppose the user must fill in a name field, the name is mandatory, and the name box should not contain anything other than letters (no numeric values or special characters). Negative test cases could be a name containing numeric values or special characters. The correct behavior of the system would be to not display those invalid characters. + + +== Parameters for writing Negative test cases == +There are two basic techniques that help in writing enough test cases to cover most of the functionality of the system. Both of these techniques are used in positive testing as well. The two parameters are: + +Boundary-value analysis +Boundary indicates a limit to something. In this parameter, test scenarios are designed so that they cover the boundary values and validate how the application behaves at these boundary values. +Example +Suppose an application accepts IDs ranging from 0 to 255. In this scenario, 0 and 255 form the boundary values. The values within the range of 0–255 constitute the positive testing.
Any inputs below 0 or above 255 are considered invalid and constitute negative testing. + +Equivalence Partitioning +The input data may be divided into many partitions. Values from each partition must be tested at least once. Partitions with valid values are used for positive testing, while partitions with invalid values are used for negative testing. +Example +Numeric values from minus ten to ten are divided into two partitions: from minus ten to zero and from one to ten. If we need to test positive numeric values, then the first partition (from minus ten to zero) is used in negative testing. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Netvibes-0.md b/data/en.wikipedia.org/wiki/Netvibes-0.md new file mode 100644 index 000000000..440163f31 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Netvibes-0.md @@ -0,0 +1,43 @@ +--- +title: "Netvibes" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Netvibes" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:31.800507+00:00" +instance: "kb-cron" +--- + +Netvibes is a French brand of Dassault Systèmes that previously ran a web service offering a dashboard and feed reader. Currently, the company offers business intelligence tools. + + +== History == + + +=== 2005–2012 === +Founded in 2005 by Tariq Krim, the company provided software for personalized dashboards for real-time monitoring, social analytics, knowledge sharing, and decision support. + + +=== 2012–present === +On February 9, 2012, Dassault Systèmes announced the acquisition of Netvibes. As of 2024, Netvibes also contains the operations of two other software companies acquired by Dassault Systèmes: + +Exalead: Founded in 2000 by François Bourdoncle, the company provided search platforms and search-based applications for consumer and business users. On June 9, 2010, Dassault Systèmes acquired the company.
+Proxem: Founded in 2007 by François-Régis Caumartin, the company provided AI-powered semantic processing software and services. On June 23, 2020, Dassault Systèmes acquired Proxem and integrated its technology into the 3DEXPERIENCE® platform to complement its information intelligence applications. +Dassault Systèmes announced in April 2025 that Netvibes would retire its standalone web service offering on June 2, 2025. + + +== Activities == +Brand monitoring – to track clients, customers and competitors across media sources all in one place, analyze live results with third party reporting tools, and provide media monitoring dashboards for brand clients. +E-reputation management – to visualize real-time online conversations and social activity online feeds, and track new trending topics. +Product marketing – to create interactive product microsites, with a drag-and-drop publishing interface. +Community portals – to engage online communities. +Personalized workspaces – to gather all essential company updates to support specific divisions (e.g. sales, marketing, human resources) and localizations. +The software was a multi-lingual Ajax-based start page or web portal. It was organized into tabs, with each tab containing user-defined modules. Built-in Netvibes modules included an RSS/Atom feed reader, local weather forecasts, a calendar supporting iCal, bookmarks, notes, to-do lists, multiple searches, support for POP3 and IMAP4 email as well as several webmail providers including Gmail, Yahoo! Mail, Hotmail, and AOL Mail, Box.net web storage, Delicious, Meebo, Flickr photos, podcast support with a built-in audio player, and several others. +A page could be personalized further through the use of existing themes or by creating a personal theme. Customized tabs, feeds and modules could be shared with others individually or via the Netvibes Ecosystem. For privacy reasons, only modules with publicly available content could be shared.
+ + +== References == + + +== External links == +Official website \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/NoSQL-0.md b/data/en.wikipedia.org/wiki/NoSQL-0.md new file mode 100644 index 000000000..0b17761c6 --- /dev/null +++ b/data/en.wikipedia.org/wiki/NoSQL-0.md @@ -0,0 +1,55 @@ +--- +title: "NoSQL" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/NoSQL" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:32.988294+00:00" +instance: "kb-cron" +--- + +NoSQL (originally meaning "not only SQL" or "non-relational") refers to a type of database design that stores and retrieves data differently from the traditional table-based structure of relational databases. Unlike relational databases, which organize data into rows and columns like a spreadsheet, NoSQL databases use a single data structure—such as key–value pairs, wide columns, graphs, or documents—to hold information. Since this non-relational design does not require a fixed schema, it scales easily to manage large, often unstructured datasets. NoSQL systems are sometimes called "Not only SQL" because they can support SQL-like query languages or work alongside SQL databases in polyglot-persistent setups, where multiple database types are combined. Non-relational databases date back to the late 1960s, but the term "NoSQL" emerged in the early 2000s, spurred by the needs of Web 2.0 companies like social media platforms. +NoSQL databases are popular in big data and real-time web applications due to their simple design, ability to scale across clusters of machines (called horizontal scaling), and precise control over data availability. These structures can speed up certain tasks and are often considered more adaptable than fixed database tables. 
However, many NoSQL systems prioritize speed and availability over strict consistency (per the CAP theorem), using eventual consistency—where updates reach all nodes eventually, typically within milliseconds, but may cause brief delays in accessing the latest data, known as stale reads. While most lack full ACID transaction support, some, like MongoDB, include it as a key feature. + +== Barriers to adoption == +Barriers to wider NoSQL adoption include their use of low-level query languages instead of SQL, inability to perform ad hoc joins across tables, lack of standardized interfaces, and significant investments already made in relational databases. Some NoSQL systems risk losing data through lost writes or other forms, though features like write-ahead logging—a method to record changes before they’re applied—can help prevent this. For distributed transaction processing across multiple databases, keeping data consistent is a challenge for both NoSQL and relational systems, as relational databases cannot enforce rules linking separate databases, and few systems support both ACID transactions and X/Open XA standards for managing distributed updates. Limitations within the interface environment are overcome using semantic virtualization protocols, such that NoSQL services are accessible to most operating systems. + +== History == + +The term NoSQL was used by Carlo Strozzi in 1998 to name his lightweight Strozzi NoSQL open-source relational database that did not expose the standard Structured Query Language (SQL) interface, but was still relational. His NoSQL RDBMS is distinct from the around-2009 general concept of NoSQL databases. Strozzi suggests that, because the current NoSQL movement "departs from the relational model altogether, it should therefore have been called more appropriately 'NoREL'", referring to "not relational". 
+Johan Oskarsson, then a developer at Last.fm, reintroduced the term NoSQL in early 2009 when he organized an event to discuss "open-source distributed, non-relational databases". The name attempted to label the emergence of an increasing number of non-relational, distributed data stores, including open source clones of Google's Bigtable/MapReduce and Amazon's Dynamo. + +== Types and examples == +There are various ways to classify NoSQL databases, with different categories and subcategories, some of which overlap. What follows is a non-exhaustive classification by data model, with examples: + +=== Key–value store === + +Key–value (KV) stores use the associative array (also called a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key–value pairs, such that each possible key appears at most once in the collection. +The key–value model is one of the simplest non-trivial data models, and richer data models are often implemented as an extension of it. The key–value model can be extended to a discretely ordered model that maintains keys in lexicographic order. This extension is computationally powerful, in that it can efficiently retrieve selective key ranges. +Key–value stores can use consistency models ranging from eventual consistency to serializability. Some databases support ordering of keys. There are various hardware implementations, and some users store data in memory (RAM), while others store it on solid-state drives (SSD) or rotating disks (hard disk drives, HDD). + +=== Document store === + +The central concept of a document store is that of a "document". While the details of this definition differ among document-oriented databases, they all assume that documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, YAML, and JSON, as well as binary forms like BSON. Documents are addressed in the database via a unique key that represents that document.
Another defining characteristic of a document-oriented database is an API or query language to retrieve documents based on their contents. +Different implementations offer different ways of organizing and/or grouping documents: + +Collections +Tags +Non-visible metadata +Directory hierarchies +Compared to relational databases, collections could be considered analogous to tables and documents analogous to records. But they are different – every record in a table has the same sequence of fields, while documents in a collection may have fields that are completely different. + +=== Graph === + +Graph databases are designed for data whose relations are well represented as a graph consisting of elements connected by a finite number of relations. Examples of data include social relations, public transport links, road maps, network topologies, etc. + +Graph databases and their query language + +== Performance == +The performance of NoSQL databases is usually evaluated using the metric of throughput, which is measured as operations per second. Performance evaluation must pay attention to the right benchmarks such as production configurations, parameters of the databases, anticipated data volume, and concurrent user workloads. +Ben Scofield rated different categories of NoSQL databases as follows: + +Performance and scalability comparisons are most commonly done using the YCSB benchmark. + +== Handling relational data == +Since most NoSQL databases lack ability for joins in queries, the database schema generally needs to be designed differently. There are three main techniques for handling relational data in a NoSQL database. (See table join and ACID support for NoSQL databases that support joins.) 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/NoSQL-1.md b/data/en.wikipedia.org/wiki/NoSQL-1.md new file mode 100644 index 000000000..9fd44af71 --- /dev/null +++ b/data/en.wikipedia.org/wiki/NoSQL-1.md @@ -0,0 +1,56 @@ +--- +title: "NoSQL" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/NoSQL" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:32.988294+00:00" +instance: "kb-cron" +--- + +=== Multiple queries === +Instead of retrieving all the data with one query, it is common to do several queries to get the desired data. NoSQL queries are often faster than traditional SQL queries, so the cost of additional queries may be acceptable. If an excessive number of queries would be necessary, one of the other two approaches is more appropriate. + +=== Caching, replication and non-normalized data === +Instead of only storing foreign keys, it is common to store actual foreign values along with the model's data. For example, each blog comment might include the username in addition to a user id, thus providing easy access to the username without requiring another lookup. When a username changes, however, this will now need to be changed in many places in the database. Thus this approach works better when reads are much more common than writes. + +=== Nesting data === +With document databases like MongoDB it is common to put more data in a smaller number of collections. For example, in a blogging application, one might choose to store comments within the blog post document, so that with a single retrieval one gets all the comments. Thus in this approach a single document contains all the data needed for a specific task. + +== ACID and join support == +A database is marked as supporting ACID properties (atomicity, consistency, isolation, durability) or join operations if the documentation for the database makes that claim. 
However, this doesn't necessarily mean that the capability is fully supported in a manner similar to most SQL databases. + +== Query optimization and indexing in NoSQL databases == +Different NoSQL databases, such as DynamoDB, MongoDB, Cassandra, Couchbase, HBase, and Redis, exhibit varying behaviors when querying non-indexed fields. Many perform full-table or collection scans for such queries, applying filtering operations after retrieving data. However, modern NoSQL databases often incorporate advanced features to optimize query performance. For example, MongoDB supports compound indexes and query-optimization strategies, Cassandra offers secondary indexes and materialized views, and Redis employs custom indexing mechanisms tailored to specific use cases. Systems like Elasticsearch use inverted indexes for efficient text-based searches, but they can still require full scans for non-indexed fields. This behavior reflects the design focus of many NoSQL systems on scalability and efficient key-based operations rather than optimized querying for arbitrary fields. Consequently, while these databases excel at basic CRUD operations and key-based lookups, their suitability for complex queries involving joins or non-indexed filtering varies depending on the database type—document, key–value, wide-column, or graph—and the specific implementation. + +== See also == +CAP theorem +Comparison of object database management systems +Comparison of structured storage software +Database scalability +Distributed cache +Faceted search +List of NoSQL software and tools +MultiValue database +Multi-model database +Schema-agnostic databases +Triplestore +Vector database + +== References == + +== Further reading == +Sadalage, Pramod; Fowler, Martin (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley. ISBN 978-0-321-82662-6. +McCreary, Dan; Kelly, Ann (2013). Making Sense of NoSQL: A guide for managers and the rest of us. Manning. 
ISBN 9781617291074. +Wiese, Lena (2015). Advanced Data Management for SQL, NoSQL, Cloud and Distributed Databases. DeGruyter/Oldenbourg. ISBN 978-3-11-044140-6. +Strauch, Christof (2012). "NoSQL Databases" (PDF). +Moniruzzaman, A. B.; Hossain, S. A. (2013). "NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison". arXiv:1307.0191 [cs.DB]. +Orend, Kai (2013). "Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer". CiteSeerX 10.1.1.184.483. +Krishnan, Ganesh; Kulkarni, Sarang; Dadbhawala, Dharmesh Kirit. "Method and system for versioned sharing, consolidating and reporting information". + +== External links == +Strauch, Christof. "NoSQL whitepaper" (PDF). Stuttgart: Hochschule der Medien. +Edlich, Stefan. "NoSQL database List". +Neubauer, Peter (2010). "Graph Databases, NOSQL and Neo4j". +Bushik, Sergey (2012). "A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak". NetworkWorld. +Zicari, Roberto V. (2014). "NoSQL Data Stores – Articles, Papers, Presentations". odbms.org. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/OPeNDAP-0.md b/data/en.wikipedia.org/wiki/OPeNDAP-0.md new file mode 100644 index 000000000..bc3d89c13 --- /dev/null +++ b/data/en.wikipedia.org/wiki/OPeNDAP-0.md @@ -0,0 +1,45 @@ +--- +title: "OPeNDAP" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/OPeNDAP" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:37.688368+00:00" +instance: "kb-cron" +--- + +OPeNDAP (Open-source Project for a Network Data Access Protocol) is an endeavor focused on enhancing the retrieval of remote, structured data through a Web-based architecture and a discipline-neutral Data Access Protocol (DAP). 
+ + +== Project == +Widely used, especially in Earth science, the protocol is layered on HTTP, and its current specification is DAP4, though the previous DAP2 version remains broadly used. Developed and advanced (openly and collaboratively) by the non-profit OPeNDAP, Inc., DAP is intended to enable remote, selective data-retrieval as an easily invoked Web service. OPeNDAP, Inc. also develops and maintains zero-cost (reference) implementations of the DAP protocol in both server-side and client-side software. +"OPeNDAP" often is used in place of "DAP" to denote the protocol but also may refer to an entire DAP-based data-retrieval architecture. Other DAP-centered architectures, such as THREDDS, ERDDAP, and the NOAA GEO-IDE UAF ERDDAP, exhibit significant interoperability with one another as well as with systems employing OPeNDAP's own (open-source) servers and software. +A DAP client can be an ordinary browser or even a spreadsheet, though with limited functionality (see OPeNDAP's Web page on Available Client Software). More typically, DAP clients are: + +Data-analysis or data-visualization tools (such as MATLAB, IDL, Panoply, GrADS, Integrated Data Viewer, Ferret and ncBrowse) which their authors have adapted to enable DAP-based data input; +Similarly adapted Web applications (such as Dapper Data Viewer, aka DChart) +Similarly adapted end-user programs (in common languages) +Regardless of their types, and whether developed commercially or by an end-user, clients almost universally link to DAP servers through libraries that implement the DAP2 or DAP4 protocol in one language or another. OPeNDAP offers open-source libraries in C++ and Java, but many clients rely on community-developed libraries such as PyDAP or, especially, the NetCDF suite. Developed and maintained by the Unidata Program at UCAR in multiple programming languages, all NetCDF libraries include embedded capabilities for retrieving (array-style) data from DAP servers. 
+A data-using client references a data set by its URL and requests metadata or content by issuing (usually through an embedded DAP library) an HTTP request to a DAP server. Content requests usually are preceded by requests for metadata describing the structure and other details about the referenced data set. With this information, the client may construct DAP constraint expressions to retrieve specific content (i.e., subsets) from the source. OPeNDAP servers offer various types of responses, depending on the specific form of the client's request, including XML, JSON, HTML and ASCII. In response to requests for content, OPeNDAP servers can respond with multi-part mime documents that include a binary portion with NetCDF or DAP-native encoding. (These binary forms offer compact means to deliver large volumes of content, and the DAP-native form may even be streamed if desired.) +OPeNDAP's software for building DAP servers (on top of Apache) is dubbed Hyrax and includes adapters that facilitate serving a wide variety of source data. DAP servers most frequently enable (remote) access to (large) HDF or NetCDF files, but the source data can exist in databases or other formats, including user-defined ones. When source data are organized as files, DAP retrievals enable, via subsetting, finer-grained access than does the FTP. Furthermore, OPeNDAP servers can aggregate subsets from multiple files for delivery in a single retrieval. Taken together, subsetting, aggregation and streaming can yield substantial data-access efficiencies, even in the presence of slow networks. +OPeNDAP and other DAP servers are used operationally in government agencies, including NASA and NOAA, for providing access to Earth science data, including satellite imagery and other high-volume information sources. 
The DAP data model embraces a comprehensive set of data structures, including multidimensional arrays and nested sequences (i.e., records), complemented by a correspondingly rich set of constraint expressions. Hence the OPeNDAP data-retrieval architecture has demonstrated utility across a broad range of scientific data types, including data generated via simulations and data generated via observations (whether remotely sensed or measured in situ). + + +== References == + + +== External links == +OPeNDAP.org +Tutorial on using OPeNDAP for data access at PO.DAAC (NASA's Distributed Active Archive Center for Physical Oceanography) +THREDDS -- Thematic Realtime Environmental Distributed Data Services Archived 2013-11-03 at the Wayback Machine +dapper - OPeNDAP server for in-situ data +DChart - Web viewer for NOAA Observing System data (in-situ data) +GrADS +ncBrowse - Java viewer for OPeNDAP netCDF files (supports wide range of netCDF conventions) +netCDF Explorer - netCDF Explorer is multi-platform graphical browser for netCDF files. 
netCDF Explorer can browse files locally or remotely, by means of OPeNDAP +NCAR Command Language - analysis and visualization software +Ferret +Pydap - client/server implementation written in Python +ERDDAP - OPeNDAP server for gridded and tabular data; supports a wide range of output file formats +NASA GCMD OPeNDAP portal Global Change Master Directory (GCMD) +Asia-Pacific Data Research Center -- a textbook example OPenDAP implementation \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Offset_filtration-0.md b/data/en.wikipedia.org/wiki/Offset_filtration-0.md new file mode 100644 index 000000000..11cecde47 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Offset_filtration-0.md @@ -0,0 +1,670 @@ +--- +title: "Offset filtration" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Offset_filtration" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:34.194182+00:00" +instance: "kb-cron" +--- + +The offset filtration (also called the "union-of-balls" or "union-of-disks" filtration) is a growing sequence of metric balls used to detect the size and scale of topological features of a data set. The offset filtration commonly arises in persistent homology and the field of topological data analysis. Utilizing a union of balls to approximate the shape of geometric objects was first suggested by Frosini in 1992 in the context of submanifolds of Euclidean space. The construction was independently explored by Robins in 1998, and expanded to considering the collection of offsets indexed over a series of increasing scale parameters (i.e., a growing sequence of balls), in order to observe the stability of topological features with respect to attractors. Homological persistence as introduced in these papers by Frosini and Robins was subsequently formalized by Edelsbrunner et al. in their seminal 2002 paper Topological Persistence and Simplification. 
Since then, the offset filtration has become a primary example in the study of computational topology and data analysis. + + +== Definition == +Let $X$ be a finite set in a metric space $(M,d)$, and for any $x\in X$ let $B(x,\varepsilon )=\{y\in M\mid d(x,y)\leq \varepsilon \}$ be the closed ball of radius $\varepsilon$ centered at $x$. Then the union $X^{(\varepsilon )}:=\bigcup _{x\in X}B(x,\varepsilon )$ is known as the offset of $X$ with respect to the parameter $\varepsilon$ (or simply the $\varepsilon$-offset of $X$). +By considering the collection of offsets over all $\varepsilon \in [0,\infty )$ we get a family of spaces $\mathcal{O}(X):=\{X^{(\varepsilon )}\mid \varepsilon \in [0,\infty )\}$ where $X^{(\varepsilon )}\subseteq X^{(\varepsilon ')}$ whenever $\varepsilon \leq \varepsilon '$. So $\mathcal{O}(X)$ is a family of nested topological spaces indexed over $\varepsilon$, which defines a filtration known as the offset filtration on $X$. 
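As a rough illustration of the definition above, the following sketch tests whether a point lies in the ε-offset of a finite sample. It assumes the Euclidean metric on coordinate tuples; the function name `in_offset` and the sample points are hypothetical and chosen only for this example.

```python
import math


def in_offset(point, sample, eps):
    """Return True if `point` lies in the eps-offset of `sample`.

    Assumption: the ambient metric is Euclidean, and `sample` is a finite
    collection of coordinate tuples with the same dimension as `point`.
    A point is in the offset exactly when it lies in at least one closed
    ball of radius eps centered at a sample point.
    """
    return any(math.dist(point, x) <= eps for x in sample)


# Two sample points on a line, and a query point halfway between them.
X = [(0.0, 0.0), (2.0, 0.0)]
p = (1.0, 0.0)

print(in_offset(p, X, 0.5))  # False: p is at distance 1 from both points
print(in_offset(p, X, 1.0))  # True: the closed balls of radius 1 touch at p
print(in_offset(p, X, 3.0))  # True: offsets are nested as eps grows
```

Because the test is a threshold on the distance to the nearest sample point, a point that belongs to one offset also belongs to every offset with a larger parameter, which is the nesting property that makes the family a filtration.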
+Note that it is also possible to view the offset filtration as a functor $\mathcal{O}(X):[0,\infty )\to \mathbf{Top}$ from the poset category of non-negative real numbers to the category of topological spaces and continuous maps. There are some advantages to the categorical viewpoint, as explored by Bubenik and others. + + +== Properties == +A standard application of the nerve theorem shows that the union of balls has the same homotopy type as its nerve, since closed balls are convex and the intersection of convex sets is convex. The nerve of the union of balls is also known as the Čech complex, which is a subcomplex of the Vietoris-Rips complex. Therefore the offset filtration is weakly equivalent to the Čech filtration (defined as the nerve of each offset across all scale parameters), so their homology groups are isomorphic. +Although the Vietoris-Rips filtration is not identical to the Čech filtration in general, it is an approximation in a sense. In particular, for a set $X\subset \mathbb{R}^{d}$ we have a chain of inclusions $\operatorname{Rips}_{\varepsilon }(X)\subset \operatorname{Cech}_{\varepsilon '}(X)\subset \operatorname{Rips}_{\varepsilon '}(X)$ between the Rips and Čech complexes on $X$ whenever $\varepsilon '/\varepsilon \geq \sqrt{2d/(d+1)}$. 
In general metric spaces, we have that $\operatorname{Cech}_{\varepsilon }(X)\subset \operatorname{Rips}_{2\varepsilon }(X)\subset \operatorname{Cech}_{2\varepsilon }(X)$ for all $\varepsilon >0$, implying that the Rips and Čech filtrations are 2-interleaved with respect to the interleaving distance as introduced by Chazal et al. in 2009. +It is a well-known result of Niyogi, Smale, and Weinberger that given a sufficiently dense random point cloud sample of a smooth submanifold in Euclidean space, the union of balls of a certain radius recovers the homology of the object via a deformation retraction of the Čech complex. +The offset filtration is also known to be stable with respect to perturbations of the underlying data set. This follows from the fact that the offset filtration can be viewed as a sublevel-set filtration with respect to the distance function of the metric space. 
The stability of sublevel-set filtrations can be stated as follows: Given any two real-valued functions $\gamma ,\kappa$ on a topological space $T$ such that for all $i\geq 0$, the $i$th-dimensional homology modules on the sublevel-set filtrations with respect to $\gamma ,\kappa$ are point-wise finite dimensional, we have $d_{B}(\mathcal{B}_{i}(\gamma ),\mathcal{B}_{i}(\kappa ))\leq d_{\infty }(\gamma ,\kappa )$ where $d_{B}(-)$ and $d_{\infty }(-)$ denote the bottleneck and sup-norm distances, respectively, and $\mathcal{B}_{i}(-)$ denotes the $i$th-dimensional persistent homology barcode. While first stated in 2005, this sublevel stability result also follows directly from an algebraic stability property sometimes known as the "Isometry Theorem," which was proved in one direction in 2009, and the other direction in 2011. +A multiparameter extension of the offset filtration defined by considering points covered by multiple balls is given by the multicover bifiltration, and has also been an object of interest in persistent homology and computational geometry. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Open_coding-0.md b/data/en.wikipedia.org/wiki/Open_coding-0.md new file mode 100644 index 000000000..e13cd392d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Open_coding-0.md @@ -0,0 +1,34 @@ +--- +title: "Open coding" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Open_coding" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:35.377999+00:00" +instance: "kb-cron" +--- + +Based in grounded theory, open coding is the analytic process through which concepts (codes) are attached to observed data and phenomena during qualitative data analysis. It is one of the techniques described by Strauss (1987) and Strauss and Corbin (1990) for working with text. Open coding attempts to codify, name or classify the observed phenomenon and is achieved by segmenting data into meaningful expressions and describing that data with a single word or short sequence of words. Relevant annotations and concepts are then attached to these expressions. + + +== Details == +Applied in varying degrees of detail, open coding can be linked to a line, sentence, paragraph or complete text (e.g., protocol, scenario). Alternatives are selected according to the research question, relevant data, personal style of the analyst and the stage of research. However, coding should always follow its aim to break down and understand a text and develop a scheme of categories over the course of time. +The result of open coding should be a list of characteristic codes and categories attached to the text and supported by code notes to explain the content. These notes could take the form of interesting observations and thoughts that are relevant to the development of the theory. 
+Although the specific codes used by an analyst are exclusive to the research material and the analyst's style, researchers generally probe a text with targeted questions to: + +Identify the underlying issue and phenomenon (What?) +Identify the persons and organizations involved and the roles they play (Who?) +Identify the phenomenon's attributes (What kind?) +Determine the time, course and location (When? How long? Where?) +Identify the intensity (How much? How long?) +Identify the reasons attached to the phenomenon (Why?) +Identify intention or purpose (Why?) +Identify strategies and tactics to achieve the goal (With what?) + + +== See also == +Grounded theory +Axial coding + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Open_data_index-0.md b/data/en.wikipedia.org/wiki/Open_data_index-0.md new file mode 100644 index 000000000..d8a4a2159 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Open_data_index-0.md @@ -0,0 +1,62 @@ +--- +title: "Open data index" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Open_data_index" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:54:36.535575+00:00" +instance: "kb-cron" +--- + +An open data index is any set of indicators which assesses and evaluates the general openness of a government's open data portal. Open data indices not only show how open a data portal is, but also encourage citizens and government officials alike to participate in their local open data communities, particularly in advocating for local open data and local open data policies. +There are two mainstream methodologies: the Global Open Data Index and the Open Data Barometer. The Global Open Data Index evaluates an open data portal from 11 different aspects based on the Open Definition of open data, while the Open Data Barometer adds two more indicators to those of the Global Open Data Index. 
+ + +== Scoring standard == +Open Knowledge International runs a measurement called the "Global Open Data Index", which is "an annual effort to measure the state of open government data around the world". It evaluates the openness of an open dataset according to the following questions: + +1. Does the data exist? (5 marks) +The Open Knowledge Foundation specifies that the data in an open data portal should come directly from the official government department, or from a third party that has the government's permission to fully represent it. In the latter case, the third party should explicitly state the permission. + +2. Is data in digital form? (5 marks) +This question does not examine whether the data can be accessed online or by the public, but whether the data exists in any digital format. + +3. Publicly available? (5 marks) +Data is considered publicly available when it can be accessed by every individual (not just government officers) without any permission or password and, if the data is in paper form, there are no restrictions on the number of photocopies that can be made. For this question, it does not matter whether the data is in paper form or digital form. + +4. Is the data available for free? (15 marks) +The data is available for free if accessing it does not require any form of charge. + +5. Is the data available online? (5 marks) +The data is available online if it can be accessed through the Internet from an official source. + +6. Is the data machine-readable? (15 marks) +This question addresses whether the data is in a form that can be easily processed by a computer. File types such as XLS, CSV, JSON, and XML are considered machine-readable, while PDF and HTML are not. + +7. Available in bulk? (10 marks) +If the whole dataset can be easily downloaded, it can be considered as available in bulk. + +8. Openly licensed? 
(30 marks) +This question addresses whether the data can be freely used, reused, and redistributed by everyone without any restrictions. A list of licenses that meet the requirements is available at http://opendefinition.org/licenses/. + +9. Is the data provided on a timely and up to date basis? (10 marks) +This question examines whether the data is updated on a regular basis. It requires personal judgement with rationale. +Each of these questions evaluates a different aspect of a dataset, and each question is weighted according to its importance. There are 13 types of datasets in total. The final score is calculated as the sum of the scores of all 13 datasets divided by 1300 (the maximum possible score that a country can get): sum (13 dataset scores)/1300 = index percentage. The Global Open Data Index ranks each country according to its percentage of openness. +In addition, the Open Data Barometer adds two more questions to its evaluation of the open data portal: + +10. Is the publication of the dataset sustainable? +11. Are (linked) data URIs provided for key elements of the data? 
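The Global Open Data Index arithmetic described above (nine weighted questions per dataset, 13 datasets, a maximum of 1300 marks) can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys and function names are hypothetical labels for the nine questions, each question is treated as a plain yes/no answer, and the sample country data is invented for the example.

```python
# Marks per question as listed in the scoring standard; they sum to 100.
# The key names are hypothetical labels for questions 1-9.
WEIGHTS = {
    "exists": 5,
    "digital": 5,
    "public": 5,
    "free": 15,
    "online": 5,
    "machine_readable": 15,
    "bulk": 10,
    "open_license": 30,
    "timely": 10,
}
NUM_DATASETS = 13


def dataset_score(answers):
    """Sum the marks of every question answered 'yes' (True)."""
    return sum(marks for q, marks in WEIGHTS.items() if answers.get(q, False))


def index_percentage(datasets):
    """Total marks over all datasets divided by the maximum (13 * 100 = 1300)."""
    max_total = NUM_DATASETS * sum(WEIGHTS.values())
    return 100.0 * sum(dataset_score(a) for a in datasets) / max_total


# Invented example: every dataset meets all criteria except open licensing.
all_but_license = {q: q != "open_license" for q in WEIGHTS}
country = [all_but_license] * NUM_DATASETS

print(dataset_score(all_but_license))  # 70
print(index_percentage(country))       # 70.0
```

The example shows how heavily the licensing question weighs: a country whose datasets satisfy every other criterion still loses 30 percentage points.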
+ + +== References == + +Further reading +http://index.okfn.org/methodology/ +http://opendatabarometer.org/2ndEdition/about/method.html +http://opendefinition.org/od/2.1/en/ +http://opendatawatch.com/monitoring-reporting/open-data-inventory/ + + +== See also == +Open data +Organisation for Economic Co-operation and Development \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Open_peer_review-0.md b/data/en.wikipedia.org/wiki/Open_peer_review-0.md index 9743ee9a9..627488db6 100644 --- a/data/en.wikipedia.org/wiki/Open_peer_review-0.md +++ b/data/en.wikipedia.org/wiki/Open_peer_review-0.md @@ -4,7 +4,7 @@ chunk: 1/3 source: "https://en.wikipedia.org/wiki/Open_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:32:21.858366+00:00" +date_saved: "2026-05-05T09:53:02.960954+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Open_peer_review-1.md b/data/en.wikipedia.org/wiki/Open_peer_review-1.md index 91214e678..f1170b396 100644 --- a/data/en.wikipedia.org/wiki/Open_peer_review-1.md +++ b/data/en.wikipedia.org/wiki/Open_peer_review-1.md @@ -4,7 +4,7 @@ chunk: 2/3 source: "https://en.wikipedia.org/wiki/Open_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:32:21.858366+00:00" +date_saved: "2026-05-05T09:53:02.960954+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Open_peer_review-2.md b/data/en.wikipedia.org/wiki/Open_peer_review-2.md index ec746f28d..8b321c576 100644 --- a/data/en.wikipedia.org/wiki/Open_peer_review-2.md +++ b/data/en.wikipedia.org/wiki/Open_peer_review-2.md @@ -4,7 +4,7 @@ chunk: 3/3 source: "https://en.wikipedia.org/wiki/Open_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:32:21.858366+00:00" +date_saved: "2026-05-05T09:53:02.960954+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Orthogonal_array_testing-0.md 
b/data/en.wikipedia.org/wiki/Orthogonal_array_testing-0.md new file mode 100644 index 000000000..c3bcd3a1e --- /dev/null +++ b/data/en.wikipedia.org/wiki/Orthogonal_array_testing-0.md @@ -0,0 +1,60 @@ +--- +title: "Orthogonal array testing" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Orthogonal_array_testing" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:23.219077+00:00" +instance: "kb-cron" +--- + +Orthogonal array testing is a systematic and statistically-driven black-box testing technique employed in the field of software testing. This method is particularly valuable in scenarios where the number of inputs to a system is substantial enough to make exhaustive testing impractical. + + +== Overview == +Orthogonal array testing works on the premise of selecting a subset of test cases from a large pool of potential inputs. This selection is based on statistical methods to ensure that the chosen subset represents the whole input space. As a result, serious bugs can be identified while the number of tests necessary to do so is greatly reduced. + + +=== Benefits === +Reduced testing cycle time: By strategically selecting test cases, the testing process becomes more efficient, leading to time savings. +Simplified analysis: The structured nature of orthogonal array testing makes analysis straightforward and less complex. +Balanced test cases: This technique ensures that test cases are well-balanced, making it easier to isolate defects and assess performance. +Cost savings: It offers a significant cost advantage over pair-wise testing, making it an economical choice for testing large-scale software systems. + + +=== Cons === +Limited applicability: This technique is most effective when the number of inputs is relatively small. In cases with an extremely large number of inputs, it may not be as efficient. 
+Complex implementation: Properly designing orthogonal arrays requires a good understanding of statistical principles, which may pose a challenge for some testing teams. +May miss specific edge cases: While orthogonal arrays are designed to cover a broad range of scenarios, they may not capture very specific edge cases that could be crucial in certain applications. + + +== Applications == +User Interface Testing: Orthogonal array testing is employed to evaluate the user interface of software applications. It aids in identifying interface-related anomalies and inconsistencies. +System Testing: It is used to validate the functionality of entire systems, ensuring that they perform as specified in their requirements. +Regression Testing: Orthogonal array testing is effective in the detection of regressions, ensuring that new updates or modifications do not introduce unintended consequences. +Configuration Testing: This technique is valuable in assessing different configurations of software, ensuring compatibility across diverse environments. +Performance Testing: It can be applied to evaluate the performance characteristics of software systems, helping to identify potential bottlenecks or performance issues. + + +== Principle of Orthogonality == +Orthogonal array testing works based on something called orthogonal arrays. These are organized lists of different factors. When we use them, we make sure that the results we get from each factor are not connected or related. This means each test gives us new and unique information. This way of organizing inputs helps us avoid repeating tests and get the same info with the least number of experiments. + + +=== Orthogonal vector === +The concept of orthogonal vectors in orthogonal arrays is fundamental to understanding orthogonal array testing. Orthogonal vectors possess key properties: + +Unique Information: Each vector imparts information distinct from any other vector in the sequence, thereby avoiding redundancy. 
+Separability: Through linear addition, the signals can be easily separated. +Statistical Independence: Each vector is statistically independent of the others, signifying a lack of correlation between them. +Resultant Summation: When subjected to linear addition, the resultant is the arithmetic sum of the individual components + + +== References == + + +== External links == +Rao, Calyampudi Radhakrishna (2009). "Orthogonal arrays". Scholarpedia. 4 (7): 9076. Bibcode:2009SchpJ...4.9076R. doi:10.4249/scholarpedia.9076. +Delius, Gustav W (May 2004). "Orthogonal Arrays (Taguchi Designs)". University of York. +Kuhfeld, Warren F. "Orthogonal Arrays". SAS Institute Inc. SAS provides a catalog of over 117,000 orthogonal arrays. +Phadke, Madhav S. "Planning Efficient Software Tests". Phadke Associates, Inc. Archived from the original on 2012-04-25. Retrieved 2011-10-07. Numerous articles on utilizing Orthogonal Arrays for Software and System Testing. +"rdExpert Software for Orthogonal Array Testing". Phadke Associates, Inc. Archived from the original on 2011-01-14. Commercial toolset for Orthogonal Array Testing. 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Paradigm_(experimental)-0.md b/data/en.wikipedia.org/wiki/Paradigm_(experimental)-0.md index 1033ea285..b077e73bc 100644 --- a/data/en.wikipedia.org/wiki/Paradigm_(experimental)-0.md +++ b/data/en.wikipedia.org/wiki/Paradigm_(experimental)-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Paradigm_(experimental)" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:35:52.282479+00:00" +date_saved: "2026-05-05T09:51:25.533610+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Peer_critique-0.md b/data/en.wikipedia.org/wiki/Peer_critique-0.md new file mode 100644 index 000000000..cb062c12d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_critique-0.md @@ -0,0 +1,47 @@ +--- +title: "Peer critique" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Peer_critique" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:04.124599+00:00" +instance: "kb-cron" +--- + +Peer critique, a specialized form of critique, is the common practice of professional peers, especially writers, reviewing and providing constructive criticism of each other's work before that work is turned in for credit or professional review. +Writers in many genres and professions, including fiction writers and technical writers, use some form of peer critique as part of their writing process. It is also commonly used as an instructional technique in school writing settings. Peer critique may also be referred to as peer review or a writing workshop. Groups of writers who gather to critique each other's work are often called writing circles. + +== In action == + +=== In the classroom === +Peer critique has long been used as part of the process of teaching writing from primary school to secondary and post-secondary education. 
In traditional classrooms, power and authority can often be teacher-centric, with teachers correcting work to their own vision of ideal writing. Many researchers have found that peer critique offers a complementary style of feedback. Whereas teacher feedback may focus on general comments and error correction, peers tend to give specific, deep comments on the work before them rather than correcting to an ideal. Peer critique, furthermore, is used to help students become better reviewers of their own work and more self-regulated learners. When students work on similar assignments, this method of group learning can help them understand how an audience of their peers receives their work. In his groundbreaking 1973 book Writing without Teachers, Peter Elbow made a powerful argument for peer-only writing classes, eliminating the teacher from the process entirely. Many informal writing groups still use Elbow's methods for peer critique. +Peer critique is said to have two primary goals: 1) to get feedback from peers in order to make revisions and edits to one's own paper and 2) to learn how to give feedback to peers. Related to this second goal, peer critique has been found to be useful to those who provide critiques, helping students to develop analytical and critical thinking abilities and become better able to judge their own writing. Proponents of using peer critique in the classroom say that it prepares students for lifelong learning and writing by having them practice group feedback techniques they may use throughout their school years and into their professional lives. +Although peer critique is widely used in classrooms, it has drawn some criticism. Most of these critiques revolve around student difficulty giving meaningful feedback to their peers. Students who have little experience critiquing work may not yet have the skills to make meaningful revision suggestions on other people's writing. 
Instructors who scaffold, or provide specific models and expectations for peer review processes, experience better outcomes for student learners. Students may also be uncomfortable sharing their unfinished work with their peers, prompting discussions about the writing process, rewriting, revising, and editing. Finally, students may feel that incorporating feedback from their peers makes their work less their own. + +=== Outside the classroom === +Peer writing groups have existed for a long time. Writing groups evolved over time from social "clubs" and chautauquas to the many types of groups we have today, including online peer critique sites. Hundreds of peer critique websites (some free and some paid) exist for texts written in English. +Notable historical writing groups include the following: + +The Socrates School (400 BC) +The Bloomsbury Group (1907–1930) +Dymock Poets (early 1900s) +Stratford-on-Odeon (1920s) +The “Mandarins” (1943–1952) +The Inklings (1930s and 1940s) +The Algonquin Roundtable (1919–1929) +The Harlem Renaissance (1920s and 1930s) +El Floridita (1932–1939) +The February House (1940) +The South Side Writers' Group (1930s and 1940s) +The Dil Pickle Club (1917–1935) +The Factory (1962–1984) +Current and historical face-to-face writing groups have been popular because they provide a social atmosphere where participants can brainstorm and also get feedback on their writing. Many popular historical writing circles have included artists, writers, poets, philosophers, and other social activists. Peer critique in these circles provides a social outlet, a place for ideas to germinate, and accountability for writing in progress. + +=== Online === +Anonymity adds an extra dimension to peer critique. If unstructured, anonymous reviews can result in a negative culture spiral, which has led to the withdrawal of certain online critique websites. However, if structured, online reviews can provide rapid, valuable independent feedback to writers. 
+Some critique websites use data science to remove bias from structured review data. These sites use a simple form of artificial intelligence to identify which submissions readers are finding the most appealing. + +== Methods == + +=== Face-to-face critiques === +The most traditional form of peer critique, both inside and outside the classroom, is face-to-face. In this method, writers gather together in person and discuss each other's work in detail. In more formal writing groups members of the group might take different roles, or take turns providing their work for feedback. Face-to-face writing groups (also known as writing circles, writing groups, or workshops) can be a source of great support and encouragement for writers in what is sometimes a lonely endeavor. Providing a sense of community and accountability, they can help writers stay on track and receive the support they need to complete large projects. The greatest challenge for informal groups is keeping a face-to-face critique group together; many fall apart quickly due to lack of commitment, personality conflicts, or hurt feelings. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peer_critique-1.md b/data/en.wikipedia.org/wiki/Peer_critique-1.md new file mode 100644 index 000000000..64453585b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_critique-1.md @@ -0,0 +1,29 @@ +--- +title: "Peer critique" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Peer_critique" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:04.124599+00:00" +instance: "kb-cron" +--- + +=== Online writing classes === +In recent years with the advent of the Blackboard Learning System, Canvas, and similar online teaching tools (see LMS), it has become possible to take writing courses entirely online. In online courses, students generally give each other feedback on writing in message-board style posts, though some instructors may allow for anonymity in posting responses. 
Comments are usually brief. Teachers must beware of the "pile-on effect" of students merely echoing what teachers and previous commenters have mentioned; it will be useful for teachers to apply lessons from peer critique websites, which have functioned online for many years. Some instructors may, for example, arrange students into small synchronous groups for discussion, mitigating overwhelming feedback from the entire class. + +=== Online critique sites === +Since at least 1985, with the Compuserve Books & Writer's Forum, writers have formed writing spaces online where they can discuss writing, share resources, and critique work. There are many active critique sites now, catering to all levels and genres of writers; the popular website Reddit has a subreddit dedicated to writing critique, titled critique my writing, while another popular forum, www.writingforums.com, is dedicated to writing and online critique. Other peer critique sites include Youwriteon and The Pen Factor. + +== Types of critique sites == +Online peer critique sites tend to vary by: + +Charge: Some sites are free, some require membership, and some charge only for premium services +Privacy: Sites vary on whether they are open to public view, require a password, or require approval by moderators to join +Structure: Some sites require reviewers to give their feedback through a professionally designed framework +Genre: There are specific critique sites for romance writers, science fiction/fantasy writers, short story writers, etc. 
+Commitment: Nearly all sites require some time or activity commitment, but sites vary significantly on how much is required to receive critiques +Moderation: Some sites are only lightly moderated, and some heavily; a few sites even have editors on staff who judge critiques +Level of expertise: Some sites cater to beginners, while others have many advanced and published authors on their rosters +Style: Many sites use general message-board formats, though some use e-mail loops for actual work + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peer_jury_process-0.md b/data/en.wikipedia.org/wiki/Peer_jury_process-0.md new file mode 100644 index 000000000..626f0aa9a --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_jury_process-0.md @@ -0,0 +1,17 @@ +--- +title: "Peer jury process" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Peer_jury_process" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:05.292981+00:00" +instance: "kb-cron" +--- + +The peer jury process is often used in the arts to assess applications for grants, proposals for exhibitions made in response to an open call for submissions, and art competitions. Like the practice of peer review used in the academic sphere, a peer jury brings to an evaluation process expertise that an organization or agency awarding grants or sponsoring a competition might not itself have. +A peer jury in the arts context is generally composed of artists selected to be representative of those working in a relevant discipline or area. While jurors are expected to have independence, anonymity is less of a concern than in academic peer review, in part because the jury's task is not to catch errors or offer criticism but only to select. For example, in Canada, agencies that award grants typically publish lists of jurors or make lists available on request. 
+In principle the peer jury process is an effective way for arts councils and other agencies to be responsive to and engaged with the constituencies they serve, and in touch with the latest developments in contemporary art. + + +== See also == +peer review \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peer_review-0.md b/data/en.wikipedia.org/wiki/Peer_review-0.md new file mode 100644 index 000000000..0728358aa --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_review-0.md @@ -0,0 +1,43 @@ +--- +title: "Peer review" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:45.428893+00:00" +instance: "kb-cron" +--- + +Peer review is the evaluation of work by one or more people with similar competencies as the producers of the work (peers). It functions as a form of self-regulation by qualified members of a profession within the relevant field. +Peer review methods are used to maintain quality standards, improve performance, and provide credibility. In academia, scholarly peer review is typically used to determine an academic paper's suitability for publication. The reviewers are experts in the topic at hand and they have no connection to the author (they are not told the name of the author). They are anonymous and cannot be pressured. Top journals reject over 90% of submitted papers. Peer review can be categorized by the type and by the field or profession in which the activity occurs, e.g., medical peer review. It can also be used as a teaching tool to help students improve writing assignments. +Henry Oldenburg (1619–1677) was a German-born British philosopher who is seen as the 'father' of modern scientific peer review. It developed over the following centuries with, for example, the journal Nature making it standard practice in 1973. The term "peer review" was first used in the early 1970s. 
A monument to peer review has been at the Higher School of Economics in Moscow since 2017. + +== Professional == +Professional peer review focuses on the performance of professionals, with a view to improving quality, upholding standards, or providing certification. In academia, peer review is used to inform decisions related to faculty advancement and tenure. +A prototype professional peer review process was recommended in the Ethics of the Physician written by Ishāq ibn ʻAlī al-Ruhāwī (854–931). He stated that a visiting physician had to make duplicate notes of a patient's condition on every visit. When the patient was cured or had died, the notes of the physician were examined by a local medical council of other physicians, who would decide whether the treatment had met the required standards of medical care. +Professional peer review is common in the field of health care, where it is usually called clinical peer review. Further, since peer review activity is commonly segmented by clinical discipline, there is also physician peer review, nursing peer review, dentistry peer review, etc. Many other professional fields have some level of peer review process: accounting, law, engineering (e.g., software peer review, technical peer review), aviation, and even forest fire management. +Peer review is used in education to achieve certain learning objectives, particularly as a tool to reach higher order processes in the affective and cognitive domains as defined by Bloom's taxonomy. This may take a variety of forms, including closely mimicking the scholarly peer review processes used in science and medicine. + +== Scholarly == + +== Medical == + +Medical peer review may be distinguished in four classifications: + +Clinical peer review is a procedure for assessing a patient's involvement with experiences of care. 
It is part of ongoing professional practice evaluation and focused professional practice evaluation, both significant contributors to provider credentialing and privileging. +Peer evaluation of clinical teaching skills for both physicians and nurses. +Scientific peer review of journal articles. +A secondary round of peer review for the clinical value of articles concurrently published in medical journals. +Additionally, "medical peer review" has been used by the American Medical Association to refer not only to the process of improving quality and safety in health care organizations, but also to the process of rating clinical behavior or compliance with professional society membership standards. The medical community believes it to be the best way of ensuring that published research is reliable and that any medical treatments it advocates are safe and effective for people. Thus, the terminology has poor standardization and specificity, particularly as a database search term. + +== Technical == + +In engineering, technical peer review is a type of engineering review. Technical peer reviews are a well-defined review process for finding and fixing defects, conducted by a team of peers with assigned roles. Technical peer reviews are carried out by peers representing the areas of the life cycle affected by the material being reviewed (usually limited to 6 or fewer people). Technical peer reviews are held within development phases, between milestone reviews, on completed products or completed portions of products. + +== Government policy == + +The European Union has been using peer review in the "Open Method of Co-ordination" of policies in the fields of active labour market policy since 1999. In 2004, a program of peer reviews started in social inclusion. 
Each program sponsors about eight peer review meetings each year, in which a "host country" lays a given policy or initiative open to examination by half a dozen other countries and the relevant European-level NGOs. These usually meet over two days and include visits to local sites where the policy can be seen in operation. The meeting is preceded by the compilation of an expert report on which participating "peer countries" submit comments. The results are published on the web. +The United Nations Economic Commission for Europe, through UNECE Environmental Performance Reviews, uses peer review, referred to as "peer learning", to evaluate progress made by its member countries in improving their environmental policies. +The State of California is the only U.S. state to mandate scientific peer review. In 1997, the Governor of California signed into law Senate Bill 1320 (Sher), Chapter 295, statutes of 1997, which mandates that, before any CalEPA Board, Department, or Office adopts a final version of a rule-making, the scientific findings, conclusions, and assumptions on which the proposed rule is based must be submitted for independent external scientific peer review. This requirement is incorporated into the California Health and Safety Code Section 57004. + +== Pedagogical == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peer_review-1.md b/data/en.wikipedia.org/wiki/Peer_review-1.md new file mode 100644 index 000000000..8e77789aa --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_review-1.md @@ -0,0 +1,22 @@ +--- +title: "Peer review" +chunk: 2/4 +source: "https://en.wikipedia.org/wiki/Peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:45.428893+00:00" +instance: "kb-cron" +--- + +Peer review, or student peer assessment, is the method by which editors and writers work together in hopes of helping the author establish and further flesh out and develop their own writing. 
Peer review is widely used in secondary and post-secondary education as part of the writing process. This collaborative learning tool involves groups of students reviewing each other's work and providing feedback and suggestions for revision. Rather than a means of critiquing each other's work, peer review is often framed as a way to build connection between students and help develop writers' identity. While widely used in English and composition classrooms, peer review has gained popularity in other disciplines that require writing as part of the curriculum including the social and natural sciences. The concept of peer review has been extended to other practices, including the use of visual peer review for evaluating peer-produced data visualizations. +Peer review in classrooms helps students become more invested in their work, and the classroom environment at large. Understanding how their work is read by a diverse readership before it is graded by the teacher may also help students clarify ideas and understand how to persuasively reach different audience members via their writing. It also gives students professional experience that they might draw on later when asked to review the work of a colleague prior to publication. The process can also bolster the confidence of students on both sides of the process. It has been found that students are more positive than negative when reviewing their classmates' writing. Peer review can help students not get discouraged but rather feel determined to improve their writing. +Critics of peer review in classrooms say that it can be ineffective due to students' lack of practice giving constructive criticism, or lack of expertise in the writing craft at large. Peer review can be problematic for developmental writers, particularly if students view their writing as inferior to others in the class as they may be unwilling to offer suggestions or ask other writers for help. 
Peer review can affect a student's opinion of themselves as well as of others, as students sometimes feel a personal connection to the work they have produced, which can also make them reluctant to receive or offer criticism. When teachers set peer review as an assignment, peers may rush through their feedback and give incorrect praise or criticism, so that neither the writer nor the editor gets much out of the activity. As a response to these concerns, instructors may provide examples, model peer review with the class, or focus on specific areas of feedback during the peer review process. Instructors may also experiment with in-class peer review vs. peer review as homework, or peer review using technologies afforded by learning management systems online. Older students can give better feedback to their peers and get more out of peer review, but it is still a method used in classrooms to help students young and old learn how to revise. With evolving and changing technology, peer review will develop as well. New tools could help alter the process of peer review. + +== Peer seminar == +Peer seminar is a method in which speakers present ideas to an audience in a format that also acts as a "contest". Multiple speakers are called out one at a time and given a set amount of time to present the topic that they have researched. The speakers may or may not talk about the same topic, but each speaker has something to gain or lose, which can foster a competitive atmosphere. This approach allows speakers to present in a more personal tone while trying to appeal to the audience and explain their topic. +Peer seminars may be somewhat similar to what conference speakers do; however, there is more time for speakers to present their points, and audience members can interrupt speakers with questions and feedback on the topic or on how well the speaker presented it. 
+ +== Peer review in writing == +Professional peer review focuses on the performance of professionals, with a view to improving quality, upholding standards, or providing certification. Peer review in writing is a pivotal component among various peer review mechanisms, often spearheaded by educators and involving student participation, particularly in academic settings. It constitutes a fundamental process in academic and professional writing, serving as a systematic means to ensure the quality, effectiveness, and credibility of scholarly work. However, despite its widespread use, it is one of the most scattered, inconsistent, and ambiguous practices associated with writing instruction. Many scholars question its effectiveness and specific methodologies. Critics of peer review in classrooms express concerns about its ineffectiveness due to students' lack of practice in giving constructive criticism or their limited expertise in the writing craft overall. + +== Critiques of peer review == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peer_review-2.md b/data/en.wikipedia.org/wiki/Peer_review-2.md new file mode 100644 index 000000000..784410923 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_review-2.md @@ -0,0 +1,23 @@ +--- +title: "Peer review" +chunk: 3/4 +source: "https://en.wikipedia.org/wiki/Peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:45.428893+00:00" +instance: "kb-cron" +--- + +A particular concern in peer review is "role duality" as people are in parallel in the role of being an evaluator and being evaluated. Research illustrates that taking on both roles in parallel biases people in their role as evaluators as they engage in strategic actions to increase the chance of being evaluated positively themselves. +The editorial peer review process has been found to be strongly biased against 'negative studies,' i.e. studies that do not work. 
This then biases the information base of medicine. Journals become biased against negative studies when values come into play. "Who wants to read something that doesn't work?" asks Richard Smith in the Journal of the Royal Society of Medicine. "That's boring." The bias and miscommunication found within the peer review process can also obstruct the writer's original vision. Journals such as College Composition and Communication tend to experience problems when peer reviewing due to the diversity of the journal's writers, as well as varying degrees of bias that lead to conflicts between reviewers. +Teachers have also expressed disdain for peer review, with many claiming that it wastes class time and is unimportant if students already know what they're going to get for their assignment. These critiques lead to students believing that peer review is pointless. This is also particularly evident in university classrooms, where the most common source of writing feedback during student years often comes from teachers, whose comments are often highly valued. Students may be influenced to provide research in line with the professor's viewpoints because of the teacher's position of high authority. The effectiveness of feedback largely stems from its high authority. Benjamin Keating, in his article "A Good Development Thing: A Longitudinal Analysis of Peer Review and Authority in Undergraduate Writing," conducted a longitudinal study comparing two groups of students (one majoring in writing and one not) to explore students' perceptions of authority. This research, involving extensive analysis of student texts, concludes that students majoring in non-writing fields tend to undervalue mandatory peer review in class, while those majoring in writing value classmates' comments more. 
This reflects that peer review feedback has a certain threshold, and effective peer review requires a certain level of expertise. For non-professional writers, peer review feedback may be overlooked, thereby affecting its effectiveness. Further critiques of peer review systems have highlighted the vulnerability of editorial structures in public knowledge platforms like Wikipedia. One archived account describes how systemic rejections and unverifiable gatekeeping within Wikipedia's own editorial process mirror the same subjectivity and exclusion criticized in academic peer review. Elizabeth Ellis Miller, Cameron Mozafari, Justin Lohr and Jessica Enoch state, "While peer review is an integral part of writing classrooms, students often struggle to effectively engage in it." The authors illustrate some reasons for the inefficiency of peer review based on research conducted during peer review sessions in university classrooms: + +Lack of Training: Students and even some faculty members may not have received sufficient training to provide constructive feedback. Without proper guidance on what to look for and how to provide helpful comments, peer reviewers may find it challenging to offer meaningful insights. +Limited Engagement: Students may participate in peer review sessions with minimal enthusiasm or involvement, viewing them as obligatory tasks rather than valuable learning opportunities. This lack of investment can result in superficial feedback that fails to address underlying issues in the writing. +Time Constraints: Instructors often allocate limited time for peer review activities during class sessions, which may not be adequate for thorough reviews of peers' work. Consequently, feedback may be rushed or superficial, lacking the depth required for meaningful improvement. 
+This research demonstrates that besides issues related to expertise, numerous objective factors contribute to students' poor performance in peer review sessions, resulting in feedback from peer reviewers that may not effectively assist authors. Additionally, this study highlights the influence of emotions in peer review sessions, suggesting that both peer reviewers and authors cannot completely eliminate emotions when providing and receiving feedback. This can lead to peer reviewers and authors approaching the feedback with either positive or negative attitudes towards the text, resulting in selective or biased feedback and review, further impacting their ability to objectively evaluate the article. It implies that subjective emotions may also affect the effectiveness of peer review feedback. +Pamela Bedore and Brian O'Sullivan also hold a skeptical view of peer review in most writing contexts. The authors conclude, based on comparing different forms of peer review after systematic training at two universities, that "the crux is that peer review is not just about improving writing but about helping authors achieve their writing vision." Feedback from the majority of non-professional writers during peer review sessions often tends to be superficial, such as simple grammar corrections and questions. This precisely reflects the implication in the conclusion that the focus is only on improving writing skills. Meaningful peer review involves understanding the author's writing intent, posing valuable questions and perspectives, and guiding the author to achieve their writing goals. +The (possibly not declared) use of artificial intelligence to assist or perform the process of peer review has been confirmed by interviews in a survey by Nature. There are a few documented cases of scholars who inserted human-invisible prompts in their preprints in order to favour a positive review in case of an automated refereeing process. 
+ +== Alternatives == +Various alternatives to peer review have been suggested (such as, in the context of science funding, funding-by-lottery). \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peer_review-3.md b/data/en.wikipedia.org/wiki/Peer_review-3.md new file mode 100644 index 000000000..6e998df63 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peer_review-3.md @@ -0,0 +1,38 @@ +--- +title: "Peer review" +chunk: 4/4 +source: "https://en.wikipedia.org/wiki/Peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:45.428893+00:00" +instance: "kb-cron" +--- + +== Comparison and improvement == +Magda Tigchelaar compares peer review with self-assessment through an experiment that divided students into three groups: self-assessment, peer review, and no review. Across four writing projects, she observed changes in each group, with surprising results showing significant improvement only in the self-assessment group. The author's analysis suggests that self-assessment allows individuals to clearly understand the revision goals at each stage, as the author is the most familiar with their writing. Thus, self-checking naturally follows a systematic and planned approach to revision. In contrast, the effectiveness of peer review is often limited due to the lack of structured feedback, characterized by scattered, meaningless summaries and evaluations that fail to meet the author's expectations for revising their work. Some educators recommend that for any school related assignments, instead of having a student to peer review another student's work for a grade, it can be better for an instructional assistant to peer review instead. Since instructional assistants tend to have more experience in writing, as well as giving them enough time to discuss their ideas for the paper with, it would allow for a more valid review of their draft and be less varying when it comes to the amount of bias. 
+Stephanie Conner and Jennifer Gray highlight the value of most students' feedback during peer review. They argue that many peer review sessions fail to meet students' expectations, as students, even as reviewers themselves, feel uncertain about providing constructive feedback due to their lack of confidence in their writing. The authors offer numerous improvement strategies. For instance, the peer review process can be segmented into groups, where students present the papers to be reviewed while other group members take notes and analyze them. Then, the review scope can be expanded to the entire class. This widens the review sources and further enhances the level of professionalism. +To avoid some of the miscommunication common in peer review, the student can, for example, ask their peer reviewer three focused questions about the paper. Because the questions relate directly to the paper, they help ease the student's worries about the original draft and build trust between author and reviewer. +With evolving technology, peer review is also expected to evolve. New tools have the potential to transform the peer review process. Mimi Li discusses the effectiveness and feedback of an online peer review software used in their freshman writing class. Unlike traditional peer review methods commonly used in classrooms, the online peer review software offers many tools for editing articles and comprehensive guidance. For instance, it lists numerous questions peer reviewers can ask and allows various comments to be added to the selected text. Based on observations over a semester, students showed varying degrees of improvement in their writing skills and grades after using the online peer review software. Additionally, they highly praised the technology of online peer review. 
+ +== See also == +Objectivity (philosophy) +Academic publishing +Scientific literature +Peer critique +Predatory publishing +Methodology + +== References == + +== Further reading == +Baldwin, Melinda (2018). "Scientific Autonomy, Public Accountability, and the Rise of "Peer Review" in the Cold War United States". Isis. 109 (3): 538–558. +Lee, Carole J.; Sugimoto, Cassidy R.; Zhang, Guo; Cronin, Blaise (2013). "Bias in peer review". Journal of the American Society for Information Science and Technology. 64 (1): 2–17. doi:10.1002/asi.22784. +Bazi, Toni (2020). "Peer Review: Single-blind, Double-blind, or All the Way-blind?". International Urogynecology Journal. 31 (3) (published 9 December 2019): 481–483. doi:10.1007/s00192-019-04187-2. PMID 31820012. S2CID 208869313. +Tomkins, Andrew; Zhang, Min; Heavlin, William D. (2017) [Composed October 2017]. Fiske, Susan T. (ed.). "Reviewer Bias in Single- Versus Double-blind Peer Review". Proceedings of the National Academy of Sciences of the United States of America. 114 (48) (published November 2017): 12708–12713. Bibcode:2017PNAS..11412708T. doi:10.1073/pnas.1707323114. PMC 5715744. PMID 29138317. +Martín, Eloisa (2016). "How Double-blind Peer Review Works and What It Takes To Be A Good Referee". Current Sociology. 64 (5): 691–698. doi:10.1177/0011392116656711. +Hames, Irene (2007). Peer Review and Manuscript Management in Scientific Journals: Guidelines for Good Practice. Oxford, UK: Blackwell Publishing. ISBN 978-1-4051-3159-9. + +== External links == + +What is Peer review? 
at Elsevier +Pre-Submission Peer Review Support Online at Journal Publishers UK \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Peerage_of_Science-0.md b/data/en.wikipedia.org/wiki/Peerage_of_Science-0.md new file mode 100644 index 000000000..b0ab6414d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Peerage_of_Science-0.md @@ -0,0 +1,64 @@ +--- +title: "Peerage of Science" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Peerage_of_Science" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:07.483675+00:00" +instance: "kb-cron" +--- + +Peerage of Science was a scientific peer review service aimed at improving "the current peer review system and make the peer review process more scientific, fair and transparent". The company was founded in 2011 by the scientists Janne Kotiaho, Mikko Mönkkönen, and Janne-Tuomas Seppänen in Jyväskylä, Finland. Initially it focused on the areas of "ecology, evolutionary biology and conservation biology", but within 2 years it expanded to other areas of science. +The service was the winner of the 2012 Award for Publishing Innovation from The Association of Learned and Professional Society Publishers (ALPSP), and of the 2013 recognition award from Communications Professionals of Finnish Universities. +Despite patronage from several paying customers, including the Society for Conservation Biology, Springer Nature, and Taylor & Francis, the company failed to expand its market base (it received only 102 submissions in 2017 and only 60% of submissions received at least one publishing offer) and went out of business soon after. + + +== Differences to traditional peer review == + +The service had several practices that differed from the traditional approaches to academic peer review and submissions. 
+ + +=== Concurrent and shared journal consideration === +PoS implemented a "unified reviewing process", which reduced "the workload of the reviewers’ community, as manuscripts do not need to be reviewed repeatedly while descending the journal prestige ladder." Journals participating in the system's Select option had concurrent access to all peer review processes. Editors were free to make publishing offers to authors at any time, and authors were free to choose whether to accept or decline the offers. Journals participating in the system's Connect option had access to a process if authors chose to submit to that journal. In both cases, the same peer reviews were used by several journals, instead of being discarded in the event of rejection from one journal. + + +=== Improving the quality of reviews === +PoS aimed to enhance the quality of reviewing by encouraging non-anonymous review, introducing ‘peer review of peer review’, providing the possibility for reviewers to publish their review as a ‘Peerage Essay’ (PE) and to build a ‘referee factor’. + + +=== Attracting reviewers === +The motivation to participate as a peer reviewer in this system came from a reputation system where the quality of the reviewing was scored by other users, and contributed to one's profile. Evaluation of other peer reviewers was an additional task for participating academics, but most appeared to be eager to do this: while other stages were completed typically just before a deadline, the judging task was on average completed in just a few days. Also, PoS had rules, that "peers can only submit manuscripts when they keep in balance the number of reviews performed and the number of manuscripts submitted". + + +=== Open engagement === +Instead of reviewers being appointed by an editor, reviewers in this system chose what they wanted to review. 
Thus, the process could terminate at the first deadline if there were no willing peer reviewers, or it could attract many more reviewers than the standard two. Any user, including the authors themselves, could recommend a reviewer for a manuscript. However, peers from the same institutions as authors, and peers who had co-authored with authors in the previous three years, were automatically excluded and could not peer review the manuscript. + + +=== Author control over deadlines === +Upon uploading their manuscript to the system, authors could specify four deadlines: + +Deadline for sending peer reviews +Deadline for peer-review-of-peer-review, the reciprocal judging of the accuracy of peer reviews +Deadline for sending the revised manuscript +Deadline for final evaluation of the revised manuscript +During the process, the deadlines were automatically enforced. + + +== Business model == +The company's services were free for scientists and it did not pay peer reviewers. Publishers of participating journals paid for usage of the service. Peerage of Science had such contracts with, for example, Springer, Taylor & Francis, BioMed Central, Elsevier and Brill. 
+ + +== See also == +F1000 (publisher) +Frontiers Media +PeerJ +Publons +PubPeer +Rubriq + + +== References == + + +== External links == +peerageofscience.org \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Phases_of_clinical_research-0.md b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-0.md new file mode 100644 index 000000000..f9f6698f0 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-0.md @@ -0,0 +1,31 @@ +--- +title: "Phases of clinical research" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Phases_of_clinical_research" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:26.696563+00:00" +instance: "kb-cron" +--- + +The phases of clinical research are the stages in which scientists conduct experiments with a health intervention to obtain sufficient evidence for a process considered effective as a medical treatment. For drug development, the clinical phases start with testing for drug safety in a few human subjects, then expand to many study participants (potentially tens of thousands) to determine if the treatment is effective. Clinical research is conducted on drug candidates, vaccine candidates, new medical devices, and new diagnostic assays. + +== Description == +Clinical trials testing potential medical products are commonly classified into four phases. The drug development process will normally proceed through all four phases over many years. When expressed specifically, a clinical trial phase is capitalized both in name and Roman numeral, such as "Phase I" clinical trial. +If the drug successfully passes through Phases I, II, and III, it will usually be approved by the national regulatory authority for use in the general population. Phase IV trials are 'post-marketing' or 'surveillance' studies conducted to monitor safety over several years. 
+ +== Preclinical studies == + +Before clinical trials are undertaken for a candidate drug, vaccine, medical device, or diagnostic assay, the product candidate is tested extensively in preclinical studies. Such studies involve in vitro (test tube or cell culture) and in vivo (animal model) experiments using wide-ranging doses of the study agent to obtain preliminary efficacy, toxicity and pharmacokinetic information. Such tests assist the developer to decide whether a drug candidate has scientific merit for further development as an investigational new drug. + +== Phase 0 == +Phase 0 is a designation for optional exploratory trials, originally introduced by the United States Food and Drug Administration's (FDA) 2006 Guidance on Exploratory Investigational New Drug (IND) Studies, but now generally adopted as standard practice. Phase 0 trials are also known as human microdosing studies and are designed to speed up the development of promising drugs or imaging agents by establishing very early on whether the drug or agent behaves in human subjects as was expected from preclinical studies. Distinctive features of Phase 0 trials include the administration of single subtherapeutic doses of the study drug to a small number of subjects (10 to 15) to gather preliminary data on the agent's pharmacokinetics (what the body does to the drugs). +A Phase 0 study gives no data on safety or efficacy, being by definition a dose too low to cause any therapeutic effect. Drug development companies carry out Phase 0 studies to rank drug candidates to decide which has the best pharmacokinetic parameters in humans to take forward into further development. They enable go/no-go decisions to be based on relevant human models instead of relying on sometimes inconsistent animal data. 
+ +== Phase I == +Phase I trials were formerly referred to as "first-in-man studies" but the field generally moved to the gender-neutral phrase "first-in-humans" in the 1990s; these trials are the first stage of testing in human subjects. They are designed to test the safety, side effects, best dose, and formulation method for the drug. Phase I trials are not randomized, and thus are vulnerable to selection bias. +Normally, a small group of 20–100 healthy volunteers will be recruited. These trials are often conducted in a clinical trial clinic, where the subject can be observed by full-time staff. These clinical trial clinics are often run by contract research organizations (CROs) that conduct these studies on behalf of pharmaceutical companies or other research investigators. +The subject who receives the drug is usually observed until several half-lives of the drug have passed. This phase is designed to assess the safety (pharmacovigilance), tolerability, pharmacokinetics, and pharmacodynamics of a drug. Phase I trials normally include dose-ranging, also called dose escalation studies, so that the best and safest dose can be found and to discover the point at which a compound is too poisonous to administer. The tested range of doses will usually be a fraction of the dose that caused harm in animal testing. +Phase I trials most often include healthy volunteers. However, there are some circumstances when clinical patients are used, such as patients who have terminal cancer or HIV and the treatment is likely to make healthy individuals ill. These studies are usually conducted in tightly controlled clinics called Central Pharmacological Units, where participants receive 24-hour medical attention and oversight. In addition to the previously mentioned unhealthy individuals, "patients who have typically already tried and failed to improve on the existing standard therapies" may also participate in Phase I trials. 
Volunteers are paid a variable inconvenience fee for their time spent in the volunteer center. +Before beginning a Phase I trial, the sponsor must submit an Investigational New Drug application to the FDA detailing the preliminary data on the drug gathered from cellular models and animal studies. +Phase I trials can be further divided: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Phases_of_clinical_research-1.md b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-1.md new file mode 100644 index 000000000..e34deff73 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-1.md @@ -0,0 +1,35 @@ +--- +title: "Phases of clinical research" +chunk: 2/4 +source: "https://en.wikipedia.org/wiki/Phases_of_clinical_research" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:26.696563+00:00" +instance: "kb-cron" +--- + +=== Phase Ia === +Single ascending dose (Phase Ia): In single ascending dose studies, small groups of subjects are given a single dose of the drug while they are observed and tested for a period of time to confirm safety. Typically, a small number of participants, usually three, are entered sequentially at a particular dose. If they do not exhibit any adverse side effects, and the pharmacokinetic data are roughly in line with predicted safe values, the dose is escalated, and a new group of subjects is then given a higher dose. +If unacceptable toxicity is observed in any of the three participants, an additional number of participants, usually three, are treated at the same dose. This is continued until pre-calculated pharmacokinetic safety levels are reached, or intolerable side effects start showing up (at which point the drug is said to have reached the maximum tolerated dose (MTD)). If an additional unacceptable toxicity is observed, then the dose escalation is terminated and that dose, or perhaps the previous dose, is declared to be the maximally tolerated dose. 
This particular design assumes that the maximally tolerated dose occurs when approximately one-third of the participants experience unacceptable toxicity. Variations of this design exist, but most are similar. + +=== Phase Ib === +Multiple ascending dose (Phase Ib): Multiple ascending dose studies investigate the pharmacokinetics and pharmacodynamics of multiple doses of the drug, looking at safety and tolerability. In these studies, a group of patients receives multiple low doses of the drug, while samples (of blood, and other fluids) are collected at various time points and analyzed to acquire information on how the drug is processed within the body. The dose is subsequently escalated for further groups, up to a predetermined level. + +=== Food effect === +A short trial designed to investigate any differences in absorption of the drug by the body, caused by eating before the drug is given. These studies are usually run as a crossover study, with volunteers being given two identical doses of the drug while fasted, and after being fed. + +== Phase II == +Once a dose or range of doses is determined, the next goal is to evaluate whether the drug has any biological activity or effect. Phase II trials are performed on larger groups (50–300 individuals) and are designed to assess how well the drug works, as well as to continue Phase I safety assessments in a larger group of volunteers and patients. Genetic testing is common, particularly when there is evidence of variation in metabolic rate. When the development process for a new drug fails, this usually occurs during Phase II trials when the drug is discovered not to work as planned, or to have toxic effects. +Phase II studies are sometimes divided into Phase IIa and Phase IIb. There is no formal definition for these two sub-categories, but generally: + +Phase IIa studies are usually pilot studies designed to find an optimal dose and assess safety ('dose finding' studies). 
+Phase IIb studies determine how well the drug works in subjects at a given dose to assess efficacy ('proof of concept' studies). + +=== Trial design === +Some Phase II trials are designed as case series, demonstrating a drug's safety and activity in a selected group of participants. Other Phase II trials are designed as randomized controlled trials, where some patients receive the drug/device and others receive placebo/standard treatment. Randomized Phase II trials have far fewer patients than randomized Phase III trials. + +==== Example: cancer design ==== +In the first stage, the investigator attempts to rule out drugs that have no or little biologic activity. For example, the researcher may specify that a drug must have some minimal level of activity, say, in 20% of participants. If the estimated activity level is less than 20%, the researcher chooses not to consider this drug further, at least not at that maximally tolerated dose. If the estimated activity level exceeds 20%, the researcher will add more participants to get a better estimate of the response rate. A typical study for ruling out a 20% or lower response rate enters 14 participants. If no response is observed in the first 14 participants, the drug is considered not likely to have a 20% or higher activity level. The number of additional participants added depends on the degree of precision desired, but ranges from 10 to 20. Thus, a typical cancer phase II study might include fewer than 30 people to estimate the response rate. + +==== Efficacy vs effectiveness ==== +When a study assesses efficacy, it is looking at whether the drug given in the specific manner described in the study is able to influence an outcome of interest (e.g. tumor size) in the chosen population (e.g. cancer patients with no other ongoing diseases). When a study is assessing effectiveness, it is determining whether a treatment will influence the disease. 
In an effectiveness study, it is essential that participants are treated as they would be when the treatment is prescribed in actual practice. That would mean that there should be no aspects of the study designed to increase compliance above those that would occur in routine clinical practice. The outcomes in effectiveness studies are also more generally applicable than in most efficacy studies (for example does the patient feel better, come to the hospital less or live longer in effectiveness studies as opposed to better test scores or lower cell counts in efficacy studies). There is usually less rigid control of the type of participant to be included in effectiveness studies than in efficacy studies, as the researchers are interested in whether the drug will have a broad effect in the population of patients with the disease. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Phases_of_clinical_research-2.md b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-2.md new file mode 100644 index 000000000..479b384fd --- /dev/null +++ b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-2.md @@ -0,0 +1,35 @@ +--- +title: "Phases of clinical research" +chunk: 3/4 +source: "https://en.wikipedia.org/wiki/Phases_of_clinical_research" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:26.696563+00:00" +instance: "kb-cron" +--- + +=== Failure rate === +Phase I trials historically have experienced the lowest success, having about a 66% failure rate due mainly to adverse effects and other toxicity concerns. A 2022 review found that about 90% of drug candidates fail over the course of Phases I-III, mainly due to absence of therapeutic efficacy, toxicity, non-specific drug properties, poor strategic planning, and recognition that the compound will not succeed commercially. + +== Phase III == +This phase is designed to assess the effectiveness of the new intervention and, thereby, its value in clinical practice. 
Phase III studies are randomized controlled multicenter trials on large patient groups (300–3,000 or more depending upon the disease/medical condition studied) and are aimed at being the definitive assessment of how effective the drug is, in comparison with current 'gold standard' treatment. Because of their size and comparatively long duration, Phase III trials are the most expensive, time-consuming and difficult trials to design and run, especially in therapies for chronic medical conditions. Phase III trials of chronic conditions or diseases often have a short follow-up period for evaluation, relative to the period of time the intervention might be used in practice. +It is common practice that certain Phase III trials will continue while the regulatory submission is pending at the appropriate regulatory agency. This allows patients to continue to receive possibly lifesaving drugs until the drug can be obtained by purchase. Other reasons for performing trials at this stage include attempts by the sponsor at "label expansion" (to show the drug works for additional types of patients/diseases beyond the original use for which the drug was approved for marketing), to obtain additional safety data, or to support marketing claims for the drug. Studies in this phase are by some companies categorized as "Phase IIIB studies." +While not required in all cases, it is typically expected that there be at least two successful Phase III trials, demonstrating a drug's safety and efficacy, to obtain approval from the appropriate regulatory agencies such as FDA (US), or the EMA (European Union). +Once a drug has proved satisfactory after Phase III trials, the trial results are usually combined into a large document containing a comprehensive description of the methods and results of human and animal studies, manufacturing procedures, formulation details, and shelf life. 
This collection of information makes up the "regulatory submission" that is provided for review to the appropriate regulatory authorities in different countries. They will review the submission, and if it is acceptable, give the sponsor approval to market the drug. +Most drugs undergoing Phase III clinical trials can be marketed under FDA norms with proper recommendations and guidelines through a New Drug Application (NDA) containing all manufacturing, preclinical, and clinical data. In case of any adverse effects being reported anywhere, the drugs need to be recalled immediately from the market. While most pharmaceutical companies refrain from this practice, it is not abnormal to see many drugs undergoing Phase III clinical trials in the market. + +=== Adaptive design === +The design of individual trials may be altered during a trial – usually during Phase II or III – to accommodate interim results for the benefit of the treatment, adjust statistical analysis, or to reach early termination of an unsuccessful design, a process called an "adaptive design". Examples are the 2020 World Health Organization Solidarity trial, European Discovery trial, and UK RECOVERY Trial of hospitalized people with severe COVID-19 infection, each of which applies adaptive designs to rapidly alter trial parameters as results from the experimental therapeutic strategies emerge. +Adaptive designs within ongoing Phase II–III clinical trials on candidate therapeutics may shorten trial durations and use fewer subjects, possibly expediting decisions for early termination or success, and coordinating design changes for a specific trial across its international locations. + +=== Failure rate === +Some 90% of drug candidates fail once entering Phase I trials. A 2019 review of average success rates of clinical trials at different phases and diseases over the years 2005–15 found a success range of only 5–14% overall. 
Separated by diseases studied, cancer drug trials were on average only 3% successful, whereas ophthalmology drugs and vaccines for infectious diseases were 33% successful. Trials using disease biomarkers, especially in cancer studies, were more successful than those not using biomarkers. A 2010 review found about 50% of drug candidates either fail during the Phase III trial or are rejected by the national regulatory agency. +For vaccines, the overall probability of success ranges from 7% for non-industry-sponsored candidates to 40% for industry-sponsored candidates. + +== Cost of trials by phases == + +From discovery in the laboratory of a molecule with drug potential through the years-long process establishing an approved drug, the overall cost of development through all the stages of preclinical and clinical research is about $2 billion. +In the early 21st century, a typical Phase I trial conducted at a single clinic in the United States ranged from $1.4 million for pain or anesthesia studies to $6.6 million for immunomodulation studies. Main expense drivers were operating and clinical monitoring costs of the Phase I site. +The amount of money spent on Phase II or III trials depends on numerous factors, with therapeutic area being studied and types of clinical procedures as key drivers. Phase II studies may cost as low as $7 million for cardiovascular projects, and as much as $20 million for hematology trials. +Phase III trials for dermatology may cost as low as $11 million, whereas a pain or anesthesia Phase III trial may cost as much as $53 million. An analysis of Phase III pivotal trials leading to 59 drug approvals by the US Food and Drug Administration over 2015–16 showed that the median cost was $19 million, but some trials involving thousands of subjects may cost 100 times more. 
+Across all trial phases, the main expenses for clinical trials were administrative staff (about 20% of the total), clinical procedures (about 19%), and clinical monitoring of the subjects (about 11%). \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Phases_of_clinical_research-3.md b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-3.md new file mode 100644 index 000000000..4c0a7590b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Phases_of_clinical_research-3.md @@ -0,0 +1,20 @@ +--- +title: "Phases of clinical research" +chunk: 4/4 +source: "https://en.wikipedia.org/wiki/Phases_of_clinical_research" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:26.696563+00:00" +instance: "kb-cron" +--- + +== Phase IV == +A Phase IV trial is also known as a postmarketing surveillance trial or drug monitoring trial to assure long-term safety and effectiveness of the drug, vaccine, device or diagnostic test. Phase IV trials involve the safety surveillance (pharmacovigilance) and ongoing technical support of a drug after it receives regulatory approval to be sold. Phase IV studies may be required by regulatory authorities or may be undertaken by the sponsoring company for competitive (finding a new market for the drug) or other reasons (for example, the drug may not have been tested for interactions with other drugs, or on certain population groups such as pregnant women, who are unlikely to subject themselves to trials). The safety surveillance is designed to detect any rare or long-term adverse effects over a much larger patient population and longer time period than was possible during the Phase I-III clinical trials. Harmful effects discovered by Phase IV trials may result in a drug being withdrawn from the market or restricted to certain uses; examples include cerivastatin (brand names Baycol and Lipobay), troglitazone (Rezulin) and rofecoxib (Vioxx). 
+ +== Overall cost == +The entire process of developing a drug from preclinical research to marketing can take approximately 12 to 18 years and may cost about $2 billion. + +== See also == +Lists of investigational drugs + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/PhysicsOverflow-0.md b/data/en.wikipedia.org/wiki/PhysicsOverflow-0.md new file mode 100644 index 000000000..73f3b4b9b --- /dev/null +++ b/data/en.wikipedia.org/wiki/PhysicsOverflow-0.md @@ -0,0 +1,49 @@ +--- +title: "PhysicsOverflow" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/PhysicsOverflow" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:08.636905+00:00" +instance: "kb-cron" +--- + +PhysicsOverflow is a physics website that serves as a post-publication open peer review platform for research papers in physics, as well as a collaborative blog and online community of physicists. It allows users to ask, answer and comment on graduate-level physics questions, post and review manuscripts from ArXiv (which lists PhysicsOverflow discussion pages among its trackbacks) and other sources, and vote on both forms of content. +In addition to the two primary forms of content, the PhysicsOverflow community also welcomes discussions on unsolved problems, and hosts a chat section for discussions on topics generally of interest to physicists and students of physics, such as those related to recent events in physics, physics academia, and the publishing process. + + +== History == +PhysicsOverflow was started in April 2014 as a physics-equivalent of MathOverflow by Rahel Knöpfel, a physics PhD at the University of Rostock, high-school student Abhimanyu Pallavi Sudhir, and Roger Cattin, a retired professor of computer science at the University of Applied Sciences, Switzerland. 
The site was initially a mere question-and-answer forum, as it was started by users dissatisfied by the policies of the Physics Stack Exchange, but it was eventually expanded to include a Reviews section in October 2014. + + +== Moderation practices == +PhysicsOverflow is well-known for its liberal moderation policy and hesitation to block contributors except for spam, as reflected in the website's bill of "user rights". The content is largely community-moderated, much like MathOverflow, although exceptions have been recorded. +Although the site's moderation policy is publicly available as part of the moderator manual, the site has been criticised for the excessive dispersion of policy-related material, such as the FAQ, the Bill of Rights, the moderator list and the Community Moderation threads, leading to reduced transparency. In response, the site's administrators posted a bulletin of all moderation-related content on the site on the homepage. + + +== Technical details == + +PhysicsOverflow runs Question2Answer, an open-source Q&A software, with a custom theme and several plugins and patches. Some of its plugins have been used by other Question2Answer websites, such as the Open Science Q&A and the Physics Problems Q&A. + + +== Usage == +Quantcast records around 3000 monthly visitors and between 20,000 and 50,000 global page views to PhysicsOverflow every month, over half of whom are located in four countries: the United States (26.8%), India (9.2%), the United Kingdom (8.5%), and Germany (6.4%). However, according to PhysicsOverflow's own data, only around 1500 users actually contribute content to the site, and 440 are active at a given point in time. + + +== Recognition == +The creation of PhysicsOverflow was well-received by the MathOverflow community. PhysicsOverflow was also featured at the 5th Offtopicarium and World Scientific's Asia-Pacific Physics News Letter. 
+ +John Baez suggested the website as a platform for discussing research-level physics questions. +Greg Bernhardt, the founder of Physics Forums, acknowledged the site as a "very interesting development for the physics discussion communities". +Arnold Neumaier, a professor at the University of Vienna, employs PhysicsOverflow as the platform for discussion about his Theoretical Physics FAQ. +String theorist Lubos Motl referred to the website as a "very promising competition [to Physics Stack Exchange]". +Urs Schreiber publicised the site, claiming it could act as a catalyst to make physics academia more open like mathematics. + + +== See also == +MathOverflow +Stack Exchange +nLab + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Placebo-controlled_study-0.md b/data/en.wikipedia.org/wiki/Placebo-controlled_study-0.md new file mode 100644 index 000000000..e70f2b4bf --- /dev/null +++ b/data/en.wikipedia.org/wiki/Placebo-controlled_study-0.md @@ -0,0 +1,39 @@ +--- +title: "Placebo-controlled study" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Placebo-controlled_study" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:27.963028+00:00" +instance: "kb-cron" +--- + +Placebo-controlled studies are a way of testing a medical therapy in which, in addition to a group of subjects that receives the treatment to be evaluated, a separate control group receives a sham "placebo" treatment which is specifically designed to have no real effect. Placebos are most commonly used in blinded trials, where subjects do not know whether they are receiving real or placebo treatment. Often, there is also a further "natural history" group that does not receive any treatment at all. +The purpose of the placebo group is to account for the placebo effect, that is, effects from treatment that do not depend on the treatment itself. 
Such factors include knowing one is receiving a treatment, attention from health care professionals, and the expectations of a treatment's effectiveness by those running the research study. Without a placebo group to compare against, it is not possible to know whether the treatment itself had any effect. +Patients frequently show improvement even when given a sham or "fake" treatment. Such intentionally inert placebo treatments can take many forms, such as a pill containing only sugar, or a medical device (such as an ultrasound machine) that is not actually turned on. Also, due to the body's natural healing ability and statistical effects such as regression to the mean, many patients will get better even when given no treatment at all. Thus, the relevant question when assessing a treatment is not "does the treatment work?" but "does the treatment work better than a placebo treatment, or no treatment at all?" More broadly, the aim of a clinical trial is to determine what treatments, delivered in what circumstances, to which patients, in what conditions, are the most effective. +Therefore, the use of placebos is a standard control component of most clinical trials, which attempt to make some sort of quantitative assessment of the efficacy of medicinal drugs or treatments. Such a test or clinical trial is called a placebo-controlled study, and its control is of the negative type. A study whose control is a previously tested treatment, rather than no treatment, is called a positive-control study, because its control is of the positive type. +This close association of placebo effects with randomized controlled trials has a profound impact on how placebo effects are understood and valued in the scientific community. + +== Methodology == + +=== Blinding === + +Blinding is the withholding of information from participants which may influence them in some way until after the experiment is complete. 
Good blinding may reduce or eliminate experimental biases such as confirmation bias, the placebo effect, the observer effect, and others. A blind can be imposed on any participant of an experiment, including subjects, researchers, technicians, data analysts, and evaluators. In some cases, while blinding would be useful, it is impossible or unethical. For example, it is not possible to blind a patient to their treatment in a physical therapy intervention. A good clinical protocol ensures that blinding is as effective as possible within ethical and practical constraints. +During the course of an experiment, a participant becomes unblinded if they deduce or otherwise obtain information that has been masked from them. Unblinding that occurs before the conclusion of a study is a source of experimental error, as the bias that was eliminated by blinding is re-introduced. Unblinding is common in blind experiments, and must be measured and reported. + +=== Natural history groups === +The practice of using an additional natural history group as the trial's so-called "third arm" has emerged, and such trials are conducted using three randomly selected, equally matched trial groups. Reilly wrote: "... it is necessary to remember the adjective 'random' [in the term 'random sample'] should apply to the method of drawing the sample and not to the sample itself." The three groups are: + +The Active drug group (A): who receive the active test drug. +The Placebo drug group (P): who receive a placebo drug that simulates the active drug. +The Natural history group (NH): who receive no treatment of any kind (and whose condition, therefore, is allowed to run its natural course). +The outcomes within each group are observed, and compared with each other, allowing us to measure: + +The efficacy of the active drug's treatment: the difference between A and NH (i.e., A-NH). +The efficacy of the active drug's active ingredient: the difference between A and P (i.e., A-P).
+The magnitude of the placebo response: the difference between P and NH (i.e., P-NH). +It is a matter of interpretation whether the value of P-NH indicates the efficacy of the entire treatment process or the magnitude of the "placebo response". The results of these comparisons then determine whether or not a particular drug is considered efficacious. +Natural history groups yield useful information when separate groups of subjects are used in a parallel or longitudinal study design. In crossover studies, however, where each subject undergoes both treatments in succession, the natural history of the chronic condition under investigation (e.g., progression) is well understood, with the study's duration being chosen such that the condition's intensity will be more or less stable over that duration. (Wang et al. provide the example of late-phase diabetes, whose natural history is long enough that even a crossover study lasting one year is acceptable.) In these circumstances, a natural history group is not expected to yield useful information. + +=== Indexing === +In certain clinical trials of particular drugs, it may happen that the level of the "placebo responses" manifested by the trial's subjects is either considerably higher or considerably lower (in relation to the "active" drug's effects) than one would expect from other trials of similar drugs.
In these cases, with all other things being equal, it is reasonable to conclude that: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Placebo-controlled_study-1.md b/data/en.wikipedia.org/wiki/Placebo-controlled_study-1.md new file mode 100644 index 000000000..7c33bc21d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Placebo-controlled_study-1.md @@ -0,0 +1,47 @@ +--- +title: "Placebo-controlled study" +chunk: 2/4 +source: "https://en.wikipedia.org/wiki/Placebo-controlled_study" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:27.963028+00:00" +instance: "kb-cron" +--- + +the degree to which there is a considerably higher level of "placebo response" than one would expect is an index of the degree to which the drug's active ingredient is not efficacious. +the degree to which there is a considerably lower level of "placebo response" than one would expect is an index of the degree to which the placebo is not simulating the active drug in an appropriate way. +However, in particular cases such as the use of cimetidine to treat ulcers, a significant level of placebo response can also prove to be an index of how much the treatment has been directed at a wrong target. + +== Implementation issues == + +=== Adherence === +The Coronary Drug Project was intended to study the safety and effectiveness of drugs for long-term treatment of coronary heart disease in men. Those in the placebo group who adhered to the placebo treatment (took the placebo regularly as instructed) showed nearly half the mortality rate of those who were not adherent. +A similar study of women found that survival was nearly 2.5 times greater for those who adhered to their placebo. This apparent placebo effect may have occurred because: + +Adhering to the protocol had a psychological effect, i.e. a genuine placebo effect. +People who were already healthier were more able or more inclined to follow the protocol.
+Compliant people were more diligent and health-conscious in all aspects of their lives. + +=== Unblinding === + +In some cases, a study participant may deduce or otherwise obtain information that has been masked from them. For example, a patient taking a psychoactive drug may recognize that they are taking a drug. When this occurs, it is called unblinding. This kind of unblinding can be reduced with the use of an active placebo, which is a drug that produces effects similar to the active drug, making it more difficult for patients to determine which group they are in. +An active placebo was used in the Marsh Chapel Experiment, a blinded study in which the experimental group received the psychedelic substance psilocybin while the control group received a large dose of niacin, a substance that produces noticeable physical effects intended to lead the control subjects to believe they had received the psychoactive drug. + +== History == + +=== James Lind and scurvy === +In 1747, James Lind (1716–1794), the ship's doctor on HMS Salisbury, conducted the first clinical trial when he investigated the efficacy of citrus fruit in cases of scurvy. He randomly divided twelve scurvy patients, whose "cases were as similar as I could have them", into six pairs. Each pair was given a different remedy. According to Lind's 1753 Treatise on the Scurvy in Three Parts Containing an Inquiry into the Nature, Causes, and Cure of the Disease, Together with a Critical and Chronological View of what has been Published of the Subject, the remedies were: one quart of cider per day, twenty-five drops of elixir vitriol (sulfuric acid) three times a day, two spoonfuls of vinegar three times a day, a course of sea-water (half a pint every day), two oranges and one lemon each day, and an electuary (a mixture containing garlic, mustard, balsam of Peru, and myrrh).
+He noted that the pair who had been given the oranges and lemons were so restored to health within six days of treatment that one of them returned to duty, and the other was well enough to attend the rest of the sick. + +=== Animal magnetism === +In 1784, the French Royal Commission investigated the existence of animal magnetism, comparing the effects of allegedly "magnetized" water with those of plain water. +It did not examine the practices of Franz Mesmer, but examined the significantly different practices of his associate Charles d'Eslon (1739–1786). + +=== Perkins tractors === +In 1799, John Haygarth investigated the efficacy of medical instruments called "Perkins tractors", by comparing the results from dummy wooden tractors with those from a set of allegedly "active" metal tractors, and published his findings in the book On the Imagination as a Cause & as a Cure of Disorders of the Body. + +=== Flint and placebo active treatment comparison === +In 1863, Austin Flint (1812–1886) conducted the first-ever trial that directly compared the efficacy of a dummy simulator with that of an active treatment, although Flint's examination did not compare the two against each other in the same trial. Even so, this was a significant departure from the (then) customary practice of contrasting the consequences of an active treatment with what Flint described as "the natural history of [an untreated] disease". + +Flint's paper is the first time that the terms "placebo" or "placeboic remedy" were used to refer to a dummy simulator in a clinical trial: "... to secure the moral effect of a remedy given specially for the disease, the patients were placed on the use of a placebo which consisted, in nearly all of the cases, of the tincture of quassia, very largely diluted. This was given regularly, and became well known in my wards as the placeboic remedy for rheumatism." +Flint treated 13 hospital inmates who had rheumatic fever; 11 were "acute", and 2 were "sub-acute".
He then compared the results of his dummy "placeboic remedy" with the already well-understood results of the active treatment. (Flint had previously tested, and reported on, the active treatment's efficacy.) There was no significant difference between the results of the active treatment and his "placeboic remedy" in 12 of the cases in terms of disease duration, duration of convalescence, number of joints affected, and emergence of complications. In the thirteenth case, Flint expressed some doubt whether the particular complications that had emerged (namely, pericarditis, endocarditis, and pneumonia) would have been prevented if that subject had been immediately given the "active treatment". \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Placebo-controlled_study-2.md b/data/en.wikipedia.org/wiki/Placebo-controlled_study-2.md new file mode 100644 index 000000000..384603d55 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Placebo-controlled_study-2.md @@ -0,0 +1,39 @@ +--- +title: "Placebo-controlled study" +chunk: 3/4 +source: "https://en.wikipedia.org/wiki/Placebo-controlled_study" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:27.963028+00:00" +instance: "kb-cron" +--- + +=== Jellinek and headache remedy ingredients === +In 1946, Jellinek was asked to test whether or not a headache drug's overall efficacy would be reduced if certain ingredients were removed. In post-World War II 1946, pharmaceutical chemicals were restricted, and one U.S. headache remedy manufacturer sold a drug composed of three ingredients (a, b, and c), of which chemical b was in particularly short supply. +Jellinek set up a complex trial involving 199 subjects, all of whom had "frequent headaches". The subjects were randomly divided into four test groups. He prepared four test drugs, involving various permutations of the three drug constituents, with a placebo as a scientific control.
The structure of this trial is significant because, in those days, the only time placebos were ever used "was to express the efficacy or non-efficacy of a drug in terms of 'how much better' the drug was than the 'placebo'". (Note that the trial conducted by Austin Flint is an example of such a drug efficacy vs. placebo efficacy trial.) The four test drugs were identical in shape, size, colour and taste: + +Drug A: contained a, b, and c. +Drug B: contained a and c. +Drug C: contained a and b. +Drug D: a 'simulator', contained "ordinary lactate". +Each time a subject had a headache, they took their group's designated test drug, and recorded whether their headache had been relieved (or not). Although "some subjects had only three headaches in the course of a two-week period while others had up to ten attacks in the same period", the data showed a "great consistency" across all subjects. Every two weeks the groups' drugs were changed, so that by the end of eight weeks, all groups had tested all the drugs. The stipulated drug (i.e., A, B, C, or D) was taken as often as necessary over each two-week period, and the two-week sequences for each of the four groups were: + +A, B, C, D +B, A, D, C +C, D, A, B +D, C, B, A. +Over the entire population of 199 subjects, there were 120 "subjects reacting to placebo" and 79 "subjects not reacting to placebo". +On initial analysis, there was no difference between the self-reported "success rates" of Drugs A, B, and C (84%, 80%, and 80% respectively; the "success rate" of the simulating placebo Drug D was 52%), and, from this, it appeared that ingredient b was completely unnecessary. +However, further analysis of the trial demonstrated that ingredient b made a significant contribution to the remedy's efficacy. Examining his data, Jellinek discovered that there was a very significant difference in responses between the 120 placebo-responders and the 79 non-responders.
The 79 non-responders' reports showed that if they were considered as an entirely separate group, there was a significant difference in the "success rates" of Drugs A, B, and C: viz., 88%, 67%, and 77%, respectively. Because this significant difference in relief from the test drugs could only be attributed to the presence or absence of ingredient b, he concluded that ingredient b was essential. +Two conclusions came from this trial: + +Jellinek, having identified 120 "placebo reactors", went on to suppose that all of them may have had either "psychological headaches" (with or without attendant "hypochondriasis") or "true physiological headaches [which were] accessible to suggestion". Thus, according to this view, the degree to which a "placebo response" is present tends to be an index of the psychogenic origins of the condition in question. +The trial indicated that, whilst any given placebo was inert, a responder to that particular placebo may be responding for a wide range of reasons unconnected with the drug's active ingredients; and, from this, it could be important to pre-screen potential test populations, and treat those manifesting a placebo-response as a special group, or remove them altogether from the test population. + +=== MRC and randomized trials === +It used to be thought that the first-ever randomized clinical trial was the trial conducted by the Medical Research Council (MRC) in 1948 into the efficacy of streptomycin in the treatment of pulmonary tuberculosis. In this trial, there were two test groups: + +those "treated by streptomycin and bed-rest", and +those "[treated] by bed-rest alone" (the control group). +What made this trial novel was that the subjects were randomly allocated to their test groups. The practice up to that time was to allocate subjects alternately to each group, based on the order in which they presented for treatment.
This practice could introduce bias, because those admitting each patient knew to which group that patient would be allocated (and so the decision to admit or not admit a specific patient might be influenced by the experimenter's knowledge of the nature of their illness, and their knowledge of the group that patient would occupy). +Recently, an earlier MRC trial of the antibiotic patulin on the course of common colds has been suggested to have been the first randomized trial. Another early and until recently overlooked randomized trial was published on strophanthin in a local Finnish journal in 1946. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Placebo-controlled_study-3.md b/data/en.wikipedia.org/wiki/Placebo-controlled_study-3.md new file mode 100644 index 000000000..a1a3b37ba --- /dev/null +++ b/data/en.wikipedia.org/wiki/Placebo-controlled_study-3.md @@ -0,0 +1,36 @@ +--- +title: "Placebo-controlled study" +chunk: 4/4 +source: "https://en.wikipedia.org/wiki/Placebo-controlled_study" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:27.963028+00:00" +instance: "kb-cron" +--- + +== Declaration of Helsinki == +From the time of the Hippocratic Oath, questions of the ethics of medical practice have been widely discussed, and codes of practice have been gradually developed as a response to advances in scientific medicine. +The Nuremberg Code, which was issued in August 1947 as a consequence of the so-called Doctors' Trial, which examined the human experimentation conducted by Nazi doctors during World War II, offers ten principles for legitimate medical research, including informed consent, absence of coercion, and beneficence towards experiment participants.
+In 1964, the World Medical Association issued the Declaration of Helsinki, which specifically limited its directives to health research by physicians, and emphasized a number of additional conditions in circumstances where "medical research is combined with medical care". +The significant difference between the 1947 Nuremberg Code and the 1964 Declaration of Helsinki is that the first was a set of principles that was suggested to the medical profession by the "Doctors' Trial" judges, whilst the second was imposed by the medical profession upon itself. +Paragraph 29 of the Declaration makes specific mention of placebos: + +29. The benefits, risks, burdens and effectiveness of a new method should be tested against those of the best current prophylactic, diagnostic, and therapeutic methods. This does not exclude the use of placebo, or no treatment, in studies where no proven prophylactic, diagnostic or therapeutic method exists. +In 2002, the World Medical Association issued the following note of clarification: + +Note of clarification on paragraph 29 of the WMA Declaration of Helsinki: The WMA hereby reaffirms its position that extreme care must be taken in making use of a placebo-controlled trial and that in general this methodology should only be used in the absence of existing proven therapy. However, a placebo-controlled trial may be ethically acceptable, even if proven therapy is available, under the following circumstances: +— Where for compelling and scientifically sound methodological reasons its use is necessary to determine the efficacy or safety of a prophylactic, diagnostic or therapeutic method; or +— Where a prophylactic, diagnostic or therapeutic method is being investigated for a minor condition and the patients who receive placebo will not be subject to any additional risk of serious or irreversible harm. +All other provisions of the Declaration of Helsinki must be adhered to, especially the need for appropriate ethical and scientific review.
+In addition to the requirement for informed consent from all drug-trial participants, it is also standard practice to inform all test subjects that they may receive the drug being tested or that they may receive the placebo. + +== Non-drug treatments == +"Talking therapies" (such as hypnotherapy, psychotherapy, counseling, and non-drug psychiatry) are now required to have scientific validation by clinical trial. However, there is controversy over what might or might not be an appropriate placebo for such therapeutic treatments. Furthermore, there are methodological challenges such as blinding the person providing the psychological non-drug intervention. In 2005, the Journal of Clinical Psychology devoted an issue to the topic of "The Placebo Concept in Psychotherapy", which contained a range of contributions to this question. As the abstract of one paper noted: "Unlike within the domain of medicine, in which the logic of placebos is relatively straightforward, the concept of placebo as applied to psychotherapy is fraught with both conceptual and practical problems." + +== See also == + +== References == + +== External links == + +James Lind Library: a source of historical texts on fair tests of treatments in health care. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Plackett–Burman_design-0.md b/data/en.wikipedia.org/wiki/Plackett–Burman_design-0.md new file mode 100644 index 000000000..6288126c7 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Plackett–Burman_design-0.md @@ -0,0 +1,22 @@ +--- +title: "Plackett–Burman design" +chunk: 1/3 +source: "https://en.wikipedia.org/wiki/Plackett–Burman_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:29.193177+00:00" +instance: "kb-cron" +--- + +Plackett–Burman designs are experimental designs presented in 1946 by Robin L. Plackett and J. P. Burman while they were working in the British Ministry of Supply.
+Their goal was to find experimental designs for investigating the dependence of some measured quantity on a number of independent variables (factors), each taking L levels, in such a way as to minimize the variance of the estimates of these dependencies using a limited number of experiments. Interactions between the factors were considered negligible. The solution to this problem is to find an experimental design where each combination of levels for any pair of factors appears the same number of times throughout all the experimental runs (refer to table). A complete factorial design would satisfy this criterion, but the idea was to find smaller designs. + +For the case of two levels (L = 2), Plackett and Burman used the method found in 1933 by Raymond Paley for generating orthogonal matrices whose elements are all either 1 or −1 (Hadamard matrices). Paley's method could be used to find such matrices of size N for most N equal to a multiple of 4. In particular, it worked for all such N up to 100 except N = 92. If N is a power of 2, however, the resulting design is identical to a fractional factorial design, so Plackett–Burman designs are mostly used when N is a multiple of 4 but not a power of 2 (i.e. N = 12, 20, 24, 28, 36 …). If one is trying to estimate fewer than N parameters (including the overall average), then one simply uses a subset of the columns of the matrix. +For the case of more than two levels, Plackett and Burman rediscovered designs that had previously been given by Raj Chandra Bose and K. Kishen at the Indian Statistical Institute. +Plackett and Burman give specifics for designs having a number of experiments equal to the number of levels L to some integer power, for L = 3, 4, 5, or 7. +When interactions between factors are not negligible, they are often confounded in Plackett–Burman designs with the main effects, meaning that the designs do not permit one to distinguish between certain main effects and certain interactions.
This is called confounding. + +== Extended uses == +In 1993, Dennis Lin described a construction method via half-fractions of Plackett–Burman designs, using one column to take half of the rest of the columns. The resulting matrix, minus that column, is a "supersaturated design" for finding significant first-order effects, under the assumption that few exist. +Box–Behnken designs can be made smaller, or very large ones constructed, by replacing the fractional factorials and incomplete blocks traditionally used for plan and seed matrices, respectively, with Plackett–Burmans. For example, a quadratic design for 30 variables requires a 30-column PB plan matrix of zeroes and ones, replacing the ones in each line using PB seed matrices of −1s and +1s (for 15 or 16 variables) wherever a one appears in the plan matrix, creating a 557-run design with values −1, 0, +1 to estimate the 496 parameters of a full quadratic model. Adding axial points allows estimating univariate cubic and quartic effects. +By equating certain columns with parameters to be estimated, Plackett–Burmans can also be used to construct mixed categorical and numerical designs, with interactions or high-order effects, requiring no more than 4 runs more than the number of model parameters to be estimated. Sort by a − 1 columns assigned to categorical variable A and following columns, where A = 1 + int(a·i /(max(i) + 0.00001)), i = row number and a = A's number of values. Next sort on columns assigned to any other categorical variables and following columns, repeating as needed. Such designs, if large, may otherwise be incomputable by standard search techniques like D-optimality. For example, 13 variables averaging 3 values each could have well over a million combinations to search. To estimate the 105 parameters in a quadratic model of 13 variables, one must formally exclude from consideration or compute |X'X| for well over 10^6 C 10^2, i.e. 3^13 C 105, or roughly 10^484 matrices.
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Plackett–Burman_design-1.md b/data/en.wikipedia.org/wiki/Plackett–Burman_design-1.md new file mode 100644 index 000000000..b03c8b884 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Plackett–Burman_design-1.md @@ -0,0 +1,335 @@ +--- +title: "Plackett–Burman design" +chunk: 2/3 +source: "https://en.wikipedia.org/wiki/Plackett–Burman_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:29.193177+00:00" +instance: "kb-cron" +--- + +== 4 to 48 runs, sorted to show half-fractions == +P.B.4 ++ + + ++ – – +– + – +– – + +P.B.8 ++ + + + + + + ++ + – – – – + ++ – + + – – – ++ – – – + + – +– + + – + – – +– + – + – + – +– – + – – + + +– – – + + – + +P.B.12 ++ + + + + + + + + + + ++ + + + – – – + – – – ++ + – – – + – – + – + ++ – + – + + + – – – – ++ – – + – – + – + + – ++ – – – + – – + – + + +– + + – – – + – – + + +– + – + + + – – – + – +– + – – + – + + + – – +– – + + + – – – + – + +– – + – – + – + + + – +– – – + – + + + – – + +P.B.16 ++ + + + + + + + + + + + + + + ++ + + – – – – – – – – + + + + ++ + – + – – – – + + + – – – + ++ + – – + + + + – – – – – – + ++ – + + + – – + + – – + – – – ++ – + – – + + – – + + + – – – ++ – – + – + + – + – – – + + – ++ – – – + – – + – + + – + + – +– + + + – + – + – + – – + – – +– + + – + – + – + – + – + – – +– + – + + – + – – + – + – + – +– + – – – + – + + – + + – + – +– – + + – – + + – – + – – + + +– – + – + + – – + + – – – + + +– – – + + + – – – – + + + – + +– – – – – – + + + + – + + – + +P.B.20 ++ + + + + + + + + + + + + + + + + + + ++ + + – + – + – – – – + + – – + – – + ++ + – + – + – – – – + + – – + – – + + ++ + – + – – – – + + – – + – – + + + – ++ + – – – – + + – – + – – + + + + – – ++ – + + + + – + – + – – – – + + – – – ++ – + – + – – – – + + – – + – – + + + ++ – + – – + + + + – + – + – – – – + – ++ – – + – – + + + + – + – + – – – – + ++ – – – + + – – + – – + + + + – + – – +– + + + + – + – + – – – – + + – – + – +– + + + – + – + – – – 
– + + – – + – + +– + + – – + – – + + + + – + – + – – – +– + – – + + + + – + – + – – – – + + – +– + – – + – – + + + + – + – + – – – + +– – + + – – + – – + + + + – + – + – – +– – + – – – – + + – – + – – + + + + + +– – – + + + + – + – + – – – – + + – + +– – – + + – – + – – + + + + – + – + – +– – – – – + + – – + – – + + + + – + + +P.B.24 ++ + + + + + + + + + + + + + + + + + + + + + + ++ + + – + – + + – – + + – – + – + – – – – – + ++ + + – – + + – – + – + – – – – – + + + + – – ++ + – + + – – + + – – + – + – – – – – + + + – ++ + – + – + + – – + + – – + – + – – – – – + + ++ + – – – – – + + + + – + – + + – – + + – – – ++ – + + – – + – + – – – – – + + + + – + – + – ++ – + – + + – – + + – – + – + – – – – – + + + ++ – + – + – – – – – + + + + – + – + + – – + – ++ – – + + – – + – + – – – – – + + + + – + – + ++ – – + – + – – – – – + + + + – + – + + – – + ++ – – – – + + + + – + – + + – – + + – – + – – +– + + + + – + – + + – – + + – – + – + – – – – +– + + + – + – + + – – + + – – + – + – – – – + +– + + – – + – + – – – – – + + + + – + – + + – +– + – + – – – – – + + + + – + – + + – – + + – +– + – – + + – – + – + – – – – – + + + + – + + +– + – – + – + – – – – – + + + + – + – + + – + +– – + + + + – + – + + – – + + – – + – + – – – +– – + + – – + + – – + – + – – – – – + + + + + +– – + – – – – – + + + + – + – + + – – + + – + +– – – + + + + – + – + + – – + + – – + – + – – +– – – – + + + + – + – + + – – + + – – + – + – +– – – – – – + + + + – + – + + – – + + – – + + +P.B.28 ++ + + + + + + + + + + + + + + + + + + + + + + + + + – ++ + + – – + + + – + + – – + + + + – – – – – – – – + + ++ + + – – – – – – – – + + + + – – + + + – + + – – + + ++ + – + + – – + + + + – – – – – – – – + + + + – – + + ++ + – + – – + + – – – + – – + + – + – – + – + – + – – ++ + – + – – + – + – + – + + – + – – + + – – – + – – – ++ + – – – + – – + + – + – – + – + – + – + + – + – – – ++ – + + – + – – + + – – – + – – + + – + – – + – + – – ++ – + – + + – + – – + + – – – + – – + + – + – – + – – ++ – + – + – + + – + – – + + – – 
– + – – + + – + – – – ++ – + – + – + – + – + – + – + – + – + – + – + – + – + ++ – – + + + + – – – – – – – – + + + + – – + + + – + + ++ – – + + + – + + – – + + + + – – – – – – – – + + + + ++ – – – – – – – – + + + + – – + + + – + + – – + + + + +– + + + + – – + + + – + + – – + + + + – – – – – – – + +– + + + + – – – – – – – – + + + + – – + + + – + + – + +– + + + – + + – – + + + + – – – – – – – – + + + + – + +– + + – – + + + + – – – – – – – – + + + + – – + + + + +– + – – + + – + – – + – + – + – + + – + – – + + – – – +– + – – + + – – – + – – + + – + – – + – + – + – + + – +– + – – + – + – + – + + – + – – + + – – – + – – + + – +– – + + – + – – + – + – + – + + – + – – + + – – – + – +– – + + – – – + – – + + – + – – + – + – + – + + – + – +– – + – + – + – + + – + – – + + – – – + – – + + – + – +– – – + + + + – – + + + – + + – – + + + + – – – – – + +– – – + – – + + – + – – + – + – + – + + – + – – + + – +– – – – – + + + + – – + + + – + + – – + + + + – – – + +– – – – – – – + + + + – – + + + – + + – – + + + + – + +P.B.32 ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + + + – – – – – – – – – – – – – – – – + + + + + + + + + + + ++ + + – + – – – – – – – – + + + + + + + – – – – – – – + + + + ++ + + – – + + + + + + + + – – – – – – – – – – – – – – + + + + ++ + – + + – – – – + + + + – – – – + + + – – – – + + + – – – + ++ + – + – + + + + – – – – + + + + – – – – – – – + + + – – – + ++ + – – + + + + + – – – – – – – – + + + + + + + – – – – – – + ++ + – – – – – – – + + + + + + + + – – – + + + + – – – – – – + ++ – + + + + – – + + – – + + – – + + – – + – – + + – – + – – – ++ – + + – – + + – – + + – – + + – – + + + – – + + – – + – – – ++ – + – + – + + – – + + – + – – + + – – – + + – – + + + – – – ++ – + – – + – – + + – – + – + + – – + + – + + – – + + + – – – ++ – – + + – + + – + – – + – + + – + – – – + + – + – – – + + – ++ – – + – + – – + – + + – + – – + – + + – + + – + – – – + + – ++ – – – + + – – + – + + – – + + – + – – + – – + – + + – + + – ++ – – – – – + + – + – – + + – – + 
– + + + – – + – + + – + + – +– + + + + – + – + – + – + – + – + – + – – + – + – + – – + – – +– + + + – + – + – + – + – + – + – + – + – + – + – + – – + – – +– + + – + + – + – + – + – – + – + – + – + – + – + – + – + – – +– + + – – – + – + – + – + + – + – + – + + – + – + – + – + – – +– + – + + + – + – – + – + + – + – – + – + – + – – + – + – + – +– + – + – – + – + + – + – – + – + + – + + – + – – + – + – + – +– + – – + – + – + + – + – + – + – – + – – + – + + – + + – + – +– + – – – + – + – – + – + – + – + + – + – + – + + – + + – + – +– – + + + – – + + – – + + – – + + – – + – – + + – – + – – + + +– – + + – + + – – + + – – + + – – + + – – – + + – – + – – + + +– – + – + + + – – + + – – – – + + – – + + + – – + + – – – + + +– – + – – – – + + – – + + + + – – + + – + + – – + + – – – + + +– – – + + + + – – – – + + + + – – – – + + + – – – – + + + – + +– – – + – – – + + + + – – – – + + + + – + + – – – – + + + – + +– – – – + – – + + + + – – + + – – – – + – – + + + + – + + – + +– – – – – + + – – – – + + – – + + + + – – – + + + + – + + – + +P.B.36 ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + – ++ + + + – + + + + – – + + – – – – – – + + + + – – – – – – + + – – + + ++ + + – – + + – – – – – – + + + + – – – – – – + + – – + + + + + – + + ++ + + – – – – – – + + – – + + + + + – + + + + – – + + – – – – – – + + ++ + – + + + + – – + + – – – – – – + + + + – – – – – – + + – – + + + + ++ + – + – – + – + – + + – – + + – + – – – + – + – – + + – – + – + – – ++ + – + – – – + – + – – + + – – + – + – + + – + – – + – + – + + – – – ++ + – – + + – + – – – + – + – – + + – – + – + – + + – + – – + – + – – ++ + – – + – + – + + – + – – + – + – + + – – + + – + – – – + – + – – – ++ – + + – + – – + – + – + + – – + + – + – – – + – + – – + + – – + – – ++ – + + – – + + – + – – – + – + – – + + – – + – + – + + – + – – + – – ++ – + – + + – + – – + – + – + + – – + + – + – – – + – + – – + + – – – ++ – + – + + – – + + – + – – – + – + – – + + – – + – + – + + – + – – – ++ – + – + – + – + – + – + – + – 
+ – + – + – + – + – + – + – + – + – + ++ – – + + + + + – + + + + – – + + – – – – – – + + + + – – – – – – + + ++ – – + + – – – – – – + + + + – – – – – – + + – – + + + + + – + + + + ++ – – – – – – + + + + – – – – – – + + – – + + + + + – + + + + – – + + ++ – – – – – – + + – – + + + + + – + + + + – – + + – – – – – – + + + + +– + + + + + – + + + + – – + + – – – – – – + + + + – – – – – – + + – + +– + + + + – – + + – – – – – – + + + + – – – – – – + + – – + + + + + + +– + + + + – – – – – – + + – – + + + + + – + + + + – – + + – – – – – + +– + + – – + + + + + – + + + + – – + + – – – – – – + + + + – – – – – + +– + + – – – – – – + + + + – – – – – – + + – – + + + + + – + + + + – + +– + – + – – + + – – + – + – + + – + – – + – + – + + – – + + – + – – – +– + – – + + – – + – + – + + – + – – + – + – + + – – + + – + – – – + – +– + – – + – + – + + – – + + – + – – – + – + – – + + – – + – + – + + – +– + – – – + – + – – + + – – + – + – + + – + – – + – + – + + – – + + – +– – + + – + – – – + – + – – + + – – + – + – + + – + – – + – + – + + – +– – + + – – + – + – + + – + – – + – + – + + – – + + – + – – – + – + – +– – + – + – + + – + – – + – + – + + – – + + – + – – – + – + – – + + – +– – + – + – + + – – + + – + – – – + – + – – + + – – + – + – + + – + – +– – – + + + + – – – – – – + + – – + + + + + – + + + + – – + + – – – + +– – – + + – – + + + + + – + + + + – – + + – – – – – – + + + + – – – + +– – – + – + – – + + – – + – + – + + – + – – + – + – + + – – + + – + – +– – – – – + + + + – – – – – – + + – – + + + + + – + + + + – – + + – + +– – – – – + + – – + + + + + – + + + + – – + + – – – – – – + + + + – + +P.B.40 ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + – ++ + + + + + + – – + + – – + + – – – – – – – – + + + + – – – – + + – – – – + – ++ + + + + – – + + – – + + – – – – – – – – + + + + – – – – + + – – – – + + + – ++ + + – – + + – – + + – – – – – – – – + + + + – – – – + + – – – – + + + + + – ++ + + – – – – + + – – – – + + + + + + + + – – + + – – + + – – – – – – – – 
+ – ++ + – + – + – + – – + + – – + + – – + – + – + – + + – + – – + – + + – – + – + ++ + – + – – + – + + – – + – + + – + – + – + – – + + – – + + – – + – + – + – + ++ + – – + + – – + – + – + – + + – + – – + – + + – – + – + + – + – + – + – – + ++ + – – + – + + – + – + – + – – + + – – + + – – + – + – + – + + – + – – + – + ++ + – – + – + – + – + + – + – – + – + + – – + – + + – + – + – + – – + + – – + ++ – + + – + – + – + – – + + – – + + – – + – + – + – + + – + – – + – + + – – + ++ – + + – + – – + – + + – – + – + + – + – + – + – – + + – – + + – – + – + – + ++ – + + – – + – + + – + – + – + – – + + – – + + – – + – + – + – + + – + – – + ++ – + – + + – + – – + – + + – – + – + + – + – + – + – – + + – – + + – – + – + ++ – + – + – + + – + – – + – + + – – + – + + – + – + – + – – + + – – + + – – + ++ – – + + – – + + – – – – – – – – + + + + – – – – + + – – – – + + + + + + + – ++ – – + + – – – – – – – – + + + + – – – – + + – – – – + + + + + + + + – – + – ++ – – – – + + + + + + + + – – + + – – + + – – – – – – – – + + + + – – – – + – ++ – – – – + + – – – – + + + + + + + + – – + + – – + + – – – – – – – – + + + – ++ – – – – – – – – + + + + – – – – + + – – – – + + + + + + + + – – + + – – + – +– + + + + + + + + – – + + – – + + – – – – – – – – + + + + – – – – + + – – – – +– + + + + – – – – + + – – – – + + + + + + + + – – + + – – + + – – – – – – – – +– + + – – + + – – – – – – – – + + + + – – – – + + – – – – + + + + + + + + – – +– + + – – – – + + + + + + + + – – + + – – + + – – – – – – – – + + + + – – – – +– + + – – – – – – – – + + + + – – – – + + – – – – + + + + + + + + – – + + – – +– + – + – + – + – + – + – + – + – + – + – + – + – + – + – + – + – + – + – + + +– + – + – + – – + + – – + + – – + – + – + – + + – + – – + – + + – – + – + + + +– + – + – – + + – – + + – – + – + – + – + + – + – – + – + + – – + – + + – + + +– + – – + + – – + + – – + – + – + – + + – + – – + – + + – – + – + + – + – + + +– + – – + – + + – – + – + + – + – + – + – – + + – – + + – – + – + – + – + + + +– – + + – – + + – – 
+ – + – + – + + – + – – + – + + – – + – + + – + – + – + + +– – + + – – + – + – + – + + – + – – + – + + – – + – + + – + – + – + – – + + + +– – + – + + – + – + – + – – + + – – + + – – + – + – + – + + – + – – + – + + + +– – + – + + – – + – + + – + – + – + – – + + – – + + – – + – + – + – + + – + + +– – + – + – + – + + – + – – + – + + – – + – + + – + – + – + – – + + – – + + + +– – – + + + + + + + + – – + + – – + + – – – – – – – – + + + + – – – – + + – – +– – – + + + + – – – – + + – – – – + + + + + + + + – – + + – – + + – – – – – – +– – – + + – – – – + + + + + + + + – – + + – – + + – – – – – – – – + + + + – – +– – – – – + + + + – – – – + + – – – – + + + + + + + + – – + + – – + + – – – – +– – – – – – – + + + + – – – – + + – – – – + + + + + + + + – – + + – – + + – – +P.B.44 ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + – ++ + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + ++ + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – – ++ + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – – ++ + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + ++ + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + ++ + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + – ++ + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – – ++ + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + ++ + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + ++ – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + ++ – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – – ++ – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + ++ – 
+ – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – – ++ – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + – ++ – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – – ++ – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + – ++ – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + ++ – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + ++ – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + – ++ – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + +– + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + + +– + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – +– + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – + +– + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – +– + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + + +– + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – +– + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – + +– + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – +– + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – +– + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + + +– + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – +– – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – + +– – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – +– – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + + +– – + – + + + – – – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – +– 
– + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – + +– – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – +– – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – – + – – + + +– – – + – + – – + + + – + + + + + – – – + – + + + – – – – – + – – – + + – + – + + – + +– – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – +– – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – – – – + +– – – – + – – – + + – + – + + – – + – – + – + – – + + + – + + + + + – – – + – + + + – +P.B.48 ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + – ++ + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – – ++ + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + ++ + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + ++ + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + – ++ + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – – ++ + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + ++ + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – – ++ + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – – ++ + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + ++ + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + ++ – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + ++ – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – – ++ – + + – – – + – + – + + – – – – + – – – – – + + + + 
– + + + + – – + – + – + + + – – + – – + ++ – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + – ++ – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + – ++ – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + ++ – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + – ++ – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + ++ – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + ++ – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + – ++ – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – – ++ – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + +– + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – +– + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – + +– + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + + +– + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – +– + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + + +– + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – + +– + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – +– + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – +– + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – +– + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + + +– + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + + +– + – – – – – + + + + – + + + + – – + – + – + + + – – + – 
– + + – + + – – – + – + – + + – – – +– – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – +– – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + + +– – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + + +– – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – + +– – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – + +– – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – +– – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – +– – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – – +– – – + + – + + – – – + – + – + + – – – – + – – – – – + + + + – + + + + – – + – + – + + + – + +– – – + – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – +– – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – – + – +– – – – – – + + + + – + + + + – – + – + – + + + – – + – – + + – + + – – – + – + – + + – – – + \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Plackett–Burman_design-2.md b/data/en.wikipedia.org/wiki/Plackett–Burman_design-2.md new file mode 100644 index 000000000..26511007b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Plackett–Burman_design-2.md @@ -0,0 +1,13 @@ +--- +title: "Plackett–Burman design" +chunk: 3/3 +source: "https://en.wikipedia.org/wiki/Plackett–Burman_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:29.193177+00:00" +instance: "kb-cron" +--- + +== References == + + This article incorporates public domain material from the National Institute of Standards and Technology \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Pocock_boundary-0.md b/data/en.wikipedia.org/wiki/Pocock_boundary-0.md new file mode 100644 
index 000000000..f6bc5fd10 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Pocock_boundary-0.md @@ -0,0 +1,27 @@ +--- +title: "Pocock boundary" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Pocock_boundary" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:30.364390+00:00" +instance: "kb-cron" +--- + +The Pocock boundary is a method for determining whether to stop a clinical trial prematurely. The typical clinical trial compares two groups of patients. One group are given a placebo or conventional treatment, while the other group of patients are given the treatment that is being tested. The investigators running the clinical trial will wish to stop the trial early for ethical reasons if the treatment group clearly shows evidence of benefit. In other words, "when early results proved so promising it was no longer fair to keep patients on the older drugs for comparison, without giving them the opportunity to change." +The concept was introduced by the medical statistician Stuart Pocock in 1977. The many reasons underlying when to stop a clinical trial for benefit were discussed in his editorial from 2005. + + +== Details == +The Pocock boundary gives a p-value threshold for each interim analysis which guides the data monitoring committee on whether to stop the trial. The boundary used depends on the number of interim analyses. + +The Pocock boundary is simple to use in that the p-value threshold is the same at each interim analysis. The disadvantages are that the number of interim analyses must be fixed at the start and it is not possible under this scheme to add analyses after the trial has started. 
Another disadvantage is that investigators and readers frequently do not understand how the p-values are reported: for example, if there are five interim analyses planned, but the trial is stopped after the third interim analysis because the p-value was 0.01, then the overall p-value for the trial is still reported as <0.05 and not as 0.01. + + +== See also == +Haybittle–Peto boundary +O'Brien–Fleming boundary +Interim analysis + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Preregistration_(science)-0.md b/data/en.wikipedia.org/wiki/Preregistration_(science)-0.md index 8be564cda..fc8bd5c42 100644 --- a/data/en.wikipedia.org/wiki/Preregistration_(science)-0.md +++ b/data/en.wikipedia.org/wiki/Preregistration_(science)-0.md @@ -4,7 +4,7 @@ chunk: 1/4 source: "https://en.wikipedia.org/wiki/Preregistration_(science)" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:10.158451+00:00" +date_saved: "2026-05-05T09:53:09.860059+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Preregistration_(science)-1.md b/data/en.wikipedia.org/wiki/Preregistration_(science)-1.md index a48f79ba5..8fd6f5042 100644 --- a/data/en.wikipedia.org/wiki/Preregistration_(science)-1.md +++ b/data/en.wikipedia.org/wiki/Preregistration_(science)-1.md @@ -4,7 +4,7 @@ chunk: 2/4 source: "https://en.wikipedia.org/wiki/Preregistration_(science)" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:10.158451+00:00" +date_saved: "2026-05-05T09:53:09.860059+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Preregistration_(science)-2.md b/data/en.wikipedia.org/wiki/Preregistration_(science)-2.md index 833b62644..f0f850b5c 100644 --- a/data/en.wikipedia.org/wiki/Preregistration_(science)-2.md +++ b/data/en.wikipedia.org/wiki/Preregistration_(science)-2.md @@ -4,7 +4,7 @@ chunk: 3/4 source: "https://en.wikipedia.org/wiki/Preregistration_(science)" category: 
"reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:10.158451+00:00" +date_saved: "2026-05-05T09:53:09.860059+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Preregistration_(science)-3.md b/data/en.wikipedia.org/wiki/Preregistration_(science)-3.md index 0f1009679..29be0eed6 100644 --- a/data/en.wikipedia.org/wiki/Preregistration_(science)-3.md +++ b/data/en.wikipedia.org/wiki/Preregistration_(science)-3.md @@ -4,7 +4,7 @@ chunk: 4/4 source: "https://en.wikipedia.org/wiki/Preregistration_(science)" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:10.158451+00:00" +date_saved: "2026-05-05T09:53:09.860059+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Provocation_test-0.md b/data/en.wikipedia.org/wiki/Provocation_test-0.md new file mode 100644 index 000000000..97c548a8d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Provocation_test-0.md @@ -0,0 +1,19 @@ +--- +title: "Provocation test" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Provocation_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:31.547405+00:00" +instance: "kb-cron" +--- + +A provocation test, also called a provocation trial or provocation study, is a form of medical clinical trial whereby participants are exposed to either a substance or "thing" that is claimed to provoke a response, or to a sham substance or device that should provoke no response, or a severe exercise as in Erb's test for low serum calcium . An example of a provocation test, performed on an individual, is a skin allergy test. 
+ + +== See also == +Blind experiment +Control group + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Pseudoreplication-0.md b/data/en.wikipedia.org/wiki/Pseudoreplication-0.md new file mode 100644 index 000000000..a9fd26f66 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Pseudoreplication-0.md @@ -0,0 +1,41 @@ +--- +title: "Pseudoreplication" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Pseudoreplication" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:32.728123+00:00" +instance: "kb-cron" +--- + +Pseudoreplication (sometimes unit of analysis error) has many definitions. Pseudoreplication was originally defined in 1984 by Stuart H. Hurlbert as the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or +replicates are not statistically independent. Subsequently, Millar and Anderson identified it as a special case of inadequate specification of random factors where both random and fixed factors are present. It is sometimes narrowly interpreted as an inflation of the number of samples or replicates which are not statistically independent. This definition omits the confounding of unit and treatment effects in a misspecified F-ratio. In practice, incorrect F-ratios for statistical tests of fixed effects often arise from a default F-ratio that is formed over the error rather than the mixed term. +Lazic defined pseudoreplication as a problem of correlated samples (e.g. from longitudinal studies) where correlation is not taken into account when computing the confidence interval for the sample mean. For the effect of serial or temporal correlation also see Markov chain central limit theorem. 
+ +The problem of inadequate specification arises when treatments are assigned to units that are subsampled and the treatment F-ratio in an analysis of variance (ANOVA) table is formed with respect to the residual mean square rather than with respect to the among unit mean square. The F-ratio relative to the within unit mean square is vulnerable to the confounding of treatment and unit effects, especially when the number of experimental units is small (e.g. four tank units, two tanks treated, two not treated, several subsamples per tank). The problem is eliminated by forming the F-ratio relative to the correct mean square in the ANOVA table (tank by treatment MS in the example above), where this is possible. The problem is addressed by the use of mixed models. +Hurlbert reported "pseudoreplication" in 48% of the studies he examined that used inferential statistics. Several studies examining scientific papers published up to 2016 similarly found about half of the papers were suspected of pseudoreplication. When time and resources limit the number of experimental units, and unit effects cannot be eliminated statistically by testing over the unit variance, it is important to use other sources of information to evaluate the degree to which an F-ratio is confounded by unit effects. 
+ + +== Types == +Hurlbert (1984) defined four types of pseudoreplication. + +Simple pseudoreplication (Figure 5a in Hurlbert 1984) occurs when there is one experimental unit per treatment. Inferential statistics cannot separate variability due to treatment from variability due to experimental units when there is only one measurement per unit. +Temporal pseudoreplication (Figure 5c in Hurlbert 1984) occurs when experimental units differ enough in time that temporal effects among units are likely, and treatment effects are correlated with temporal effects. Inferential statistics cannot separate variability due to treatment from variability due to experimental units when there is only one measurement per unit. +Sacrificial pseudoreplication (Figure 5b in Hurlbert 1984) occurs when means within a treatment are used in an analysis, and these means are tested over the within unit variance. In Figure 5b, the erroneous F-ratio will have 1 df in the numerator (treatment) mean square and 4 df in the denominator mean square (2-1 = 1 df for each experimental unit). The correct F-ratio will have 1 df in the numerator (treatment) and 2 df in the denominator (2-1 = 1 df for each treatment). The correct F-ratio controls for effects of experimental units but with 2 df in the denominator it will have little power to detect treatment differences. +Implicit pseudoreplication occurs when standard errors (or confidence limits) are estimated within experimental units. As with other sources of pseudoreplication, treatment effects cannot be statistically separated from effects due to variation among experimental units. 
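The degrees of freedom quoted for sacrificial pseudoreplication can be illustrated with a small sketch. It assumes a balanced design matching the tank example (two treatments, two tanks per treatment, two subsamples per tank). The measurements, the function name `nested_anova`, and the result keys are all hypothetical, chosen only for illustration and not taken from Hurlbert (1984).

```python
from statistics import mean


def nested_anova(data):
    """Sums of squares for a balanced design with subsampled units.

    `data` maps a treatment name to a list of tanks (experimental units);
    each tank is a list of subsample measurements.
    """
    all_values = [v for tanks in data.values() for tank in tanks for v in tank]
    grand_mean = mean(all_values)

    ss_treatment = 0.0  # variation of treatment means around the grand mean
    ss_unit = 0.0       # variation of tank means around their treatment mean
    ss_within = 0.0     # variation of subsamples around their tank mean
    for tanks in data.values():
        treatment_values = [v for tank in tanks for v in tank]
        treatment_mean = mean(treatment_values)
        ss_treatment += len(treatment_values) * (treatment_mean - grand_mean) ** 2
        for tank in tanks:
            tank_mean = mean(tank)
            ss_unit += len(tank) * (tank_mean - treatment_mean) ** 2
            ss_within += sum((v - tank_mean) ** 2 for v in tank)

    df_treatment = len(data) - 1
    df_unit = sum(len(tanks) - 1 for tanks in data.values())
    df_within = sum(len(tank) - 1 for tanks in data.values() for tank in tanks)

    ms_treatment = ss_treatment / df_treatment
    return {
        "df_treatment": df_treatment,
        "df_unit": df_unit,
        "df_within": df_within,
        # Erroneous ratio: treatment tested over the within-unit mean square.
        "f_pseudoreplicated": ms_treatment / (ss_within / df_within),
        # Correct ratio: treatment tested over the among-unit mean square.
        "f_correct": ms_treatment / (ss_unit / df_unit),
    }


# Made-up measurements: tanks differ from each other far more than
# subsamples differ within a tank.
example = {
    "control": [[10.1, 10.3], [12.0, 12.2]],
    "treated": [[11.9, 12.1], [14.0, 14.2]],
}
result = nested_anova(example)
print(result)
```

With these made-up numbers the erroneous ratio has 1 and 4 df and is far larger than the correct ratio, which has 1 and 2 df. Subsamples within a tank agree closely, so the within-unit mean square understates the variation among tanks. Turning either ratio into a p-value would need an F distribution from a statistics library, which this sketch leaves out.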
+ + +== See also == +Replication (statistics) +Resampling (statistics) + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/PubPeer-0.md b/data/en.wikipedia.org/wiki/PubPeer-0.md index 938968655..a40e1a08e 100644 --- a/data/en.wikipedia.org/wiki/PubPeer-0.md +++ b/data/en.wikipedia.org/wiki/PubPeer-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/PubPeer" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:32:53.010068+00:00" +date_saved: "2026-05-05T09:53:12.208470+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Publons-0.md b/data/en.wikipedia.org/wiki/Publons-0.md new file mode 100644 index 000000000..e5c504956 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Publons-0.md @@ -0,0 +1,50 @@ +--- +title: "Publons" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Publons" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:10.986861+00:00" +instance: "kb-cron" +--- + +Publons was a commercial website that provided a free service for academics to track, verify, and showcase their peer review and editorial contributions for academic journals. It was launched in 2012, bought by Clarivate in 2017 and merged into Web of Science in 2022. It claimed that over 3,000,000 researchers joined the site, adding more than one million reviews across 25,000 journals. In 2019, ResearcherID was integrated with Publons. +Publons produced a verified record of a person's review and editorial activity for journals, which could be downloaded to include in CVs, funding and job applications, and promotion and performance evaluations. Publons' business model was based on partnering with publishers. + + +== Background == +Publons was founded by Andrew Preston and Daniel Johnston to address the static state of peer-reviewing practices in academic research publishing, in view of encouraging collaboration and speeding scientific development. 
The Publons name was an homage to the "publon", the "minimum unit of publishable material". The company was registered in New Zealand and had an office in London, UK. It was acquired by Clarivate Analytics in 2017. + + +== Services == +Publons also provided: + +tools for publishers to find, screen, contact, and motivate peer reviewers; +data and publications about global peer review behaviour; +peer review training for early-career researchers; and +features for academics to discuss and evaluate published research +Reviewers can choose whether or not to make the content of their reviews open access following publication of the reviewed publication, though journals can choose to override this. Review content is shared using a Creative Commons CC BY 4.0 license. Publons has partnerships with many publishers and with related services such as Altmetric and ORCID. + + +== Publons Peer Review Awards == +Publons Peer Review Awards are recognitions for top peer reviewers and editors. Publons' Awards started in 2016. In 2017 an award program called the Sentinel Award was added, for outstanding advocacy, innovation or contribution to scholarly peer review. + + +== Reception == +TechCrunch remarked that lack of transparency leads to many problems in the publication process, and Publons purports to help with that. Research Information noted that while the site supports both pre- and post-publication review, not all reviews are published, in deference to existing publication norms. Nature noted that peer review is an important job, and reported on the reactions of two of Publons's most prolific reviewers. +Publons sends unsolicited bulk email to academics to advertise its service. This was cited by email service providers for being a violation of acceptable use policies. 
+ + +== See also == +Web of Science +Journal club +JournalReview.org +Peerage of Science +PubPeer + + +== References == + + +== External links == +Official website \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Quantitative_Biology_(journal)-0.md b/data/en.wikipedia.org/wiki/Quantitative_Biology_(journal)-0.md new file mode 100644 index 000000000..9e3b5f8d3 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Quantitative_Biology_(journal)-0.md @@ -0,0 +1,28 @@ +--- +title: "Quantitative Biology (journal)" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Quantitative_Biology_(journal)" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:13.415769+00:00" +instance: "kb-cron" +--- + +Quantitative Biology is a quarterly peer-reviewed open-access interdisciplinary scientific journal covering experimental biology, theoretical biology, computational biology, bioinformatics, systems biology, synthetic biology, and digital and AI biology. +The journal was established in 2013 by Higher Education Press and Tsinghua University. The co-founding editors-in-chief were Michael Q. Zhang and Chao Tang. As of 2019, the editors-in-chief were Chao Tang and Xuegong Zhang. +Before 2020, Quantitative Biology was co-published by Higher Education Press and Springer Nature. As of 2023, it was published by Higher Education Press and Wiley. +According to the Journal Citation Reports and Scopus, the journal had a 2024 impact factor of 1.4, and a 2024 CiteScore of 2.0, respectively. 
+ + +== Abstracting and indexing == +The journal is abstracted and indexed in: + +Emerging Sources Citation Index (ESCI) +Scopus +Biological Abstracts +BIOSIS +Chemical Abstracts Service +Embase + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Quasi-experiment-0.md b/data/en.wikipedia.org/wiki/Quasi-experiment-0.md new file mode 100644 index 000000000..f697bd59d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Quasi-experiment-0.md @@ -0,0 +1,18 @@ +--- +title: "Quasi-experiment" +chunk: 1/3 +source: "https://en.wikipedia.org/wiki/Quasi-experiment" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:33.934268+00:00" +instance: "kb-cron" +--- + +A quasi-experiment is a research design used to estimate the causal impact of an intervention. This research design is aimed at assessing the difference between outcomes (e.g., reading knowledge, depressive symptoms) in a group that experienced an intervention and a group that did not. The intervention is broadly construed such that it could be designed by researchers (e.g., a reading program) or it could be an event affecting a group of people such as disaster (e.g., an earthquake). Quasi-experiments share similarities with experiments and randomized controlled trials, but specifically lack random assignment to intervention and control conditions. Instead, quasi-experimental designs typically compare groups that are either preexisting (e.g., whether someone was exposed to COVID-19) or groups that were created without random assignment (e.g., students attending schools with different reading programs). +The causal analysis of quasi-experiments depends on assumptions that render non-randomness irrelevant, for example, the parallel trends assumption for difference-in-differences (DiD) method. Quasi-experiments are subject to concerns about internal validity. 
In other words, it may be difficult to convincingly demonstrate a causal link between the treatment and control conditions on the observed outcomes in quasi-experimental designs. This is particularly true if there are confounding variables that cannot be controlled or accounted for. A major impediment to drawing causal conclusions in quasi-experiments is the possibility that unmeasured variables could have contributed to treatment–control group differences in the outcomes, which is less a problem in true experiments because randomization largely equates the intervention and control groups on both measured and unmeasured background factors, particularly when there are large numbers of experimental units to randomize. The experimental units most commonly used in quasi-experimental research are humans although sometimes organizational units such as schools and divisions in an organization are employed. + +== Design == +The first part of creating a quasi-experimental design is to identify the variables. The quasi-independent variable is the variable that is manipulated in order to affect a dependent variable. In a time series analysis, the dependent variable is observed over time for any changes that may take place. One or more covariates are usually included in analyses, ideally variables that predict both group membership and the outcome. These are additional variables that are often used to address confounding, for example, by means of statistical adjustment or matching on an influential covariate. Once the variables have been identified and defined, experimental and control treatment procedures should then be implemented (assuming researchers are not studying nonmanipulable events like natural disasters) and group differences should be assessed. +In an experiment with random assignment, study units would have approximately the same probability of being assigned to the treatment and control conditions. 
As such, an aim of random assignment is to make the experimental and control groups as equivalent as possible by distributing known and unknown confounds evenly across treatments. In contrast, in a quasi-experimental design, assignment to a given treatment condition is based on something other than random assignment. Depending on the type of quasi-experimental design, the researcher might have control over assignment to the treatment condition but use some criteria other than random assignment (e.g., a cutoff score on a reading test) to determine which participants are placed in the treatment and control conditions. Factors such as cost, feasibility, political concerns, or convenience may influence how or whether participants are assigned to a given treatment condition. Consequently, quasi-experiments, compared to true experiments, are subject to more concerns regarding internal validity. +Many quasi-experiments use "pre-post testing". This means that participants are assessed before they are exposed to the treatment and control conditions and are tested again after the treatment and control conditions have been implemented. Pre-testing helps researchers identify confounds such as members of one group being heavier than the other in a diet study. If there are group differences on an important covariate (e.g., weight, BMI) measured at the pre-test stage, the researcher can statistically control for that covariate after the post-test data have been collected. +There are several types of quasi-experimental designs, each with different strengths, weaknesses and applications.
These designs include (but are not limited to): \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Quasi-experiment-1.md b/data/en.wikipedia.org/wiki/Quasi-experiment-1.md new file mode 100644 index 000000000..a22a4038e --- /dev/null +++ b/data/en.wikipedia.org/wiki/Quasi-experiment-1.md @@ -0,0 +1,41 @@ +--- +title: "Quasi-experiment" +chunk: 2/3 +source: "https://en.wikipedia.org/wiki/Quasi-experiment" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:33.934268+00:00" +instance: "kb-cron" +--- + +Difference in differences (pre-post with-without comparison) +Event study +Nonequivalent control groups design +no-treatment control group designs +nonequivalent dependent variables designs +removed treatment group designs +repeated treatment designs +reversed treatment nonequivalent control groups designs +cohort designs +post-test only designs +regression continuity designs +Regression discontinuity design +Case-control design +time-series designs +multiple time series design +interrupted time series design +instrumental variables +Panel analysis +Of all of these designs, the regression discontinuity design comes the closest to the experimental design, as the experimenter maintains control of the treatment assignment and it is known to "yield an unbiased estimate of the treatment effects". It does, however, require large numbers of study participants and precise modeling of the functional form between the assignment and the outcome variable, in order to yield the same power as a traditional experimental design. +Though quasi-experiments are sometimes shunned by those who consider themselves to be experimental purists (leading Donald T. Campbell to coin the term "queasy experiments" for them), they can be useful in areas where it is not feasible or desirable to conduct an experiment or randomized control trial. 
Such instances include evaluating the impact of public policy changes, educational interventions or large-scale health interventions. The primary drawback of quasi-experimental designs is that they cannot eliminate the possibility of confounding bias, which can hinder one's ability to draw causal inferences. This drawback is often used as an excuse to discount quasi-experimental results. However, such bias can be controlled for by using various statistical techniques such as multiple regression, if one can identify and measure the confounding variable(s). Such techniques can be used to model and partial out the effects of confounding variables, thereby improving the accuracy of the results obtained from quasi-experiments. Moreover, the developing use of propensity score matching to match participants on variables important to the treatment selection process can also improve the accuracy of quasi-experimental results. +In fact, data derived from quasi-experimental analyses has been shown to closely match experimental data in certain cases, even when different criteria were used. In sum, quasi-experiments are a valuable tool, especially for the applied researcher. On their own, quasi-experimental designs do not allow one to make definitive causal inferences; however, they provide necessary and valuable information that cannot be obtained by experimental methods alone. Shadish et al. (2002) suggested that researchers, especially those interested in investigating applied research questions, move beyond the traditional experimental design and avail themselves of the possibilities inherent in quasi-experimental designs. + +== Ethics == +A true experiment would, for example, randomly assign children to a scholarship, in order to control for all other variables.
Quasi-experiments are commonly used in social sciences, public health, education, and policy analysis, especially when it is not practical or reasonable to randomize study participants to the treatment condition. +As an example, suppose we divide households into two categories: Households in which the parents spank their children, and households in which the parents do not spank their children. We can run a linear regression to determine if there is a positive correlation between parents' spanking and their children's aggressive behavior. However, to simply randomize parents to spanking or not spanking categories may not be practical or ethical, because some parents may believe it is morally wrong to spank their children and refuse to participate. +Some authors distinguish between a natural experiment and a "quasi-experiment". A natural experiment may approximate random assignment, or involve real randomization not by the experimenters or for the experiment. A quasi-experiment generally does not involve actual randomization. +Quasi-experiments have outcome measures, treatments, and experimental units, but do not use random assignment. Quasi-experiments are often the design that most people choose over true experiments. It is usually easily conducted as opposed to true experiments, because they bring in features from both experimental and non-experimental designs. Measured variables as well as manipulated variables can be brought in. Usually quasi-experiments are chosen by experimenters because they maximize internal and external validity. + +== Advantages == +Since quasi-experimental designs are used when randomization is impractical and/or unethical, they are typically easier to set up than true experimental designs, which require random assignment of subjects. 
Additionally, utilizing quasi-experimental designs minimizes threats to ecological validity as natural environments do not suffer the same problems of artificiality as compared to a well-controlled laboratory setting. Since quasi-experiments are natural experiments, findings in one may be applied to other subjects and settings, allowing for some generalizations to be made about population. Also, this experimentation method is efficient in longitudinal research that involves longer time periods which can be followed up in different environments. +Other advantages of quasi experiments include the idea of having any manipulations the experimenter so chooses. In natural experiments, the researchers have to let manipulations occur on their own and have no control over them whatsoever. Also, using self selected groups in quasi experiments also takes away the chance of ethical, conditional, etc. concerns while conducting the study. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Quasi-experiment-2.md b/data/en.wikipedia.org/wiki/Quasi-experiment-2.md new file mode 100644 index 000000000..0ee757add --- /dev/null +++ b/data/en.wikipedia.org/wiki/Quasi-experiment-2.md @@ -0,0 +1,29 @@ +--- +title: "Quasi-experiment" +chunk: 3/3 +source: "https://en.wikipedia.org/wiki/Quasi-experiment" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:33.934268+00:00" +instance: "kb-cron" +--- + +== Disadvantages == +Quasi-experimental estimates of impact are subject to contamination by confounding variables. In the example above, a variation in the children's response to spanking is plausibly influenced by factors that cannot be easily measured and controlled, for example the child's intrinsic wildness or the parent's irritability. The lack of random assignment in the quasi-experimental design method may allow studies to be more feasible, but this also poses many challenges for the investigator in terms of internal validity. 
This deficiency in randomization makes it harder to rule out confounding variables and introduces new threats to internal validity. Because randomization is absent, some knowledge about the data can be approximated, but conclusions of causal relationships are difficult to determine due to a variety of extraneous and confounding variables that exist in a social environment. Moreover, even if these threats to internal validity are assessed, causation still cannot be fully established because the experimenter does not have total control over extraneous variables. +Another disadvantage is that the study groups may provide weaker evidence because of the lack of randomness. Randomness brings a lot of useful information to a study because it broadens results and therefore gives a better representation of the population as a whole. Using unequal groups can also be a threat to internal validity. If groups are not equal, which is sometimes the case in quasi-experiments, then the experimenter might not be able to determine the causes of the results with confidence. + +== Internal validity == +Internal validity is the approximate truth about inferences regarding cause-effect or causal relationships. This is why validity is important for quasi-experiments because they are all about causal relationships. It occurs when the experimenter tries to control all variables that could affect the results of the experiment. Statistical regression, history and the participants are all possible threats to internal validity. The question you would want to ask while trying to keep internal validity high is "Are there any other possible reasons for the outcome besides the reason I want it to be?" If so, then internal validity might not be as strong. + +== External validity == +External validity is the extent to which the results obtained from a study sample can be generalized "to" some well-specified population of interest, and "across" subpopulations of people, times, contexts, and methods of study.
Lynch has argued that generalizing "to" a population is almost never possible because the populations to which we would like to project are measures of future behavior, which by definition cannot be sampled. Therefore, the more relevant question is whether treatment effects generalize "across" subpopulations that vary on background factors that might not be salient to the researcher. External validity depends on whether the treatments studied have homogeneous effects across different subsets of people, times, contexts, and methods of study or whether the sign and magnitude of any treatment effects change across subsets in ways that may not be acknowledged or understood by the researchers. Athey and Imbens and Athey and Wager have pioneered machine learning techniques for inductive understanding of heterogeneous treatment effects. + +== Design types == +"Person-by-treatment" designs are the most common type of quasi-experimental design. In this design, the experimenter measures at least one independent variable. Along with measuring one variable, the experimenter will also manipulate a different independent variable. Because there is manipulating and measuring of different independent variables, the research is mostly done in laboratories. An important factor in dealing with person-by-treatment designs is that random assignment will need to be used in order to make sure that the experimenter has complete control over the manipulations that are being done to the study. +An example of this type of design was performed at the University of Notre Dame. The study was conducted to see if being mentored for one's job led to increased job satisfaction. The results showed that many people who did have a mentor showed very high job satisfaction. However, the study also showed that those who did not receive the mentor also had a high number of satisfied employees.
Seibert concluded that although the workers who had mentors were happy, he could not assume that the reason for it was the mentors themselves because of the high number of non-mentored employees who said they were satisfied. This is why prescreening is very important so that you can minimize any flaws in the study before they are seen. +Natural experiments constitute a different type of quasi-experimental design. The design differs from the standard type of quasi-experiment in which the research team assigns pre-existing nonrandomized groups of individuals to intervention/treatment and control conditions. Without randomization, confidence that the groups are equivalent on all relevant background factors is at best very weak. The natural experiment mimics the randomization found in true experiments and the occurrences under study occur naturally. Capitalizing on naturally occurring events, for example, can be helpful in studying the impact of traumatic events on individuals. + +== References == + +== External links == +Quasi-Experimental Design at the Research Methods Knowledge Base \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Random_assignment-0.md b/data/en.wikipedia.org/wiki/Random_assignment-0.md new file mode 100644 index 000000000..2255ee688 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Random_assignment-0.md @@ -0,0 +1,58 @@ +--- +title: "Random assignment" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Random_assignment" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:35.110967+00:00" +instance: "kb-cron" +--- + +Random assignment or random placement is an experimental technique for assigning human participants or animal subjects to different groups in an experiment (e.g., a treatment group versus a control group) using randomization, such as by a chance procedure (e.g., flipping a coin) or a random number generator.
This ensures that each participant or subject has an equal chance of being placed in any group. Random assignment of participants helps to ensure that any differences between and within the groups are not systematic at the outset of the experiment. Thus, any differences between groups recorded at the end of the experiment can be more confidently attributed to the experimental procedures or treatment. +Random assignment, blinding, and controlling are key aspects of the design of experiments because they help ensure that the results are not spurious or deceptive via confounding. This is why randomized controlled trials are vital in clinical research, especially ones that can be double-blinded and placebo-controlled. +Mathematically, there are distinctions between randomization, pseudorandomization, and quasirandomization, as well as between random number generators and pseudorandom number generators. How much these differences matter in experiments (such as clinical trials) is a matter of trial design and statistical rigor, which affect evidence grading. Studies done with pseudo- or quasirandomization are usually given nearly the same weight as those with true randomization but are viewed with a bit more caution. + + +== Benefits of random assignment == +Imagine an experiment in which the participants are not randomly assigned; perhaps the first 10 people to arrive are assigned to the Experimental group, and the last 10 people to arrive are assigned to the Control group. At the end of the experiment, the experimenter finds differences between the Experimental group and the Control group, and claims these differences are a result of the experimental procedure. However, they also may be due to some other preexisting attribute of the participants, e.g. people who arrive early versus people who arrive late. +Imagine the experimenter instead uses a coin flip to randomly assign participants. If the coin lands heads-up, the participant is assigned to the Experimental group. 
If the coin lands tails-up, the participant is assigned to the Control group. At the end of the experiment, the experimenter finds differences between the Experimental group and the Control group. Because each participant had an equal chance of being placed in any group, it is unlikely the differences could be attributable to some other preexisting attribute of the participant, e.g. those who arrived on time versus late. + + +== Potential issues == +Random assignment does not guarantee that the groups are matched or equivalent. The groups may still differ on some preexisting attribute due to chance. The use of random assignment cannot eliminate this possibility, but it greatly reduces it. +To express this same idea statistically - If a randomly assigned group is compared to the mean it may be discovered that they differ, even though they were assigned from the same group. If a test of statistical significance is applied to randomly assigned groups to test the difference between sample means against the null hypothesis that they are equal to the same population mean (i.e., population mean of differences = 0), given the probability distribution, the null hypothesis will sometimes be "rejected," that is, deemed not plausible. That is, the groups will be sufficiently different on the variable tested to conclude statistically that they did not come from the same population, even though, procedurally, they were assigned from the same total group. For example, using random assignment may create an assignment to groups that has 20 blue-eyed people and 5 brown-eyed people in one group. This is a rare event under random assignment, but it could happen, and when it does it might add some doubt to the causal agent in the experimental hypothesis. + + +== Random sampling == +Random sampling is a related, but distinct, process. Random sampling is recruiting participants in a way that they represent a larger population. 
Because most basic statistical tests require the hypothesis of an independent randomly sampled population, random assignment is the desired assignment method because it provides control for all attributes of the members of the samples—in contrast to matching on only one or more variables—and provides the mathematical basis for estimating the likelihood of group equivalence for characteristics one is interested in, both for pretreatment checks on equivalence and the evaluation of post treatment results using inferential statistics. More advanced statistical modeling can be used to adapt the inference to the sampling method. + + +== History == +Randomization was emphasized in the theory of statistical inference of Charles S. Peirce in "Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883). Peirce applied randomization in the Peirce-Jastrow experiment on weight perception. +Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights. +Peirce's experiment inspired other researchers in psychology and education, which developed a research tradition of randomized experiments in laboratories and specialized textbooks in the eighteen-hundreds. +Jerzy Neyman advocated randomization in survey sampling (1934) and in experiments (1923). Ronald A. Fisher advocated randomization in his book on experimental design (1935). + + +== See also == +Asymptotic theory (statistics) + + +== References == + +Caliński, Tadeusz & Kageyama, Sanpei (2000). Block designs: A Randomization approach, Volume I: Analysis. Lecture Notes in Statistics. Vol. 150. New York: Springer-Verlag. ISBN 0-387-98578-6. +Hinkelmann, Klaus; Kempthorne, Oscar (2008). Design and Analysis of Experiments. Vol. I and II (Second ed.). Wiley. ISBN 978-0-470-38551-7. +Hinkelmann, Klaus; Kempthorne, Oscar (2008). Design and Analysis of Experiments, Volume I: Introduction to Experimental Design (Second ed.). 
Wiley. ISBN 978-0-471-72756-9. +Hinkelmann, Klaus; Kempthorne, Oscar (2005). Design and Analysis of Experiments, Volume 2: Advanced Experimental Design (First ed.). Wiley. ISBN 978-0-471-55177-5. +Charles S. Peirce, "Illustrations of the Logic of Science" (1877–1878) +Charles S. Peirce, "A Theory of Probable Inference" (1883) +Charles Sanders Peirce; Joseph Jastrow (1885). "On Small Differences in Sensation". Memoirs of the National Academy of Sciences. 3: 73–83. http://psychclassics.yorku.ca/Peirce/small-diffs.htm +Hacking, Ian (September 1988). "Telepathy: Origins of Randomization in Experimental Design". Isis. 79 (3): 427–451. doi:10.1086/354775. JSTOR 234674. MR 1013489. S2CID 52201011. +Stephen M. Stigler (November 1992). "A Historical View of Statistical Concepts in Psychology and Educational Research". American Journal of Education. 101 (1): 60–70. doi:10.1086/444032. S2CID 143685203. +Trudy Dehue (December 1997). "Deception, Efficiency, and Random Groups: Psychology and the Gradual Origination of the Random Group Design" (PDF). Isis. 88 (4): 653–673. doi:10.1086/383850. PMID 9519574. S2CID 23526321. +Basic Psychology by Gleitman, Fridlund, and Reisberg. +"What statistical testing is, and what it is not," Journal of Experimental Education, 1993, vol 61, pp. 293–316 by Shaver. 
+ + +== External links == +Experimental Random Assignment Tool: Random assignment tool - Experimental \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Randomised_non-comparative_trial-0.md b/data/en.wikipedia.org/wiki/Randomised_non-comparative_trial-0.md new file mode 100644 index 000000000..66063dfe4 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Randomised_non-comparative_trial-0.md @@ -0,0 +1,16 @@ +--- +title: "Randomised non-comparative trial" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Randomised_non-comparative_trial" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:36.297838+00:00" +instance: "kb-cron" +--- + +A randomised non-comparative trial, RNCT (or also non-comparative randomised trial), is a type of clinical trial where participants are randomised to different conditions (arms), but where the primary analysis involves comparing each arm separately to a historical control, predefined benchmark or something else, with no formal comparison between the two arms. +The study design appears to have arisen in oncology, where single-arm studies are not unusual. It promises reduced sample size requirements. An RNCT acts like multiple single-arm designs run concurrently. A review found RNCTs dating back to 2002, and having been used in high-profile oncology studies and also beyond oncology. +The design has been criticised by statisticians. It is unclear what benefit randomisation adds compared to running separate single-arm studies. Pavlos Msaouel described the use of the word "randomisation" as merely being "talismanic" in a 2025 article, it leading readers to think the study has the benefits of a randomised controlled trial (RCT). While the point of an RNCT is not to have a comparison between arms, about half of reported RNCTs still reported some sort of comparison. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Randomization-0.md b/data/en.wikipedia.org/wiki/Randomization-0.md new file mode 100644 index 000000000..c883e1a20 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Randomization-0.md @@ -0,0 +1,33 @@ +--- +title: "Randomization" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Randomization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:37.479553+00:00" +instance: "kb-cron" +--- + +Randomization is a statistical process in which a random mechanism is employed to select a sample from a population or assign subjects to different groups. The process is crucial in ensuring the random allocation of experimental units or treatment protocols, thereby minimizing selection bias and enhancing the statistical validity. It facilitates the objective comparison of treatment effects in experimental design, as it equates groups statistically by balancing both known and unknown factors at the outset of the study. In statistical terms, it underpins the principle of probabilistic equivalence among groups, allowing for the unbiased estimation of treatment effects and the generalizability of conclusions drawn from sample data to the broader population. +Randomization is not haphazard; instead, a random process is a sequence of random variables describing a process whose outcomes do not follow a deterministic pattern but follow an evolution described by probability distributions. For example, a random sample of individuals from a population refers to a sample where every individual has a known probability of being sampled. This would be contrasted with nonprobability sampling, where arbitrary individuals are selected. A runs test can be used to determine whether the occurrence of a set of measured values is random. 
Randomization is widely applied in various fields, especially in scientific research, statistical analysis, and resource allocation, to ensure fairness and validity in the outcomes. +In various contexts, randomization may involve + +Generating Random Permutations: This is essential in various situations, such as shuffling cards. By randomly rearranging the sequence, it ensures fairness and unpredictability in games and experiments. +Selecting Random Samples from Populations: In statistical sampling, this method is vital for obtaining representative samples. By randomly choosing a subset of individuals, biases are minimized, ensuring that the sample accurately reflects the larger population. +Random Allocation in Experimental Design: Random assignment of experimental units to treatment or control conditions is fundamental in scientific studies. This approach ensures that each unit has an equal chance of receiving any treatment, thereby reducing systematic bias and improving the reliability of experimental results. +Generating Random Numbers: The process of random number generation is central to simulations, cryptographic applications, and statistical analysis. These numbers form the basis for simulations, model testing, and secure data encryption. +Data Stream Transformation: In telecommunications, randomization is used to transform data streams. Techniques like scramblers randomize the data to prevent predictable patterns, which is crucial for securing communication channels and enhancing transmission reliability. +Randomization has many uses in gambling, politics, statistical analysis, art, cryptography, gaming and other fields. + +== In gambling == + +In the world of gambling, the integrity and fairness of games hinge significantly on effective randomization. This principle serves as a cornerstone in gambling, ensuring that each game outcome is unpredictable and not manipulable.
The necessity for advanced randomization methods stems from the potential for skilled gamblers to exploit weaknesses in poorly randomized systems. High-quality randomization thwarts attempts at prediction or manipulation, maintaining the fairness of games. A quintessential example of randomization in gambling is the shuffling of playing cards. This process must be thoroughly random to prevent any predictability in the order of cards. Casinos often employ automatic shuffling machines, which enhance randomness beyond what manual shuffling can achieve. +With the rise of online casinos, digital random number generators (RNGs) have become crucial. These RNGs use complex algorithms to produce outcomes that are as unpredictable as their real-world counterparts. The gambling industry invests heavily in research to develop more effective randomization techniques. To ensure that gambling games are fair and random, regulatory bodies rigorously test and certify shuffling and random number generation methods. This oversight is vital in maintaining trust in the gambling industry, ensuring that players have equal chances of winning. +The unpredictability inherent in randomization is also a key factor in the psychological appeal of gambling. The thrill and suspense created by the uncertainty of outcomes contribute significantly to the allure and excitement of gambling games. +In summary, randomization in gambling is not just a technical necessity; it is a fundamental principle that upholds the fairness, integrity, and thrill of the games. As technology advances, so too do the methods to ensure that this randomization remains effective and beyond reproach. + +== In politics == +The concept of randomization in political systems, specifically through the method of allotment or sortition, has ancient roots and contemporary relevance, significantly impacting the evolution and practice of democracy.
+In the fifth century BC, Athenian democracy was pioneering in its approach to ensuring political equality, or isonomia. Central to this system was the principle of random selection, seen as a cornerstone for fair representation. The unique structure of Greek democracy, which translates to "rule by the people," was exemplified by administrative roles being rotated among citizens, selected randomly through lot. This method was perceived as more democratic than elections, which the Athenians argued could lead to inequalities. They believed that elections, which often favored candidates based on merit or popularity, contradicted the principle of equal rights for all citizens. Furthermore, the random allotment of positions like magistrates or jury members served as a deterrent to vote-buying and corruption, as it was impossible to predict who would be chosen for these roles. + +In modern times, the concept of allotment, also known as sortition, is primarily seen in the selection of jurors within Anglo-Saxon legal systems, such as those in the UK and the United States. However, its political implications extend further. There have been various proposals to integrate sortition into government structures. The idea is that sortition could introduce a new dimension of representation and fairness in political systems, countering issues associated with electoral politics. This concept has garnered academic interest, with scholars exploring the potential of random selection in enhancing the democratic process, both in political frameworks and organizational structures. The ongoing study and debate surrounding the use of sortition reflect its enduring relevance and potential as a tool for political innovation and integrity. 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Randomization-1.md b/data/en.wikipedia.org/wiki/Randomization-1.md new file mode 100644 index 000000000..404a751e7 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Randomization-1.md @@ -0,0 +1,71 @@ +--- +title: "Randomization" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Randomization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:37.479553+00:00" +instance: "kb-cron" +--- + +== Randomization in statistical analysis == +Randomization is a core principle in statistical theory, whose importance was emphasized by Charles S. Peirce in "Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883). Its application in statistical methodologies is multifaceted and includes critical processes such as randomized controlled experiments, survey sampling and simulations. + +=== Randomized controlled experiment === + +In the realm of scientific research, particularly within clinical study designs, constraints such as limited manpower, material resources, financial backing, and time necessitate a selective approach to participant inclusion. Despite the broad spectrum of potential participants who fulfill the inclusion criteria, it is impractical to incorporate every eligible individual in the target population due to these constraints. Therefore, a representative subset of treatment groups is chosen based on the specific requirements of the research. A randomized sampling method is employed to ensure the integrity and representativeness of the study. This method ensures that all qualified subjects within the target population have an equal opportunity to be selected. Such a strategy is pivotal in mirroring the overall characteristics of the target population and in mitigating the risk of selection bias. 
+The selected samples (or continuous non-randomly sampled samples) are grouped using randomization methods so that all research subjects in the sample have an equal chance of entering the experimental group or the control group and receiving corresponding treatment. In particular, the random grouping after the research subjects are stratified can make the known or unknown influencing factors between the groups basically consistent, thereby enhancing the comparability between the groups. + +=== Survey sampling === +Survey sampling uses randomization, following the criticisms of previous "representative methods" by Jerzy Neyman in his 1922 report to the International Statistical Institute. It randomly displays the answer options to survey participants, which prevents order bias caused by the tendency of respondents to choose the first option when the same order is presented to different respondents. To overcome this, researchers can give the answer options in a random order so that the respondents allocate some time to read all the options and choose an honest answer. For example, consider an automobile dealer who wants to conduct a feedback survey and ask the respondents to select their preferred automobile brand. The user can create a study with randomized answers to display the different automobile brands so that the respondents do not see them in the same order. + +=== Resampling === + +Some important methods of statistical inference use resampling from the observed data. Multiple alternative versions of the data-set that "might have been observed" are created by randomization of the original data-set, the only one observed. The variation of statistics calculated for these alternative data-sets is a guide to the uncertainty of statistics estimated from the original data. + +=== Simulation === + +In many scientific and engineering fields, computer simulations of real phenomena are commonly used. 
When the real phenomena are affected by unpredictable processes, such as radio noise or day-to-day weather, these processes can be simulated using random or pseudo-random numbers. One of the most prominent uses of randomization in simulations is in Monte Carlo methods. These methods rely on repeated random sampling to obtain numerical results, typically to model probability distributions or to estimate uncertain quantities in a system. +Randomization also allows for the testing of models or algorithms against unexpected inputs or scenarios. This is essential in fields like machine learning and artificial intelligence, where algorithms must be robust against a variety of inputs and conditions. + +== In cryptography == + +== In art == +Randomization plays a fascinating and often underappreciated role in literature, music, and art, where it introduces elements of unpredictability and spontaneity. Here is how it manifests in each of these creative fields: + +=== Literature === + +Pioneered by surrealists and later popularized by writers like William S. Burroughs, automatic writing and cut-up techniques involve randomly rearranging text to create new literary forms. It disrupts linear narratives, fostering unexpected connections and meanings. + +=== Music === +In aleatoric music, elements of the composition are left to chance or the performer's discretion. Composers like John Cage used randomization to create music where certain elements are unforeseeable, resulting in each performance being uniquely different. Modern musicians sometimes employ computer algorithms that generate music based on random inputs. These compositions can range from electronic music to more classical forms, where randomness plays a key role in creating harmony, melody, or rhythm. + +=== Art === +Some artists in abstract expressionism movement, like Jackson Pollock, used random methods (like dripping or splattering paint) to create their artworks. 
This approach emphasizes the physical act of painting and the role of chance in the artistic process. Contemporary artists also often use algorithms and computer-generated randomness to create visual art. This can result in intricate patterns and designs that would be difficult or impossible to predict or replicate manually. + +== Techniques == +Although historically "manual" randomization techniques (such as shuffling cards, drawing pieces of paper from a bag, spinning a roulette wheel) were common, nowadays automated techniques are mostly used. As both selecting random samples and random permutations can be reduced to simply selecting random numbers, random number generation methods are now most commonly used, both hardware random number generators and pseudo-random number generators. + +=== Optimization === +Randomization is used in optimization to alleviate the computational burden associated with robust control techniques: a sample of values of the uncertainty parameters is randomly drawn and robustness is enforced for these values only. This approach has gained popularity with the introduction of rigorous theories that permit control of the probabilistic level of robustness; see scenario optimization.
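
The Techniques section notes that both random permutations and random samples reduce to drawing random numbers. A minimal sketch of that reduction follows, using a Fisher-Yates shuffle; the function names, the seed, and the ten-subject list are illustrative assumptions, not part of the source article.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Return a random permutation of items, built only from random integers."""
    result = list(items)
    # Walk from the last position down, swapping each element with a
    # randomly chosen element at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result


def simple_random_sample(items, k, rng):
    """Draw k items without replacement as a prefix of a random permutation."""
    if not 0 <= k <= len(items):
        raise ValueError("k must be between 0 and len(items)")
    return fisher_yates_shuffle(items, rng)[:k]


# Hypothetical usage: a seeded pseudo-random generator and ten subject IDs.
rng = random.Random(42)
subjects = list(range(1, 11))
permutation = fisher_yates_shuffle(subjects, rng)
sample = simple_random_sample(subjects, 4, rng)
print(permutation)
print(sample)
```

A seeded pseudo-random generator makes the allocation reproducible; a hardware or cryptographic source would be the swap-in where unpredictability matters.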
+Common randomization methods including + +Simple randomization (coin flipping, drawing lots and random number method) +Stratified randomization (stratified sampling and stratified allocation) +Block randomization +Systematic randomization +Cluster randomization +Multistage sampling +Quasi-randomization +Covariate Adaptive Randomization + +== See also == +Randomized algorithm +Bias +Random number generation + +== References == + +== External links == +RQube Archived 2019-06-20 at the Wayback Machine - Generate quasi-random stimulus sequences for experimental designs +RandList - Randomization List Generator \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-0.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-0.md index ae8cfd5fd..4ce3a0aa4 100644 --- a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-0.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-0.md @@ -4,7 +4,7 @@ chunk: 1/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-1.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-1.md index c85e497f0..91fe48898 100644 --- a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-1.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-1.md @@ -4,7 +4,7 @@ chunk: 2/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-2.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-2.md index 63c69f82d..a7159b79f 100644 --- 
a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-2.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-2.md @@ -4,7 +4,7 @@ chunk: 3/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-3.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-3.md index 85d91ef0a..6dda4fb3b 100644 --- a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-3.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-3.md @@ -4,7 +4,7 @@ chunk: 4/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-4.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-4.md index 6d76d3c27..b9164a2a2 100644 --- a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-4.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-4.md @@ -4,7 +4,7 @@ chunk: 5/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-5.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-5.md index a872235be..621621d7f 100644 --- a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-5.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-5.md @@ -4,7 +4,7 @@ chunk: 6/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: 
"2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-6.md b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-6.md index b9f2650d6..d09d2632a 100644 --- a/data/en.wikipedia.org/wiki/Randomized_controlled_trial-6.md +++ b/data/en.wikipedia.org/wiki/Randomized_controlled_trial-6.md @@ -4,7 +4,7 @@ chunk: 7/7 source: "https://en.wikipedia.org/wiki/Randomized_controlled_trial" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:11.523187+00:00" +date_saved: "2026-05-05T09:51:38.699153+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Randomized_experiment-0.md b/data/en.wikipedia.org/wiki/Randomized_experiment-0.md new file mode 100644 index 000000000..eaaea14f2 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Randomized_experiment-0.md @@ -0,0 +1,67 @@ +--- +title: "Randomized experiment" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Randomized_experiment" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:39.917852+00:00" +instance: "kb-cron" +--- + +In science, randomized experiments are the experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Randomization-based inference is especially important in experimental design and in survey sampling. + + +== Overview == +In the statistical theory of design of experiments, randomization involves randomly allocating the experimental units across the treatment groups. For example, if an experiment compares a new drug against a standard drug, then the patients should be allocated to either the new drug or to the standard drug control using randomization. +Randomized experimentation is not haphazard. 
Randomization reduces bias by equalising other factors that have not been explicitly accounted for in the experimental design (according to the law of large numbers). Randomization also produces ignorable designs, which are valuable in model-based statistical inference, especially Bayesian or likelihood-based. In the design of experiments, the simplest design for comparing treatments is the "completely randomized design". Some "restriction on randomization" can occur with blocking and experiments that have hard-to-change factors; additional restrictions on randomization can occur when a full randomization is infeasible or when it is desirable to reduce the variance of estimators of selected effects. +Randomization of treatment in clinical trials poses ethical problems. In some cases, randomization reduces the therapeutic options for both physician and patient, and so randomization requires clinical equipoise regarding the treatments. + + +== Online randomized controlled experiments == +Web sites can run randomized controlled experiments to create a feedback loop. Key differences between offline experimentation and online experiments include: + +Logging: user interactions can be logged reliably. +Number of users: large sites, such as Amazon, Bing/Microsoft, and Google run experiments, each with over a million users. +Number of concurrent experiments: large sites run tens of overlapping, or concurrent, experiments. +Robots, whether web crawlers from valid sources or malicious internet bots. +Ability to ramp-up experiments from low percentages to higher percentages. +Speed / performance has significant impact on key metrics. +Ability to use the pre-experiment period as an A/A test to reduce variance. + + +== History == + +A controlled experiment appears to have been suggested in the Old Testament's Book of Daniel. King Nebuchadnezzar proposed that some Israelites eat "a daily amount of food and wine from the king's table."
Daniel preferred a vegetarian diet, but the official was concerned that the king would "see you looking worse than the other young men your age? The king would then have my head because of you." Daniel then proposed the following controlled experiment: "Test your servants for ten days. Give us nothing but vegetables to eat and water to drink. Then compare our appearance with that of the young men who eat the royal food, and treat your servants in accordance with what you see". (Daniel 1:12–13). +Randomized experiments were institutionalized in psychology and education in the late eighteen-hundreds, following the invention of randomized experiments by C. S. Peirce. +Outside of psychology and education, randomized experiments were popularized by R.A. Fisher in his book Statistical Methods for Research Workers, which also introduced additional principles of experimental design. + + +== Statistical interpretation == + +The Rubin Causal Model provides a common way to describe a randomized experiment. While the Rubin Causal Model provides a framework for defining the causal parameters (i.e., the effects of a randomized treatment on an outcome), the analysis of experiments can take a number of forms. The model assumes that there are two potential outcomes for each unit in the study: the outcome if the unit receives the treatment and the outcome if the unit does not receive the treatment. The difference between these two potential outcomes is known as the treatment effect, which is the causal effect of the treatment on the outcome. Most commonly, randomized experiments are analyzed using ANOVA, student's t-test, regression analysis, or a similar statistical test. The model also accounts for potential confounding factors, which are factors that could affect both the treatment and the outcome. 
By controlling for these confounding factors, the model helps to ensure that any observed treatment effect is truly causal and not simply the result of other factors that are correlated with both the treatment and the outcome. +The Rubin Causal Model is a useful framework for understanding how to estimate the causal effect of the treatment, even when there are confounding variables that may affect the outcome. This model specifies that the causal effect of the treatment is the difference in the outcomes that would have been observed for each individual if they had received the treatment and if they had not received the treatment. In practice, it is not possible to observe both potential outcomes for the same individual, so statistical methods are used to estimate the causal effect using data from the experiment. + + +== Empirical evidence that randomization makes a difference == +Empirically, differences between randomized and non-randomized studies, and between adequately and inadequately randomized trials, have been difficult to detect. + + +== Directed acyclic graph (DAG) explanation of randomization == +Randomization is the cornerstone of many scientific claims. Randomizing means that confounding factors can be eliminated. Suppose we study the effect of A on B, but there are many unobservables U that potentially affect B and confound our estimate of the effect. To explain these kinds of issues, statisticians and econometricians nowadays use directed acyclic graphs. + + +== See also == +A/B testing +Allocation concealment +Random assignment +Randomized block design +Randomized controlled trial + + +== References == + +Caliński, Tadeusz & Kageyama, Sanpei (2000). Block designs: A Randomization approach, Volume I: Analysis. Lecture Notes in Statistics. Vol. 150. New York: Springer-Verlag. ISBN 978-0-387-98578-7. +Caliński, Tadeusz & Kageyama, Sanpei (2003). Block designs: A Randomization approach, Volume II: Design. Lecture Notes in Statistics. Vol. 170.
New York: Springer-Verlag. ISBN 978-0-387-95470-7. +Hacking, Ian (September 1988). "Telepathy: Origins of Randomization in Experimental Design". Isis. 79 (3): 427–451. doi:10.1086/354775. JSTOR 234674. MR 1013489. S2CID 52201011. +Hinkelmann, Klaus; Kempthorne, Oscar (2008). Design and Analysis of Experiments, Volume I: Introduction to Experimental Design (Second ed.). Wiley. ISBN 978-0-471-72756-9. MR 2363107. +Kempthorne, Oscar (1992). "Intervention experiments, randomization and inference". In Malay Ghosh and Pramod K. Pathak (ed.). Current Issues in Statistical Inference—Essays in Honor of D. Basu. Institute of Mathematical Statistics Lecture Notes - Monograph Series. Hayward, CA: Institute for Mathematical Statistics. pp. 13–31. doi:10.1214/lnms/1215458836. ISBN 978-0-940600-24-9. MR 1194407. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Registered_report-0.md b/data/en.wikipedia.org/wiki/Registered_report-0.md new file mode 100644 index 000000000..7516a0e2d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Registered_report-0.md @@ -0,0 +1,43 @@ +--- +title: "Registered report" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Registered_report" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:14.608295+00:00" +instance: "kb-cron" +--- + +A registered report is a scholarly publishing format in which the research question, study design, and analysis plan are peer-reviewed before data are collected or analyzed. This publication model aims to reduce publication bias and improve research transparency by restricting the evaluation of scientific manuscripts to the research question, method, and analysis plan, without consideration of whether the results support the research question or not. + + +== Process == +The Registered Reports process occurs in two stages of peer review. 
In the first stage, known as Stage 1 review, authors submit a detailed study proposal to a journal that offers the Registered Reports format. The proposal typically includes the theoretical rationale, research questions or hypotheses, experimental design, sampling plan, and a preregistered statistical analysis plan. At this point, no data have yet been collected or analyzed. +Peer reviewers evaluate the submission based on the importance of the research question and the quality and rigor of the proposed methodology. Because the preregistration undergoes peer review, Registered Reports introduce an additional quality control mechanism compared with standard preregistration practices in which researchers register study plans independently. +Following Stage 1 review, a journal may reject the submission, request revisions, or grant in-principle acceptance (IPA). In-principle acceptance indicates that the journal commits to publishing the study provided that the authors follow the approved research plan and that the study meets basic quality standards. +After receiving in-principle acceptance, researchers conduct the study and analyze the data according to the preregistered plan. Deviations from the preregistration are permitted but must be transparently documented, and their impact on the validity and severity of the statistical tests should be evaluated. Substantial deviations may require additional editorial or peer review. +Once the study has been completed, authors submit the full manuscript, including results and conclusions, for Stage 2 review. Reviewers assess whether the preregistered procedures were followed, evaluate any deviations from the original plan, and examine whether the conclusions are supported by the data. Because publication is already conditionally guaranteed, manuscripts are not rejected on the basis of null or statistically non-significant results, except in rare cases where methodological problems render the study uninformative. 
+ + +== History == +Concerns about publication bias in the scientific literature date back to early metascientific studies, such as Sterling's analysis of psychology journals. Several authors proposed publication models in which editorial decisions would be made without knowledge of study results. +Rosenthal suggested that reviewers evaluate research procedures before seeing the results in order to reduce bias in editorial decisions. Similarly, Walster and Cleary (1970) argued that decisions about how data should be treated must be made before examining the data and proposed that publication decisions should follow the same principle. +A more explicit proposal to review study protocols prior to data collection was described by Johnson (1975) in the European Journal of Parapsychology. Johnson proposed that researchers submit a manuscript describing the study rationale, hypotheses, and experimental design before conducting the research, and that editors commit to publishing the study regardless of the outcome. This approach was intended to prevent selective reporting and post-hoc modification of hypotheses, as Johnson explained (p. 43): +“According to the philosophy of this model, the experimenter should define his problem, formulate his hypothesis, and outline his experiment, prior to commencing his study. He should write his manuscript, stating at least essential facts, before carrying out his investigation. this manuscript, in principle only lacking data in the tables, presentation of the results, and interpretation of the results, should be sent to one or more editors, and the experiment should not initiate his study until at least one of the editors has promised to publish the study, regardless of the outcome of the experiment. in this way we could avoid selective reporting. Furthermore, the experimenter will not be given the opportunity to change his hypotheses in such a way that they “fit” the outcome of the experiment”.
+The modern Registered Reports format was independently developed and introduced at the journal Cortex by Chris Chambers in 2013. The format was subsequently adopted by a growing number of journals across scientific disciplines. The first Registered Reports were published in 2014 in a special issue of Social Psychology containing preregistered replication studies. + + +== Benefits == +Registered Reports are designed to reduce publication bias and other forms of systematic bias in scientific publishing. Because publication decisions are made before results are known, studies are not preferentially published based on statistically significant outcomes. +Another advantage is that peer review occurs earlier in the research process. Feedback from reviewers can improve study design and address potential methodological weaknesses before data collection begins. In addition, researchers know in advance whether their study will be published if they follow the approved protocol, which can be particularly beneficial for projects requiring substantial time or resources. +Empirical studies suggest that Registered Reports produce a different distribution of reported results compared with traditional publications. Scheel and colleagues found that approximately 96% of traditional psychology articles reported statistically significant results, whereas only 44% of hypotheses tested in Registered Reports yielded significant findings. Although this difference is correlational and cannot determine causation, it may reflect reduced selective reporting or reduced engagement in practices such as p-hacking after in-principle acceptance. +Some studies also suggest that Registered Reports exhibit higher methodological rigor and stronger alignment between research questions and analytical methods than comparable traditional articles. 
+ + +== Limitations == +One limitation of the Registered Reports format is that researchers cannot begin data collection until the Stage 1 peer review process has been completed and in-principle acceptance has been granted. This requirement can introduce uncertainty in research timelines. +To address this issue, some journals and organizations have implemented scheduled review systems in which researchers announce in advance when they intend to submit a Stage 1 manuscript. Editors then recruit reviewers ahead of time so that peer review can begin immediately after submission. Platforms such as Peer Community in Registered Reports (PCI-RR) have adopted this workflow to accelerate the Stage 1 review process. +Despite these practical limitations, the format has received broad support in the scientific community as a mechanism for reducing publication bias and improving the transparency and rigor of research practices, and no criticisms on the Registered Report format have been published in the scientific literature. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Regression_discontinuity_design-0.md b/data/en.wikipedia.org/wiki/Regression_discontinuity_design-0.md new file mode 100644 index 000000000..d58db2f59 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Regression_discontinuity_design-0.md @@ -0,0 +1,292 @@ +--- +title: "Regression discontinuity design" +chunk: 1/3 +source: "https://en.wikipedia.org/wiki/Regression_discontinuity_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:41.067762+00:00" +instance: "kb-cron" +--- + +Regression discontinuity designs (RDD) are a quasi-experimental pretest–posttest design that attempts to determine the causal effects of interventions by assigning a cutoff or threshold above or below which an intervention is assigned. 
By comparing observations lying closely on either side of the threshold, it is possible to estimate the average treatment effect in environments where random assignment to conditions is unfeasible. True causal inference using RDDs is still impossible, because the RDD cannot account for the potentially confounding effects of other variables without randomization. +The RDD was originally applied by Donald Thistlethwaite and Donald Campbell (1960) to evaluate the effect of scholarship programs on student career plans. The RDD is used in disciplines like psychology, economics, political science, epidemiology, and other related disciplines. Recent comparisons of randomised controlled trials (RCTs) and RDDs have empirically demonstrated the internal validity of the design. + +== Example == +The intuition behind the RDD is well illustrated using the evaluation of merit-based scholarships. The main problem with estimating the causal effect of such an intervention is the homogeneity of performance to the assignment of treatment (e.g., a scholarship award). Since high-performing students are more likely to be awarded the merit scholarship and continue performing well at the same time, comparing the outcomes of awardees and non-recipients would lead to an upward bias of the estimates. Even if the scholarship did not improve grades at all, awardees would have performed better than non-recipients, simply because scholarships were given to students who were performing well before. +Despite the absence of an experimental design, an RDD can exploit exogenous characteristics of the intervention to elicit causal effects. If all students above a given grade—for example 80%—are given the scholarship, it is possible to elicit the local treatment effect by comparing students around the 80% cut-off. The intuition here is that a student scoring 79% is likely to be very similar to a student scoring 81%—given the pre-defined threshold of 80%. 
However, one student will receive the scholarship while the other will not. Comparing the outcome of the awardee (treatment group) to the counterfactual outcome of the non-recipient (control group) will hence deliver the local treatment effect.
+
+== Methodology ==
+The two most common approaches to estimation using an RDD are non-parametric and parametric (normally polynomial regression).
+
+=== Non-parametric estimation ===
+The most common non-parametric method used in the RDD context is a local linear regression. This is of the form:
+
+$$Y = \alpha + \tau D + \beta_{1}(X - c) + \beta_{2} D (X - c) + \varepsilon,$$
+
+where $c$ is the treatment cutoff and $D$ is a binary variable equal to one if $X \geq c$. Letting $h$ be the bandwidth of data used, we have $c - h \leq X \leq c + h$. Different slopes and intercepts fit data on either side of the cutoff. Typically either a rectangular kernel (no weighting) or a triangular kernel is used. The rectangular kernel has a more straightforward interpretation than more sophisticated kernels, which yield little efficiency gain.
+The major benefit of using non-parametric methods in an RDD is that they provide estimates based on data closer to the cut-off, which is intuitively appealing. This reduces some bias that can result from using data farther away from the cutoff to estimate the discontinuity at the cutoff. More formally, local linear regressions are preferred because they have better bias properties and have better convergence.
However, the use of both types of estimation, if feasible, is a useful way to argue that the estimated results do not rely too heavily on the particular approach taken.
+
+=== Parametric estimation ===
+An example of a parametric estimation is:
+
+$$Y = \alpha + \beta_{1} x_{i} + \beta_{2} c_{i} + \beta_{3} c_{i}^{2} + \beta_{4} c_{i}^{3} + \varepsilon,$$
+
+where
+
+$$x_{i} = \begin{cases} 1 & \text{if } c_{i} \geq \bar{c} \\ 0 & \text{if } c_{i} < \bar{c} \end{cases}$$
+
+and $\bar{c}$ is the treatment cutoff.
+Note that the polynomial part can be shortened or extended as needed.
+
+=== Other examples ===
+Policies in which treatment is determined by an age eligibility criterion (e.g. pensions, minimum legal drinking age).
+Elections in which one politician wins by a marginal majority.
+Placement scores within education that sort students into treatment programs.
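
The local linear regression described under non-parametric estimation can be sketched in Python as follows. This is a minimal illustration, assuming a sharp design and a rectangular kernel; the function name `local_linear_rdd`, the simulated scores, and the bandwidth of 5 points are hypothetical choices made for the example and do not come from the source.

```python
import numpy as np


def local_linear_rdd(x, y, cutoff, bandwidth):
    """Estimate the jump tau at the cutoff with a rectangular-kernel local linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only observations with c - h <= X <= c + h (rectangular kernel, no weighting).
    keep = np.abs(x - cutoff) <= bandwidth
    xc = x[keep] - cutoff          # centred running variable, X - c
    d = (xc >= 0).astype(float)    # D = 1 if X >= c, else 0
    # Columns: intercept (alpha), D (tau), X - c (beta_1), D * (X - c) (beta_2).
    design = np.column_stack([np.ones_like(xc), d, xc, d * xc])
    coef, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return coef[1]                 # coefficient on D is the estimated discontinuity


# Simulated data (illustrative only): a true jump of 2.0 at an 80% cutoff.
rng = np.random.default_rng(0)
scores = rng.uniform(60, 100, size=2000)
treated = (scores >= 80).astype(float)
outcome = 1.0 + 2.0 * treated + 0.05 * (scores - 80) + rng.normal(0, 0.5, size=2000)

tau_hat = local_linear_rdd(scores, outcome, cutoff=80, bandwidth=5)
print(f"estimated jump at cutoff: {tau_hat:.2f}")
```

Because the interaction term lets the slope differ on each side of the cutoff, the coefficient on `D` measures only the vertical gap between the two fitted lines at `X = c`. Re-running the function with several bandwidths is one simple way to check how sensitive the estimate is to that choice.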
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Regression_discontinuity_design-1.md b/data/en.wikipedia.org/wiki/Regression_discontinuity_design-1.md new file mode 100644 index 000000000..4e44e29c9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Regression_discontinuity_design-1.md @@ -0,0 +1,42 @@ +--- +title: "Regression discontinuity design" +chunk: 2/3 +source: "https://en.wikipedia.org/wiki/Regression_discontinuity_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:41.067762+00:00" +instance: "kb-cron" +--- + +== Required assumptions == +Regression discontinuity design requires that all potentially relevant variables besides the treatment variable and outcome variable be continuous at the point where the treatment and outcome discontinuities occur. One sufficient, though not necessary, condition is if the treatment assignment is "as good as random" at the threshold for treatment. If this holds, then it guarantees that those who just barely received treatment are comparable to those who just barely did not receive treatment, as treatment status is effectively random. +Treatment assignment at the threshold can be "as good as random" if there is randomness in the assignment variable and the agents considered (individuals, firms, etc.) cannot perfectly manipulate their treatment status. For example, suppose the treatment is passing an exam, where a grade of 50% is required. In this case, this example is a valid regression discontinuity design so long as grades are somewhat random, due either to the randomness of grading or randomness of student performance. +Students must not also be able to perfectly manipulate their grade so as to determine their treatment status perfectly. Two examples include students being able to convince teachers to "mercy pass" them, or students being allowed to retake the exam until they pass. 
In the former case, those students who barely fail but are able to secure a "mercy pass" may differ from those who just barely fail but cannot secure a "mercy pass". This leads to selection bias, as the treatment and control groups now differ. In the latter case, some students may decide to retake the exam, stopping once they pass. This also leads to selection bias since only some students will decide to retake the exam. + +=== Testing the validity of the assumptions === +It is impossible to definitively test for validity if agents are able to determine their treatment status perfectly. However, some tests can provide evidence that either supports or discounts the validity of the regression discontinuity design. + +==== Density test ==== + +McCrary (2008) suggested examining the density of observations of the assignment variable. Suppose there is a discontinuity in the density of the assignment variable at the threshold for treatment. In this case, this may suggest that some agents were able to manipulate their treatment status perfectly. +For example, if several students are able to get a "mercy pass", then there will be more students who just barely passed the exam than who just barely failed. Similarly, if students are allowed to retake the exam until they pass, then there will be a similar result. In both cases, this will likely show up when the density of exam grades is examined. "Gaming the system" in this manner could bias the treatment effect estimate. + +==== Continuity of observable variables ==== +Since the validity of the regression discontinuity design relies on those who were just barely treated being the same as those who were just barely not treated, it makes sense to examine if these groups are similarly based on observable variables. For the earlier example, one could test if those who just barely passed have different characteristics (demographics, family income, etc.) than those who just barely failed. 
Although some variables may differ for the two groups based on random chance, most of these variables should be the same. + +==== Falsification tests ==== + +===== Predetermined variables ===== +Similar to the continuity of observable variables, one would expect there to be continuity in predetermined variables at the treatment cutoff. Since these variables were determined before the treatment decision, treatment status should not affect them. Consider the earlier merit-based scholarship example. If the outcome of interest is future grades, then we would not expect the scholarship to affect previous grades. If a discontinuity in predetermined variables is present at the treatment cutoff, then this puts the validity of the regression discontinuity design into question. + +===== Other discontinuities ===== +If discontinuities are present at other points of the assignment variable, where these are not expected, then this may make the regression discontinuity design suspect. Consider the example of Carpenter and Dobkin (2011) who studied the effect of legal access to alcohol in the United States. As the access to alcohol increases at age 21, this leads to changes in various outcomes, such as mortality rates and morbidity rates. If mortality and morbidity rates also increase discontinuously at other ages, then it throws the interpretation of the discontinuity at age 21 into question. + +==== Inclusion and exclusion of covariates ==== +If parameter estimates are sensitive to removing or adding covariates to the model, then this may cast doubt on the validity of the regression discontinuity design. A significant change may suggest that those who just barely got treatment to differ in these covariates from those who just barely did not get treatment. Including covariates would remove some of this bias. 
If a large amount of bias is present, and the covariates explain a significant amount of this, then their inclusion or exclusion would significantly change the parameter estimate. +Recent work has shown how to add covariates, under what conditions doing so is valid, and the potential for increased precision. + +== Advantages == +When properly implemented and analysed, the RDD yields an unbiased estimate of the local treatment effect. The RDD can be almost as good as a randomised experiment in measuring a treatment effect. +RDD, as a quasi-experiment, does not require ex-ante randomisation and circumvents ethical issues of random assignment. +Well-executed RDD studies can generate treatment effect estimates similar to estimates from randomised studies. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Regression_discontinuity_design-2.md b/data/en.wikipedia.org/wiki/Regression_discontinuity_design-2.md new file mode 100644 index 000000000..eeab15b13 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Regression_discontinuity_design-2.md @@ -0,0 +1,42 @@ +--- +title: "Regression discontinuity design" +chunk: 3/3 +source: "https://en.wikipedia.org/wiki/Regression_discontinuity_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:41.067762+00:00" +instance: "kb-cron" +--- + +== Disadvantages == +The estimated effects are only unbiased if the functional form of the relationship between the treatment and outcome is correctly modelled. The most popular caveats are non-linear relationships that are mistaken as a discontinuity. +Contamination by other treatments. Suppose another treatment occurs at the same cutoff value of the same assignment variable. In that case, the measured discontinuity in the outcome variable may be partially attributed to this other treatment. 
For example, suppose a researcher wishes to study the impact of legal access to alcohol on mental health using a regression discontinuity design at the minimum legal drinking age. The measured impact could be confused with legal access to gambling, which may occur at the same age. + +== Extensions == + +=== Fuzzy RDD === +The identification of causal effects hinges on the crucial assumption that there is indeed a sharp cut-off, around which there is a discontinuity in the probability of assignment from 0 to 1. In reality, however, cutoffs are often not strictly implemented (e.g. exercised discretion for students who just fell short of passing the threshold) and the estimates will hence be biased. +In contrast to the sharp regression discontinuity design, a fuzzy regression discontinuity design (FRDD) does not require a sharp discontinuity in the probability of assignment. Still, it is applicable as long as the probability of assignment is different. The intuition behind it is related to the instrumental variable strategy and intention to treat. Fuzzy RDD does not provide an unbiased estimate when the quantity of interest is the proportional effect (e.g. vaccine effectiveness), but extensions exist that do. + +=== Regression kink design === +When the assignment variable is continuous (e.g. student aid) and depends predictably on another observed variable (e.g. family income), one can identify treatment effects using sharp changes in the slope of the treatment function. This technique was coined regression kink design by Nielsen, Sørensen, and Taber (2010), though they cite similar earlier analyses. They write, "This approach resembles the regression discontinuity idea. Instead of a discontinuity of in the level of the stipend-income function, we have a discontinuity in the slope of the function." Rigorous theoretical foundations were provided by Card et al. (2012) and an empirical application by Bockerman et al. (2018). 
+Note that regression kinks (or kinked regression) can also mean a type of segmented regression, which is a different type of analysis. +Final considerations +The RD design takes the shape of a quasi-experimental research design with a clear structure that is devoid of randomized experimental features. Several aspects deny the RD designs an allowance for a status quo. For instance, the designs often involve serious issues that do not offer room for random experiments. Besides, the design of the experiments depends on the accuracy of the modelling process and the relationship between inputs and outputs. + +== See also == +Quasi-experiment +Design of quasi-experiments + +== References == + +== Further reading == +Angrist, J. D.; Pischke, J.-S. (2008). "Getting a Little Jumpy: Regression Discontinuity Designs". Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. pp. 251–268. ISBN 978-0-691-12035-5. +Cattaneo, Matias D.; Titiunik, Rocio (2022). "Regression Discontinuity Designs". Annual Review of Economics. 14: 821–851. arXiv:2108.09400. doi:10.1146/annurev-economics-051520-021409. S2CID 125763727. +Cattaneo, Matias D.; Idrobo, Nicolas; Titiunik, Rocío (2024). A Practical Introduction to Regression Discontinuity Designs: Extensions. Cambridge University Press. +Cook, Thomas D. (2008). "'Waiting for Life to Arrive': A history of the regression-discontinuity design in Psychology, Statistics and Economics". Journal of Econometrics. 142 (2): 636–654. doi:10.1016/j.jeconom.2007.05.002. +Imbens, Guido W.; Wooldridge, Jeffrey M. (2009). "Recent Developments in the Econometrics of Program Evaluation". Journal of Economic Literature. 47 (1): 5–86. doi:10.1257/jel.47.1.5. +Maas, Iris L.; Nolte, Sandra; Walter, Otto B.; Berger, Thomas; Hautzinger, Martin (2017). "The regression discontinuity design showed to be a valid alternative to a randomized controlled trial for estimating treatment effects". Journal of Clinical Epidemiology. 82: 94–102. 
doi:10.1016/j.jclinepi.2016.11.008. PMID 27865902. + +== External links == +Regression-Discontinuity Analysis at Research Methods Knowledge Base \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Repeated_measures_design-0.md b/data/en.wikipedia.org/wiki/Repeated_measures_design-0.md new file mode 100644 index 000000000..c2139e99b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Repeated_measures_design-0.md @@ -0,0 +1,56 @@ +--- +title: "Repeated measures design" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Repeated_measures_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:42.307330+00:00" +instance: "kb-cron" +--- + +Repeated measures design is a research design that involves multiple measures of the same variable taken on the same or matched subjects either under different conditions or over two or more time periods. For instance, repeated measurements are collected in a longitudinal study in which change over time is assessed. + +== Crossover studies == + +A popular repeated-measures design is the crossover study. A crossover study is a longitudinal study in which subjects receive a sequence of different treatments (or exposures). While crossover studies can be observational studies, many important crossover studies are controlled experiments. Crossover designs are common for experiments in many scientific disciplines, for example psychology, education, pharmaceutical science, and health care, especially medicine. +Randomized, controlled, crossover experiments are especially important in health care. In a randomized clinical trial, the subjects are randomly assigned treatments. When such a trial is a repeated measures design, the subjects are randomly assigned to a sequence of treatments. 
A crossover clinical trial is a repeated-measures design in which each patient is randomly assigned to a sequence of treatments, including at least two treatments (of which one may be a standard treatment or a placebo): Thus each patient crosses over from one treatment to another. +Nearly all crossover designs have "balance", which means that all subjects should receive the same number of treatments and that all subjects participate for the same number of periods. In most crossover trials, each subject receives all treatments. +However, many repeated-measures designs are not crossovers: the longitudinal study of the sequential effects of repeated treatments need not use any "crossover", for example (Vonesh & Chinchilli; Jones & Kenward). + +== Uses == +Limited number of participants—The repeated measure design reduces the variance of estimates of treatment-effects, allowing statistical inference to be made with fewer subjects. +Efficiency—Repeated measure designs allow many experiments to be completed more quickly, as fewer groups need to be trained to complete an entire experiment. For example, experiments in which each condition takes only a few minutes, whereas the training to complete the tasks take as much, if not more time. +Longitudinal analysis—Repeated measure designs allow researchers to monitor how participants change over time, both long- and short-term situations. + +== Order effects == +Order effects may occur when a participant in an experiment is able to perform a task and then perform it again. Examples of order effects include performance improvement or decline in performance, which may be due to learning effects, boredom or fatigue. The impact of order effects may be smaller in long-term longitudinal studies or by counterbalancing using a crossover design. + +== Counterbalancing == +In this technique, two groups each perform the same tasks or experience the same conditions, but in reverse order. 
With two tasks or conditions, four groups are formed.
+
+Counterbalancing attempts to take account of two important sources of systematic variation in this type of design: practice and boredom effects. Both might otherwise lead to different performance of participants due to familiarity with or tiredness from the treatments.
+
+== Limitations ==
+It may not be possible for each participant to be in all conditions of the experiment (e.g. because of time constraints or the location of the experiment). Severely diseased subjects tend to drop out of longitudinal studies, potentially biasing the results. In these cases mixed effects models would be preferable as they can deal with missing values.
+Mean regression may affect conditions with significant repetitions. Maturation may affect studies that extend over time. Events outside the experiment may change the response between repetitions.
+
+== Repeated measures ANOVA ==
+
+Repeated measures analysis of variance (rANOVA) is a commonly used statistical approach to repeated measure designs. With such designs, the repeated-measure factor (the qualitative independent variable) is the within-subjects factor, while the dependent quantitative variable on which each participant is measured is the dependent variable.
+
+=== Partitioning of error ===
+One of the greatest advantages to rANOVA, as is the case with repeated measures designs in general, is the ability to partition out variability due to individual differences. Consider the general structure of the F-statistic:
+
+$$F = \frac{MS_{\text{Treatment}}}{MS_{\text{Error}}} = \frac{SS_{\text{Treatment}} / df_{\text{Treatment}}}{SS_{\text{Error}} / df_{\text{Error}}}$$
+
+In a between-subjects design there is an element of variance due to individual difference that is combined with the treatment and error terms:
+
+$$SS_{\text{Total}} = SS_{\text{Treatment}} + SS_{\text{Error}}$$
+$$df_{\text{Total}} = n - 1$$
+
+In a repeated measures design it is possible to partition subject variability from the treatment and error terms.
In such a case, variability can be broken down into between-treatments variability (or within-subjects effects, excluding individual differences) and within-treatments variability. The within-treatments variability can be further partitioned into between-subjects variability (individual differences) and error (excluding the individual differences):
+
+$$SS_{\text{Total}} = SS_{\text{Treatment}} + SS_{\text{Subjects}} + SS_{\text{Error}}$$
+
+where $SS_{\text{Treatment}}$ excludes individual differences, and
+
+$$df_{\text{Total}} = df_{\text{Treatment}} + df_{\text{Subjects}} + df_{\text{Error}} = (k - 1) + (s - 1) + (k - 1)(s - 1) = ks - 1 = n - 1,$$
+
+where $df_{\text{Treatment}}$ is the within-subjects term, $df_{\text{Subjects}}$ is the between-subjects term, $k$ is the number of time levels and $s$ is the number of subjects.
+In reference to the general structure of the F-statistic, it is clear that by partitioning out the between-subjects variability, the F-value will increase because the sum of squares error term will be smaller, resulting in a smaller $MS_{\text{Error}}$. Note that partitioning variability also reduces the degrees of freedom of the F-test, so the between-subjects variability must be large enough to offset the loss in degrees of freedom. If between-subjects variability is small, this process may actually reduce the F-value.
+
+=== Assumptions ===
+As with all statistical analyses, specific assumptions should be met to justify the use of this test. Violations can moderately to severely affect results and often lead to an inflation of type 1 error. With the rANOVA, standard univariate and multivariate assumptions apply.
The univariate assumptions are: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Repeated_measures_design-1.md b/data/en.wikipedia.org/wiki/Repeated_measures_design-1.md new file mode 100644 index 000000000..a0198cddd --- /dev/null +++ b/data/en.wikipedia.org/wiki/Repeated_measures_design-1.md @@ -0,0 +1,58 @@ +--- +title: "Repeated measures design" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Repeated_measures_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:42.307330+00:00" +instance: "kb-cron" +--- + +Normality—For each level of the within-subjects factor, the dependent variable must have a normal distribution. +Sphericity—Difference scores computed between two levels of a within-subjects factor must have the same variance for the comparison of any two levels. (This assumption only applies if there are more than 2 levels of the independent variable.) +Randomness—Cases should be derived from a random sample, and scores from different participants should be independent of each other. +The rANOVA also requires that certain multivariate assumptions be met, because a multivariate test is conducted on difference scores. These assumptions include: + +Multivariate normality—The difference scores are multivariately normally distributed in the population. +Randomness—Individual cases should be derived from a random sample, and the difference scores for each participant are independent from those of another participant. + +=== F test === +As with other analysis of variance tests, the rANOVA makes use of an F statistic to determine significance. Depending on the number of within-subjects factors and assumption violations, it is necessary to select the most appropriate of three tests: + +Standard Univariate ANOVA F test—This test is commonly used given only two levels of the within-subjects factor (i.e. time point 1 and time point 2). 
This test is not recommended given more than 2 levels of the within-subjects factor because the assumption of sphericity is commonly violated in such cases. +Alternative Univariate test—These tests account for violations to the assumption of sphericity, and can be used when the within-subjects factor exceeds 2 levels. The F statistic is the same as in the Standard Univariate ANOVA F test, but is associated with a more accurate p-value. This correction is done by adjusting the degrees of freedom downward for determining the critical F value. Two corrections are commonly used: the Greenhouse–Geisser correction and the Huynh–Feldt correction. The Greenhouse–Geisser correction is more conservative, but addresses a common issue of increasing variability over time in a repeated-measures design. The Huynh–Feldt correction is less conservative, but does not address issues of increasing variability. It has been suggested that lower Huynh–Feldt be used with smaller departures from sphericity, while Greenhouse–Geisser be used when the departures are large. +Multivariate Test—This test does not assume sphericity, but is also highly conservative. + +=== Effect size === +One of the most commonly reported effect size statistics for rANOVA is partial eta-squared (ηp2). It is also common to use the multivariate η2 when the assumption of sphericity has been violated, and the multivariate test statistic is reported. A third effect size statistic that is reported is the generalized η2, which is comparable to ηp2 in a one-way repeated measures ANOVA. It has been shown to be a better estimate of effect size with other within-subjects tests. + +=== Cautions === +rANOVA is not always the best statistical analysis for repeated measure designs. The rANOVA is vulnerable to effects from missing values, imputation, unequivalent time points between subjects and violations of sphericity. These issues can result in sampling bias and inflated rates of Type I error. 
In such cases it may be better to consider use of a linear mixed model. + +== See also == + +== Notes == + +== References == + +=== Design and analysis of experiments === +Jones, Byron; Kenward, Michael G. (2003). Design and Analysis of Cross-Over Trials (Second ed.). London: Chapman and Hall.{{cite book}}: CS1 maint: publisher location (link) +Vonesh, Edward F. & Chinchilli, Vernon G. (1997). Linear and Nonlinear Models for the Analysis of Repeated Measurements. London: Chapman and Hall.{{cite book}}: CS1 maint: publisher location (link) + +=== Exploration of longitudinal data === +Davidian, Marie; David M. Giltinan (1995). Nonlinear Models for Repeated Measurement Data. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. ISBN 978-0-412-98341-2. +Fitzmaurice, Garrett; Davidian, Marie; Verbeke, Geert; Molenberghs, Geert, eds. (2008). Longitudinal Data Analysis. Boca Raton, Florida: Chapman and Hall/CRC. ISBN 978-1-58488-658-7. +Jones, Byron; Kenward, Michael G. (2003). Design and Analysis of Cross-Over Trials (Second ed.). London: Chapman and Hall.{{cite book}}: CS1 maint: publisher location (link) +Kim, Kevin & Timm, Neil (2007). ""Restricted MGLM and growth curve model" (Chapter 7)". Univariate and multivariate general linear models: Theory and applications with SAS (with 1 CD-ROM for Windows and UNIX). Statistics: Textbooks and Monographs (Second ed.). Boca Raton, Florida: Chapman & Hall/CRC. ISBN 978-1-58488-634-1. +Kollo, Tõnu & von Rosen, Dietrich (2005). ""Multivariate linear models" (chapter 4), especially "The Growth curve model and extensions" (Chapter 4.1)". Advanced multivariate statistics with matrices. Mathematics and its applications. Vol. 579. New York: Springer. ISBN 978-1-4020-3418-3. +Kshirsagar, Anant M. & Smith, William Boyce (1995). Growth curves. Statistics: Textbooks and Monographs. Vol. 145. New York: Marcel Dekker, Inc. ISBN 0-8247-9341-2. +Pan, Jian-Xin & Fang, Kai-Tai (2002). 
Growth curve models and statistical diagnostics. Springer Series in Statistics. New York: Springer-Verlag. ISBN 0-387-95053-2. +Seber, G. A. F. & Wild, C. J. (1989). ""Growth models (Chapter 7)"". Nonlinear regression. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. New York: John Wiley & Sons, Inc. pp. 325–367. ISBN 0-471-61760-1. +Timm, Neil H. (2002). ""The general MANOVA model (GMANOVA)" (Chapter 3.6.d)". Applied multivariate analysis. Springer Texts in Statistics. New York: Springer-Verlag. ISBN 0-387-95347-7. +Vonesh, Edward F. & Chinchilli, Vernon G. (1997). Linear and Nonlinear Models for the Analysis of Repeated Measurements. London: Chapman and Hall.{{cite book}}: CS1 maint: publisher location (link) (Comprehensive treatment of theory and practice) +Conaway, M. (1999, October 11). Repeated Measures Design. Retrieved February 18, 2008, from http://biostat.mc.vanderbilt.edu/twiki/pub/Main/ClinStat/repmeas.PDF +Minke, A. (1997, January). Conducting Repeated Measures Analyses: Experimental Design Considerations. Retrieved February 18, 2008, from Ericae.net: http://ericae.net/ft/tamu/Rm.htm +Shaughnessy, J. J. (2006). Research Methods in Psychology. New York: McGraw-Hill. 
+
+== External links ==
+Examples of all ANOVA and ANCOVA models with up to three treatment factors, including randomized block, split plot, repeated measures, and Latin squares, and their analysis in R (University of Southampton)
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Replication_(statistics)-0.md b/data/en.wikipedia.org/wiki/Replication_(statistics)-0.md
new file mode 100644
index 000000000..6f278098d
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Replication_(statistics)-0.md
@@ -0,0 +1,74 @@
+---
+title: "Replication (statistics)"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Replication_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T09:51:43.532403+00:00"
+instance: "kb-cron"
+---
+
+In engineering, science, and statistics, replication is the process of repeating a study or experiment under the same or similar conditions. It is a crucial step for testing the original claim, confirming or rejecting the accuracy of results, and identifying and correcting flaws in the original experiment. ASTM, in standard E1847, defines replication as "... the repetition of the set of all the treatment combinations to be compared in an experiment. Each of the repetitions is called a replicate."
+For a full factorial design, replicates are multiple experimental runs with the same factor levels. You can replicate combinations of factor levels, groups of factor level combinations, or even entire designs. For instance, consider a scenario with three factors, each having two levels, and an experiment that tests every possible combination of these levels (a full factorial design). One complete replication of this design would comprise 8 runs ($2^{3}$). The design can be executed once or with several replicates.
+
+There are two main types of replication in statistics.
First, there is a type called “exact replication” (also called "direct replication"), which involves repeating the study as closely as possible to the original to see whether the original results can be precisely reproduced. An example is repeating a study on the effect of a specific diet on weight loss using the same diet plan and measurement methods. The second type of replication is called “conceptual replication.” This involves testing the same theory as the original study but with different conditions, for example testing the same diet's effect on blood sugar levels instead of weight loss, using different measurement methods.
+Both exact (direct) replications and conceptual replications are important. Direct replications help confirm the accuracy of the findings within the conditions that were initially tested. On the other hand, conceptual replications examine the validity of the theory behind those findings and explore different conditions under which those findings remain true. In essence, conceptual replication provides insight into how generalizable the findings are.
+
+
+== The difference between replicates and repeats ==
+Replication is not the same as repeated measurements of the same item. Both repeat and replicate measurements involve multiple observations taken at the same levels of experimental factors. However, repeat measurements are collected during a single experimental session, while replicate measurements are gathered across different experimental sessions. Replication in statistics evaluates the consistency of experiment results across different trials to ensure external validity, while repetition measures precision and internal consistency within the same or similar experiments.
+Replicates Example: Testing a new drug's effect on blood pressure in separate groups on different days.
+Repeats Example: Measuring blood pressure multiple times in one group during a single session.
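
The distinction between repeats and replicates can be made concrete with a small numerical sketch. The blood-pressure readings below are invented for illustration, and the session labels and variable names are hypothetical. The spread inside each session reflects repeat measurements, while the spread of the session means reflects replicates.

```python
from statistics import mean, variance

# Hypothetical blood-pressure readings (mmHg): three replicate sessions,
# each containing four repeat measurements.
sessions = {
    "day_1": [120.0, 121.0, 119.0, 120.0],
    "day_2": [125.0, 126.0, 124.0, 125.0],
    "day_3": [118.0, 117.0, 119.0, 118.0],
}

# Repeats: average sample variance of readings inside a single session,
# which reflects measurement precision.
within_session_var = mean(variance(readings) for readings in sessions.values())

# Replicates: sample variance of the session means across sessions,
# which reflects how consistent the result is from one run to the next.
session_means = [mean(readings) for readings in sessions.values()]
between_session_var = variance(session_means)

print(f"within-session (repeat) variance:     {within_session_var:.3f}")
print(f"between-session (replicate) variance: {between_session_var:.3f}")
```

In this made-up data the between-session variance is much larger than the within-session variance, so repeat measurements alone would understate how much the result changes from one experimental run to another.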
+ + +== Statistical methods in replication == +In replication studies within the field of statistics, several key methods and concepts are employed to assess the reliability of research findings. Here are some of the main statistical methods and concepts used in replication: +P-Values: The p-value is the probability of obtaining data at least as extreme as those observed if the null hypothesis were true. In replication studies, p-values help determine whether the findings can be consistently replicated. A low p-value in a replication study indicates that the results are unlikely to be due to random chance alone. For example, if a study found a statistically significant effect of a test condition on an outcome, and the replication finds a statistically significant effect as well, this suggests that the original finding is likely reproducible. +Confidence Intervals: Confidence intervals provide a range of values within which the true effect size is likely to fall. In replication studies, comparing the confidence intervals of the original study and the replication can indicate whether the results are consistent. For example, if the original study reports a treatment effect with a 95% confidence interval of [5, 10], and the replication study finds a similar effect with a confidence interval of [6, 11], this overlap indicates consistent findings across both studies. + + +== Example == +As an example, consider a continuous process which produces items. Batches of items are then processed or treated. Finally, tests or measurements are conducted. Several options might be available to obtain ten test values. Some possibilities are: + +One finished and treated item might be measured repeatedly to obtain ten test results. Only one item was measured so there is no replication. The repeated measurements help identify observational error. +Ten finished and treated items might be taken from a batch and each measured once. 
This is not full replication because the ten samples are not random and represent neither the continuous process nor the batch processing. +Five items are taken from the continuous process based on sound statistical sampling. These are processed in a batch and tested twice each. This includes replication of initial samples but does not allow for batch-to-batch variation in processing. The repeated tests on each provide some measure and control of testing error. +Five items are taken from the continuous process based on sound statistical sampling. These are processed in five different batches and tested twice each. This plan includes proper replication of initial samples and also includes batch-to-batch variation. The repeated tests on each provide some measure and control of testing error. +For proper sampling, a process or batch of products should be in reasonable statistical control; inherent random variation is present but variation due to assignable (special) causes is not. Evaluation or testing of a single item does not allow for item-to-item variation and may not represent the batch or process. Replication is needed to account for this variation among items and treatments. +Each option would call for different data analysis methods and yield different conclusions. + + +== See also == + +Degrees of freedom (statistics) +Design of experiments +Pseudoreplication +Replication crisis +Sample size +Statistical ensemble +Statistical process control +Test method + + +== References == + + +== Bibliography == +ASTM E122-07 Standard Practice for Calculating Sample Size to Estimate, With Specified Precision, the Average for a Characteristic of a Lot or Process +"Engineering Statistics Handbook", NIST/SEMATECH +Pyzdek, T, "Quality Engineering Handbook", 2003, ISBN 0-8247-4614-7. +Godfrey, A. B., "Juran's Quality Handbook", 1999, ISBN 9780070340039. 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Research_Integrity_and_Peer_Review-0.md b/data/en.wikipedia.org/wiki/Research_Integrity_and_Peer_Review-0.md new file mode 100644 index 000000000..0beb65439 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Research_Integrity_and_Peer_Review-0.md @@ -0,0 +1,18 @@ +--- +title: "Research Integrity and Peer Review" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Research_Integrity_and_Peer_Review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:15.749896+00:00" +instance: "kb-cron" +--- + +Research Integrity and Peer Review is an international, open access, peer reviewed journal that was launched in 2016. It is published by BioMed Central and focuses on problems in peer review, replication, and the scientific process. + + +== See also == +Metascience + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Resentful_demoralization-0.md b/data/en.wikipedia.org/wiki/Resentful_demoralization-0.md new file mode 100644 index 000000000..a03e0499f --- /dev/null +++ b/data/en.wikipedia.org/wiki/Resentful_demoralization-0.md @@ -0,0 +1,22 @@ +--- +title: "Resentful demoralization" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Resentful_demoralization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:44.708773+00:00" +instance: "kb-cron" +--- + +Resentful demoralization is an issue in controlled experiments in which those in the control group become resentful of not receiving the experimental treatment. Alternatively, the experimental group could be resentful of the control group, if the experimental group perceives its treatment as inferior. +They may become angry, depressed, uncooperative, or non-compliant. This may lead to significant systematic differences in the outcome of the control group, obscuring the results of the study and threatening their validity. 
+Resentful demoralization can also make participants less likely to join future studies. If people feel treated unfairly or get upset with their treatment, they might not want to be part of research again. This can lead to fewer participants and affect the quality of future studies. So, it's important to address these feelings to keep participants engaged and ensure good research outcomes. + + +== See also == +Clinical trial +Randomized controlled trial + + +== External links == +Resentful Demoralization in Encyclopedia of Statistics in Behavioral Science. (subscription required) \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Response_surface_methodology-0.md b/data/en.wikipedia.org/wiki/Response_surface_methodology-0.md new file mode 100644 index 000000000..07cd904d1 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Response_surface_methodology-0.md @@ -0,0 +1,97 @@ +--- +title: "Response surface methodology" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Response_surface_methodology" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:45.870436+00:00" +instance: "kb-cron" +--- + +In statistics, response surface methodology (RSM) explores the relationships between several explanatory variables and one or more response variables. RSM is an empirical model which employs the use of mathematical and statistical techniques to relate input variables, otherwise known as factors, to the response. RSM became very useful because other methods available, such as the theoretical model, could be very cumbersome to use, time-consuming, inefficient, error-prone, and unreliable. The method was introduced by George E. P. Box and K. B. Wilson in 1951. The main idea of RSM is to use a sequence of designed experiments to obtain an optimal response. Box and Wilson suggest using a second-degree polynomial model to do this. 
They acknowledge that this model is only an approximation, but they use it because such a model is easy to estimate and apply, even when little is known about the process. +Statistical approaches such as RSM can be employed to maximize the production of a special substance by optimization of operational factors. More recently, RSM combined with a proper design of experiments (DoE) has become widely used for formulation optimization. In contrast to conventional methods, the interaction among process variables can be determined by statistical techniques. + + +== Basic approach of response surface methodology == +An easy way to estimate a first-degree polynomial model is to use a factorial experiment or a fractional factorial design. This is sufficient to determine which explanatory variables affect the response variable(s) of interest. Once it is suspected that only significant explanatory variables are left, a more complicated design, such as a central composite design, can be implemented to estimate a second-degree polynomial model, which is still only an approximation at best. However, the second-degree model can be used to optimize (maximize, minimize, or attain a specific target for) the response variable(s) of interest. + + +== Important RSM properties and features == +Orthogonality +The property that allows individual effects of the k factors to be estimated independently without (or with minimal) confounding. Orthogonality also provides minimum-variance estimates of the model coefficients, which are uncorrelated. +Rotatability +The property of rotating points of the design about the center of the factor space. The moments of the distribution of the design points are constant. +Uniformity +A third property of CCD designs used to control the number of center points is uniform precision (or Uniformity). + + +== Special geometries == + + +=== Cube === +Cubic designs are discussed by Kiefer, by Atkinson, Donev, and Tobias and by Hardin and Sloane. 
+ + +=== Sphere === +Spherical designs are discussed by Kiefer and by Hardin and Sloane. + + +=== Simplex geometry and mixture experiments === +Mixture experiments are discussed in many books on the design of experiments, and in the response-surface methodology textbooks of Box and Draper and of Atkinson, Donev and Tobias. An extensive discussion and survey appears in the advanced textbook by John Cornell. + + +== Extensions == + + +=== Multiple objective functions === + +Some extensions of response surface methodology deal with the multiple response problem. Multiple response variables create difficulty because what is optimal for one response may not be optimal for other responses. Other extensions are used to reduce variability in a single response while targeting a specific value, or attaining a near maximum or minimum while preventing variability in that response from getting too large. + + +== Practical concerns == +Response surface methodology uses statistical models, and therefore practitioners need to be aware that even the best statistical model is an approximation to reality. In practice, both the models and the parameter values are unknown, and subject to uncertainty on top of ignorance. Of course, an estimated optimum point need not be optimum in reality, because of the errors of the estimates and of the inadequacies of the model. +Nonetheless, response surface methodology has an effective track-record of helping researchers improve products and services: For example, Box's original response-surface modeling enabled chemical engineers to improve a process that had been stuck at a saddle-point for years. The engineers had not been able to afford to fit a cubic three-level design to estimate a quadratic model, and their biased linear-models estimated the gradient to be zero. Box's design reduced the costs of experimentation so that a quadratic model could be fit, which led to a (long-sought) ascent direction. 
+ + +== See also == +Box–Behnken design +Central composite design +Gradient-enhanced kriging (GEK) +IOSO method based on response-surface methodology +Optimal designs +Plackett–Burman design +Polynomial and rational function modeling +Polynomial regression +Probabilistic design +Surrogate model +Bayesian Optimization + + +== References == + +Box, G.E.P.; Wilson, K.B. (1951). "On the Experimental Attainment of Optimum Conditions". Journal of the Royal Statistical Society, Series B. 13 (1): 1–45. doi:10.1111/j.2517-6161.1951.tb00067.x. +Box, G. E. P. and Draper, Norman. 2007. Response Surfaces, Mixtures, and Ridge Analyses, Second Edition [of Empirical Model-Building and Response Surfaces, 1987], Wiley. +Atkinson, A.C.; Donev, A.N.; Tobias, R.D. (2007). Optimum Experimental Designs, with SAS. Oxford University Press. pp. 511+xvi. ISBN 978-0-19-929660-6. +Cornell, John (2002). Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data (third ed.). Wiley. ISBN 978-0-471-07916-3. +Goos, Peter (2002). The Optimal Design of Blocked and Split-plot Experiments. Lecture Notes in Statistics. Vol. 164. Springer. ISBN 978-0-387-95515-5. Archived from the original on 2020-10-31. Retrieved 2009-06-01. +Kiefer, Jack Carl (1985). L. D. Brown; et al. (eds.). Jack Carl Kiefer Collected Papers III Design of Experiments. Springer-Verlag. ISBN 978-0-387-96004-3. +Pukelsheim, Friedrich (2006). Optimal Design of Experiments. SIAM. ISBN 978-0-89871-604-7. +Hardin, R.H.; Sloane, N.J.A. (1993). "A New Approach to the Construction of Optimal Designs" (PDF). Journal of Statistical Planning and Inference. 37 (3): 339–369. doi:10.1016/0378-3758(93)90112-J. +Hardin, R.H.; Sloane, N.J.A. "Computer-Generated Minimal (and Larger) Response Surface Designs: (I) The Sphere" (PDF). +Hardin, R.H.; Sloane, N.J.A. "Computer-Generated Minimal (and Larger) Response Surface Designs: (II) The Cube" (PDF). +Ghosh, S.; Rao, C. R., eds. (1996). Design and Analysis of Experiments. 
Handbook of Statistics. Vol. 13. North-Holland. ISBN 978-0-444-82061-7. +Draper, Norman; Lin, Dennis K.J. (1996). "11 Response surface designs". Response surface designs. Handbook of Statistics. Vol. 13. pp. 343–375. doi:10.1016/S0169-7161(96)13013-3. ISBN 9780444820617. +Gaffke, Norbert; Heiligers, Berthold (1996). "30 Approximate designs for polynomial regression: Invariance, admissibility, and optimality". Approximate designs for polynomial regression: Invariance, admissibility, and optimality. Handbook of Statistics. Vol. 13. pp. 1149–99. doi:10.1016/S0169-7161(96)13032-7. ISBN 9780444820617. + + +=== Historical === +Gergonne, J. D. (1974) [1815]. "The application of the method of least squares to the interpolation of sequences". Historia Mathematica. 1 (4) (Translated by Ralph St. John and S. M. Stigler from the 1815 French ed.): 439–447. doi:10.1016/0315-0860(74)90034-2. +Stigler, Stephen M. (1974). "Gergonne's 1815 paper on the design and analysis of polynomial regression experiments". Historia Mathematica. 1 (4): 431–9. doi:10.1016/0315-0860(74)90033-0. +Peirce, C. S. (1876). "Note on the Theory of the Economy of Research" (PDF). Coast Survey Report. Appendix No. 14: 197–201. +Reprinted in Collected Papers of Charles Sanders Peirce. Vol. 7. 1958. paragraphs 139–157, +and in Peirce, C. S. (July–August 1967). "Note on the Theory of the Economy of Research". Operations Research. 15 (4): 643–8. doi:10.1287/opre.15.4.643. JSTOR 168276. +Smith, Kirstine (1918). "On the Standard Deviations of Adjusted and Interpolated Values of an Observed Polynomial Function and its Constants and the Guidance They Give Towards a Proper Choice of the Distribution of the Observations". Biometrika. 12 (1/2): 1–85. doi:10.2307/2331929. JSTOR 2331929. 
+ + +== External links == +Response surface designs \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Restricted_randomization-0.md b/data/en.wikipedia.org/wiki/Restricted_randomization-0.md new file mode 100644 index 000000000..35181e445 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Restricted_randomization-0.md @@ -0,0 +1,34 @@ +--- +title: "Restricted randomization" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Restricted_randomization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:47.072793+00:00" +instance: "kb-cron" +--- + +In statistics, restricted randomization occurs in the design of experiments and in particular in the context of randomized experiments and randomized controlled trials. Restricted randomization allows intuitively poor allocations of treatments to experimental units to be avoided, while retaining the theoretical benefits of randomization. For example, in a clinical trial of a new proposed treatment of obesity compared to a control, an experimenter would want to avoid outcomes of the randomization in which the new treatment was allocated only to the heaviest patients. +The concept was introduced by Frank Yates (1948) and William J. Youden (1972) "as a way of avoiding bad spatial patterns of treatments in designed experiments." + +== Example of nested data == +Consider a batch process that uses 7 monitor wafers in each run. The plan further calls for measuring a response variable on each wafer at each of 9 sites. The organization of the sampling plan has a hierarchical or nested structure: the batch run is the topmost level, the second level is an individual wafer, and the third level is the site on the wafer. +The total amount of data generated per batch run will be 7 · 9 = 63 observations. One approach to analyzing these data would be to compute the mean of all these points as well as their standard deviation and use those results as responses for each run. 
+Analyzing the data as suggested above is not absolutely incorrect, but doing so loses information that one might otherwise obtain. For example, site 1 on wafer 1 is physically different from site 1 on wafer 2 or on any other wafer. The same is true for any of the sites on any of the wafers. Similarly, wafer 1 in run 1 is physically different from wafer 1 in run 2, and so on. To describe this situation one says that sites are nested within wafers while wafers are nested within runs. +As a consequence of this nesting, there are restrictions on the randomization that can occur in the experiment. This kind of restricted randomization always produces nested sources of variation. Examples of nested variation or restricted randomization discussed on this page are split-plot and strip-plot designs. +The objective of an experiment with this type of sampling plan is generally to reduce the variability due to sites on the wafers and wafers within runs (or batches) in the process. The sites on the wafers and the wafers within a batch become sources of unwanted variation and an investigator seeks to make the system robust to those sources—in other words, one could treat wafers and sites as noise factors in such an experiment. +Because the wafers and the sites represent unwanted sources of variation and because one of the objectives is to reduce the process sensitivity to these sources of variation, treating wafers and sites as random effects in the analysis of the data is a reasonable approach. In other words, nested variation is often another way of saying nested random effects or nested sources of noise. If the factors "wafers" and "sites" are treated as random effects, then it is possible to estimate a variance component due to each source of variation through analysis of variance techniques. 
Once estimates of the variance components have been obtained, an investigator is then able to determine the largest source of variation in the process under experimentation, and also determine the magnitudes of the other sources of variation in relation to the largest source. + +== Nested random effects == +If an experiment or process has nested variation, the experiment or process has multiple sources of random error that affect its output. Having nested random effects in a model is the same thing as having nested variation in a model. + +== Split-plot designs == +Split-plot designs result when a particular type of restricted randomization has occurred during the experiment. A simple factorial experiment can result in a split-plot type of design because of the way the experiment was actually executed. +In many industrial experiments, three situations often occur: + +some of the factors of interest may be 'hard to vary' while the remaining factors are easy to vary. As a result, the order in which the treatment combinations for the experiment are run is determined by the ordering of these 'hard-to-vary' factors +experimental units are processed together as a batch for one or more of the factors in a particular treatment combination +experimental units are processed individually, one right after the other, for the same treatment combination without resetting the factor settings for that treatment combination. + +=== Split-plot experimental examples === +An experiment run under one of the above three situations usually results in a split-plot type of design. Consider an experiment to examine electroplating of aluminum (non-aqueous) on copper strips. The three factors of interest are: current (A); solution temperature (T); and the solution concentration of the plating agent (S). Plating rate is the measured response. There are a total of 16 copper strips available for the experiment. 
The treatment combinations to be run (orthogonally scaled) are listed below in standard order (i.e., they have not been randomized): \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Restricted_randomization-1.md b/data/en.wikipedia.org/wiki/Restricted_randomization-1.md new file mode 100644 index 000000000..bfcdd6361 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Restricted_randomization-1.md @@ -0,0 +1,31 @@ +--- +title: "Restricted randomization" +chunk: 2/4 +source: "https://en.wikipedia.org/wiki/Restricted_randomization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:47.072793+00:00" +instance: "kb-cron" +--- + +==== Example: some factors hard to vary ==== +Consider running the experiment under the first condition listed above, with the factor solution concentration of the plating agent (S) being hard to vary. Since this factor is hard to vary, the experimenter would like to randomize the treatment combinations so that the solution concentration factor has a minimal number of changes. In other words, the randomization of the treatment runs is restricted somewhat by the level of the solution concentration factor. +As a result, the treatment combinations might be randomized such that those treatment runs corresponding to one level of the concentration (−1) are run first. Each copper strip is individually plated, meaning only one strip at a time is placed in the solution for a given treatment combination. Once the four runs at the low level of solution concentration have been completed, the solution is changed to the high level of concentration (1), and the remaining four runs of the experiment are performed (where again, each strip is individually plated). 
+Once one complete replicate of the experiment has been completed, a second replicate is performed with a set of four copper strips processed for a given level of solution concentration before changing the concentration and processing the remaining four strips. Note that the levels for the remaining two factors can still be randomized. In addition, the level of concentration that is run first in the replication runs can also be randomized. +Running the experiment in this way results in a split-plot design. Solution concentration is known as the whole plot factor and the subplot factors are the current and the solution temperature. +A split-plot design has more than one size experimental unit. In this experiment, one size experimental unit is an individual copper strip. The treatments or factors that were applied to the individual strips are solution temperature and current (these factors were changed each time a new strip was placed in the solution). The other or larger size experimental unit is a set of four copper strips. The treatment or factor that was applied to a set of four strips is solution concentration (this factor was changed after four strips were processed). The smaller size experimental unit is referred to as the subplot experimental unit, while the larger experimental unit is referred to as the whole plot unit. +There are 16 subplot experimental units for this experiment. Solution temperature and current are the subplot factors in this experiment. There are four whole-plot experimental units in this experiment. Solution concentration is the whole-plot factor in this experiment. Since there are two sizes of experimental units, there are two error terms in the model, one that corresponds to the whole-plot error or whole-plot experimental unit and one that corresponds to the subplot error or subplot experimental unit. 
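The restricted randomization described for the hard-to-vary factor can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed procedure: the function name `split_plot_run_order`, the fixed seed, and the dictionary layout are all hypothetical. Within each replicate, the order of the two concentration levels (S) is randomized, and the four current/temperature combinations (A, T) are then randomized separately within each concentration level.

```python
import itertools
import random


def split_plot_run_order(n_replicates=2, seed=0):
    """Return a run order in which S changes only once per replicate."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Subplot treatments: all (current A, temperature T) level pairs.
    subplot_combos = list(itertools.product((-1, 1), repeat=2))
    runs = []
    for replicate in range(1, n_replicates + 1):
        # Whole-plot randomization: which concentration level is run first.
        levels = [-1, 1]
        rng.shuffle(levels)
        for s in levels:
            # Subplot randomization: order of the four strips in this whole plot.
            order = subplot_combos[:]
            rng.shuffle(order)
            for a, t in order:
                runs.append({"replicate": replicate, "S": s, "A": a, "T": t})
    return runs


runs = split_plot_run_order()
whole_plots = {(r["replicate"], r["S"]) for r in runs}
print(len(runs), "runs in", len(whole_plots), "whole plots")  # 16 runs in 4 whole plots
```

The sketch yields 16 subplot units (individual strips) grouped into four whole-plot units (sets of four strips sharing one concentration setting), matching the counts given in the text.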
+The ANOVA table for this experiment would look, in part, as follows: + +The first three sources are from the whole-plot level, while the next 12 are from the subplot portion. A normal probability plot of the 12 subplot term estimates could be used to look for statistically significant terms. + +==== Example: batch process ==== +Consider running the experiment under the second condition listed above (i.e., a batch process) for which four copper strips are placed in the solution at one time. A specified level of current can be applied to an individual strip within the solution. The same 16 treatment combinations (a replicated $2^{3}$ factorial) are run as were run under the first scenario. However, the way in which the experiment is performed would be different. There are four treatment combinations of solution temperature and solution concentration: (−1, −1), (−1, 1), (1, −1), (1, 1). The experimenter randomly chooses one of these four treatments to set up first. Four copper strips are placed in the solution. Two of the four strips are randomly assigned to the low current level. The remaining two strips are assigned to the high current level. The plating is performed and the response is measured. A second treatment combination of temperature and concentration is chosen and the same procedure is followed. This is done for all four temperature / concentration combinations. +Running the experiment in this way also results in a split-plot design in which the whole-plot factors are now solution concentration and solution temperature, and the subplot factor is current. +In this experiment, one size experimental unit is again an individual copper strip. The treatment or factor that was applied to the individual strips is current (this factor was changed each time for a different strip within the solution). The other or larger size experimental unit is again a set of four copper strips. 
The treatments or factors that were applied to a set of four strips are solution concentration and solution temperature (these factors were changed after four strips were processed). +The smaller size experimental unit is again referred to as the subplot experimental unit. There are 16 subplot experimental units for this experiment. Current is the subplot factor in this experiment. +The larger-size experimental unit is the whole-plot experimental unit. There are four whole plot experimental units in this experiment and solution concentration and solution temperature are the whole plot factors in this experiment. +There are two sizes of experimental units and there are two error terms in the model: one that corresponds to the whole-plot error or whole-plot experimental unit, and one that corresponds to the subplot error or subplot experimental unit. +The ANOVA for this experiment looks, in part, as follows: + +The first three sources come from the whole-plot level and the next 5 come from the subplot level. Since there are 8 degrees of freedom for the subplot error term, this MSE can be used to test each effect that involves current. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Restricted_randomization-2.md b/data/en.wikipedia.org/wiki/Restricted_randomization-2.md new file mode 100644 index 000000000..6c65cc6d9 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Restricted_randomization-2.md @@ -0,0 +1,22 @@ +--- +title: "Restricted randomization" +chunk: 3/4 +source: "https://en.wikipedia.org/wiki/Restricted_randomization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:47.072793+00:00" +instance: "kb-cron" +--- + +==== Example: experimental units processed individually ==== +Consider running the experiment under the third scenario listed above. There is only one copper strip in the solution at one time. 
However, two strips, one at the low current and one at the high current, are processed one right after the other under the same temperature and concentration setting. Once two strips have been processed, the concentration is changed and the temperature is reset to another combination. Two strips are again processed, one after the other, under this temperature and concentration setting. This process is continued until all 16 copper strips have been processed. +Running the experiment in this way also results in a split-plot design in which the whole-plot factors are again solution concentration and solution temperature and the subplot factor is current. In this experiment, one size experimental unit is an individual copper strip. The treatment or factor that was applied to the individual strips is current (this factor was changed each time for a different strip within the solution). The other or larger-size experimental unit is a set of two copper strips. The treatments or factors that were applied to a pair of two strips are solution concentration and solution temperature (these factors were changed after two strips were processed). The smaller size experimental unit is referred to as the subplot experimental unit. +There are 16 subplot experimental units for this experiment. Current is the subplot factor in the experiment. There are eight whole-plot experimental units in this experiment. Solution concentration and solution temperature are the whole plot factors. There are two error terms in the model, one that corresponds to the whole-plot error or whole-plot experimental unit, and one that corresponds to the subplot error or subplot experimental unit. +The ANOVA for this (third) approach is, in part, as follows: + +The first four terms come from the whole-plot analysis and the next 5 terms come from the subplot analysis. Note that we have separate error terms for both the whole plot and the subplot effects, each based on 4 degrees of freedom. 
+As can be seen from these three scenarios, one of the major differences in split-plot designs versus simple factorial designs is the number of different sizes of experimental units in the experiment. Split-plot designs have more than one size experimental unit, i.e., more than one error term. Since these designs involve different sizes of experimental units and different variances, the standard errors of the various mean comparisons involve one or more of the variances. Specifying the appropriate model for a split-plot design involves being able to identify each size of experimental unit. Each experimental unit is defined relative to the design structure (for example, a completely randomized design versus a randomized complete block design) and the treatment structure (for example, a full $2^{3}$ factorial, a resolution V half fraction, a two-way treatment structure with a control group, etc.). As a result of having more than one size experimental unit, the appropriate model used to analyze split-plot designs is a mixed model. +If the data from an experiment are analyzed with only one error term used in the model, misleading and invalid conclusions can be drawn from the results. + +== Strip-plot designs == +Similar to a split-plot design, a strip-plot design can result when some type of restricted randomization has occurred during the experiment. A simple factorial design can result in a strip-plot design depending on how the experiment was conducted. Strip-plot designs often result from experiments that are conducted over two or more process steps in which each process step is a batch process, i.e., completing each treatment combination of the experiment requires more than one processing step with experimental units processed together at each process step. As in the split-plot design, strip-plot designs result when the randomization in the experiment has been restricted in some way. 
As a result of the restricted randomization that occurs in strip-plot designs, there are multiple sizes of experimental units. Therefore, there are different error terms or different error variances that are used to test the factors of interest in the design. A traditional strip-plot design has three sizes of experimental units. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Restricted_randomization-3.md b/data/en.wikipedia.org/wiki/Restricted_randomization-3.md new file mode 100644 index 000000000..59124cb70 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Restricted_randomization-3.md @@ -0,0 +1,38 @@ +--- +title: "Restricted randomization" +chunk: 4/4 +source: "https://en.wikipedia.org/wiki/Restricted_randomization" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:47.072793+00:00" +instance: "kb-cron" +--- + +=== Strip-plot example: two steps and three factor variables === +Consider the following example from the semiconductor industry. An experiment requires an implant step and an anneal step. At both the anneal and the implant steps there are three factors to test. The implant process accommodates 12 wafers in a batch, and implanting a single wafer under a specified set of conditions is not practical nor does doing so represent economical use of the implanter. The anneal furnace can handle up to 100 wafers. +The settings for a two-level factorial design for the three factors in the implant step are denoted (A, B, C), and a two-level factorial design for the three factors in the anneal step are denoted (D, E, F). Also present are interaction effects between the implant factors and the anneal factors. Therefore, this experiment contains three sizes of experimental units, each of which has a unique error term for estimating the significance of effects. 
+To put actual physical meaning to each of the experimental units in the above example, consider each combination of implant and anneal steps as an individual wafer. A batch of eight wafers goes through the implant step first. Treatment combination 3 in factors A, B, and C is the first implant treatment run. This implant treatment is applied to all eight wafers at once. Once the first implant treatment is finished, another set of eight wafers is implanted with treatment combination 5 of factors A, B, and C. This continues until the last batch of eight wafers is implanted with treatment combination 6 of factors A, B, and C. Once all of the eight treatment combinations of the implant factors have been run, the anneal step starts. The first anneal treatment combination to be run is treatment combination 5 of factors D, E, and F. This anneal treatment combination is applied to a set of eight wafers, with each of these eight wafers coming from one of the eight implant treatment combinations. After this first batch of wafers has been annealed, the second anneal treatment is applied to a second batch of eight wafers, with these eight wafers coming from one each of the eight implant treatment combinations. This is continued until the last batch of eight wafers has been annealed with a particular combination of factors D, E, and F. +Running the experiment in this way results in a strip-plot design with three sizes of experimental units. A set of eight wafers that are implanted together is the experimental unit for the implant factors A, B, and C and for all of their interactions. There are eight experimental units for the implant factors. A different set of eight wafers is annealed together. This different set of eight wafers is the second size of experimental unit and is the experimental unit for the anneal factors D, E, and F and for all of their interactions. The third size of experimental unit is a single wafer. 
This is the experimental unit for all of the interaction effects between the implant factors and the anneal factors. +Actually, the above description of the strip-plot design represents one block or one replicate of this experiment. If the experiment contains no replication and the model for the implant contains only the main effects and two-factor interactions, the three-factor interaction term A*B*C (1 degree of freedom) provides the error term for the estimation of effects within the implant experimental unit. Invoking a similar model for the anneal experimental unit produces the three-factor interaction term D*E*F for the error term (1 degree of freedom) for effects within the anneal experimental unit. + +== See also == + +Hierarchical linear modeling +Mixed-design analysis of variance +Multilevel model +Nested case-control study + +== References == + +"How can I account for nested variation (restricted randomization)?". (U.S.) National Institute of Standards and Technology: Information Technology Laboratory. Retrieved March 26, 2012. + +== Further reading == +For a more detailed discussion of these designs and the appropriate analysis procedures, see: + +Milliken, G. A.; Johnson, D. E. (1984). Analysis of Messy Data. Vol. 1. New York: Van Nostrand Reinhold. +Miller, A. (1997). "Strip-Plot Configuration of Fractional Factorials". Technometrics. 39 (2): 153–161. doi:10.2307/1270903. JSTOR 1270903. 
+ +== External links == +Examples of all ANOVA and ANCOVA models with up to three treatment factors, including randomized block, split plot, repeated measures, and Latin squares, and their analysis in R + + This article incorporates public domain material from the National Institute of Standards and Technology \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Riffle_splitter-0.md b/data/en.wikipedia.org/wiki/Riffle_splitter-0.md new file mode 100644 index 000000000..2afd0d980 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Riffle_splitter-0.md @@ -0,0 +1,47 @@ +--- +title: "Riffle splitter" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Riffle_splitter" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:48.272622+00:00" +instance: "kb-cron" +--- + +The riffle splitter is a device used to divide a bulk sample of material into smaller, representative sub-samples. It can be used in laboratory settings or fieldwork. +The device is usually constructed from steel sheet and should be designed to have an even number of opposing inclined chutes (the riffles), with each chute having the same width. The chute width should be at least 2.5× the maximum particle diameter that can be found in the lot to be split. Riffle splitters are typically used in assay and analytical laboratories to reduce the size of samples provided from other sources (crushed rock, soils, powders and so on) to a lot size that is appropriate for the next stage of analytical sample preparation. + + +== Design == +There are many different versions of the riffle splitter. However, not all can be considered correct sub-sampling devices, that is, devices whose two sub-sample halves can be deemed representative of the original lot. The correctness of a riffle-split sub-sample is a function of both the design and the use of the splitter. 
+The key design items are: + +The splitter should have an even number of riffles so the two sub-samples have the same mass +The chute widths should be at least 2.5× the maximum particle size so as to preclude blocking of the chutes by groups of larger fragments in the lot. Typically this maximum particle size is on the order of 15 mm for most commercially available riffle splitters +The chutes should have relatively sharp edges so a fragment falling onto a chute edge is directed into an adjacent chute and does not bounce away into a non-edge-adjacent chute +The feeder to the chute (either a pan or dump-box) should have the same width as the set of riffles +The design of tiered riffle splitters (splitters stacked in tiers above one another) is often incorrect as the outflow from an upper tier does not usually fall vertically in the centre of the next tier (see correct use below) + + +== Use == +Incorrect use of a riffle splitter will lead to sample biases, with the sub-lots potentially having unacceptably higher or lower concentrations of the lot analytes or attributes being measured. The main operational factors are as follows: + +The splitter should be operated in an environment where any dust generated is captured in an appropriate manner that does not endanger the health and safety of the operator, with operators using appropriate dust masks when splitting potentially hazardous materials, such as rock silica dust. +The primary lot being split must be dry and free-flowing. +The splitter should be fed by a pan, bucket or hopper that is the same width as the set of riffles in the device. +Before splitting the material in the feed, the device should be leveled such that approximately the same volume of material will pass through each riffle. Dumping material from bags, buckets or shovels is incorrect and may lead to unacceptable sub-sampling biases. 
+The material from the primary lot should be fed slowly to the centre of the device so as to fall vertically under gravity through the splitter. Feeding the material to one side of the device usually results in flooding of the riffles, with the excess material (often the fine fraction) flowing into an adjacent riffle (see Gy, 1982 p. 295). +A good practice is to alternate between the two sub-lots from sample to sample when selecting the sub-lot that is to be subject to analysis or further splitting or other sample preparation, so as to mitigate any potential bias that may occur by always selecting the sub-lot from the same side of the splitter. +Importantly, the splitter needs to be cleaned after each use using a brush and/or compressed air streams to blow away any retained dust. + + +== Use in mineral exploration == +In many real-world situations outside the laboratory, obeying the correct use advice above is not always possible. In particular, tiered riffle splitters are widely used in the mineral industry for sub-sampling drill hole cuttings at the drilling site. These devices are problematic in that they are usually fed rapidly, the dump-devices are not well designed to allow the material to flow evenly and freely, and the volume of material and its sometimes moist state often result in choking of the splitter, overflows and sample losses. The best approach is usually to slow the rate of drilling, split the primary lot after drilling as a separate exercise (not part of the drilling routine) and only split samples that are dry and free-flowing (all others need to be dried and crushed). Importantly, replicate samples from the splitter rejects need to be collected regularly to monitor the potential splitter bias that may occur when the analytical sub-sample is always collected from the same side of the splitting device. 
+ + +== See also == +Sampling (statistics) +Gy's sampling theory + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Round-robin_test-0.md b/data/en.wikipedia.org/wiki/Round-robin_test-0.md new file mode 100644 index 000000000..d184943ca --- /dev/null +++ b/data/en.wikipedia.org/wiki/Round-robin_test-0.md @@ -0,0 +1,33 @@ +--- +title: "Round-robin test" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Round-robin_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:49.467773+00:00" +instance: "kb-cron" +--- + +In experimental methodology, a round-robin test is an interlaboratory test (measurement, analysis, or experiment) performed independently several times. This can involve multiple independent scientists performing the test with the use of the same method in different equipment, or a variety of methods and equipment. In reality it is often a combination of the two, for example if a sample is analysed, or one (or more) of its properties is measured by different laboratories using different methods, or even just by different units of equipment of identical construction. +A round-robin program is a measurement systems analysis technique which uses analysis of variance (ANOVA) random effects model to assess a measurement system. + + +== Round-robin tests regarding occupational safety and health == +Companies are obliged to determine whether hazardous substances are in the air at the workplace by using appropriate measurements. To maintain the quality parameters for analytic procedures, the quality assurance methods are to be applied that are state-of-the-art. The Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA) provides proficiency testing as support and assistance for the measuring stations worldwide for the purpose of self-examination and for the external presentation of quality standards. 
The PT schemes are conducted according to DIN EN ISO/IEC 17043 and DIN ISO 13528. The IFA organises the proficiency testing schemes in co-operation with the German association of environmental and OSH measurement bodies (BUA) and in collaboration with international partner institutes. + + +== Purpose == +There are different reasons for performing a round-robin test: + +determination of the reproducibility of a test method or process +verification of a new method of analysis. If a new method of analysis has been developed, a round-robin test involving proven methods would verify whether the new method produces results that agree with the established method. +providing a basis, by interlaboratory testing, for certificates of quantitative analysis on a given material in certified reference materials production. + + +== Further info == +ASTM E691 Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method +IUPAC: Round Robin Test on the Molecular Characterization of Epoxy Resins by Liquid Chromatography +NIST/SEMATECH (2008) Handbook of Statistical Methods + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-0.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-0.md index 3ce263665..f9d0c1d6f 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-0.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-0.md @@ -4,7 +4,7 @@ chunk: 1/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-1.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-1.md index fd23cc4e4..e2d6bc8da 100644 --- 
a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-1.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-1.md @@ -4,7 +4,7 @@ chunk: 2/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-10.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-10.md index 86deb26e5..1a0cf6cee 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-10.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-10.md @@ -4,7 +4,7 @@ chunk: 11/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-11.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-11.md index 96836a1d7..61ff6d7fe 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-11.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-11.md @@ -4,7 +4,7 @@ chunk: 12/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-12.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-12.md index fb43e1a07..8ace704d4 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-12.md +++ 
b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-12.md @@ -4,7 +4,7 @@ chunk: 13/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-13.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-13.md index 402cc0fb4..82d52178c 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-13.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-13.md @@ -4,7 +4,7 @@ chunk: 14/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-14.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-14.md index dee86aad0..2f0d13a99 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-14.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-14.md @@ -4,7 +4,7 @@ chunk: 15/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-15.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-15.md index b61a3450c..c6ec58f4d 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-15.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-15.md @@ -4,7 +4,7 @@ chunk: 16/17 source: 
"https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-16.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-16.md index fe3daba6c..502367bdf 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-16.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-16.md @@ -4,7 +4,7 @@ chunk: 17/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-2.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-2.md index f81931f38..273999512 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-2.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-2.md @@ -4,7 +4,7 @@ chunk: 3/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-3.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-3.md index d452d56d1..136e611e7 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-3.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-3.md @@ -4,7 +4,7 @@ chunk: 4/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: 
"2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-4.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-4.md index 178f2045e..9a0a9be62 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-4.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-4.md @@ -4,7 +4,7 @@ chunk: 5/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-5.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-5.md index d43a4fea2..655a8c667 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-5.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-5.md @@ -4,7 +4,7 @@ chunk: 6/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-6.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-6.md index d2da960b3..f8cacf2d4 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-6.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-6.md @@ -4,7 +4,7 @@ chunk: 7/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git 
a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-7.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-7.md index 4271d92d1..957b7322a 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-7.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-7.md @@ -4,7 +4,7 @@ chunk: 8/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-8.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-8.md index 40bd32556..f97e2bfc1 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-8.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-8.md @@ -4,7 +4,7 @@ chunk: 9/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-9.md b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-9.md index 0e7a1f3de..6054e87d6 100644 --- a/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-9.md +++ b/data/en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism-9.md @@ -4,7 +4,7 @@ chunk: 10/17 source: "https://en.wikipedia.org/wiki/Royal_Commission_on_Animal_Magnetism" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:43.347187+00:00" +date_saved: "2026-05-05T09:51:50.817218+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Sample_ratio_mismatch-0.md b/data/en.wikipedia.org/wiki/Sample_ratio_mismatch-0.md new file mode 100644 index 
000000000..2b927017d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Sample_ratio_mismatch-0.md @@ -0,0 +1,20 @@ +--- +title: "Sample ratio mismatch" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Sample_ratio_mismatch" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:52.013933+00:00" +instance: "kb-cron" +--- + +In the design of experiments, a sample ratio mismatch (SRM) is a statistically significant difference between the expected and actual ratios of the sizes of treatment and control groups in an experiment. Sample ratio mismatches, also known as unbalanced sampling, often occur in online controlled experiments due to failures in randomization and instrumentation. +Sample ratio mismatches can be detected using a chi-squared test. Using methods to detect SRM can help non-experts avoid making decisions using biased data. If the sample size is large enough, even a small discrepancy between the observed and expected group sizes can invalidate the results of an experiment. + + +== Example == +Suppose we run an A/B test in which we randomly assign 1000 users to equally sized treatment and control groups (a 50–50 split). The expected size of each group is 500. However, the actual sizes of the treatment and control groups are 600 and 400. +Using Pearson's chi-squared goodness of fit test, we find a sample ratio mismatch with a p-value of 2.54×10⁻¹⁰. In other words, if the assignment of users were truly random, the probability of observing a split at least this uneven by chance is 2.54×10⁻¹⁰. 
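The calculation in this example can be sketched in a few lines. The function below is a minimal illustration, not a reference implementation: the name `srm_p_value` and its parameters are hypothetical, and it covers only the two-group case, where the chi-squared statistic has one degree of freedom and its tail probability reduces to a complementary error function.

```python
import math


def srm_p_value(observed_a, observed_b, expected_ratio_a=0.5):
    """Pearson chi-squared goodness-of-fit test for a two-group split.

    Returns (chi2, p). Hypothetical helper for illustration only.
    """
    total = observed_a + observed_b
    expected_a = total * expected_ratio_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # With two groups there is 1 degree of freedom, so the chi-squared
    # survival function equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


chi2, p = srm_p_value(600, 400)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

For the 600 versus 400 split the statistic is 20 + 20 = 40, and the resulting p-value agrees with the figure quoted in the example to three significant digits. A library routine such as SciPy's chi-squared test would be the usual choice in practice; the standard-library version is used here only to keep the sketch self-contained.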
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Saturated_array-0.md b/data/en.wikipedia.org/wiki/Saturated_array-0.md new file mode 100644 index 000000000..f38f13e5c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Saturated_array-0.md @@ -0,0 +1,11 @@ +--- +title: "Saturated array" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Saturated_array" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:53.168008+00:00" +instance: "kb-cron" +--- + +In experiments in which additional factors are not likely to interact with any of the other factors, a saturated array can be used. In a saturated array, a controllable factor is substituted for the interaction of two or more by-products. Using a saturated array, a two-factor test matrix could be used to test three factors. Using the saturated array allows three factors to be tested in four tests rather than in eight, as would be required by a standard orthogonal array. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Scheirer–Ray–Hare_test-0.md b/data/en.wikipedia.org/wiki/Scheirer–Ray–Hare_test-0.md new file mode 100644 index 000000000..45543f034 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Scheirer–Ray–Hare_test-0.md @@ -0,0 +1,28 @@ +--- +title: "Scheirer–Ray–Hare test" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Scheirer–Ray–Hare_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:54.386214+00:00" +instance: "kb-cron" +--- + +The Scheirer–Ray–Hare (SRH) test is a statistical test that can be used to examine whether a measure is affected by two or more factors. Since it does not require a normal distribution of the data, it is one of the non-parametric methods. It is an extension of the Kruskal–Wallis test, the non-parametric equivalent for one-way analysis of variance (ANOVA), to the application for more than one factor. 
It is thus a non-parametric alternative to multi-factorial ANOVA. The test is named after James Scheirer, William Ray and Nathan Hare, who published it in 1976. + + +== Test description == +The Scheirer–Ray–Hare test is analogous to the parametric multi-factorial ANOVA in that it investigates the influence of two different factors on a measure for which different samples are available for the factors. As with the parametric analysis of variance, the test can be used to investigate the null hypotheses that each of the two factors examined has no influence on the location parameter of the samples and thus on the measure, and that there is no interaction between the two factors. A p-value below the chosen significance level (conventionally 0.05) for any of these three hypotheses leads to its rejection. As with many other non-parametric methods, the analysis relies on the ranks of the observations rather than the actual observations. Modifications also allow extending the test to examine more than two factors. +The statistical power of the Scheirer–Ray–Hare test, i.e. the probability of actually finding a statistically significant result, is considerably lower than that of the parametric multi-factorial ANOVA, so it is considered the more conservative of the two methods. For this reason, and because the method was described later than most other parametric and non-parametric variance analysis tests, it has found little use in textbooks and statistical analysis software. With computer programs that contain a function for parametric multi-factorial ANOVA, however, the Scheirer–Ray–Hare test can be calculated with some additional manual effort. +Since the Scheirer–Ray–Hare test only makes a statement about differences among all samples considered, it makes sense to perform a post-hoc test that compares the individual samples in pairs. 
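The rank-based calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes a balanced two-factor design (equal numbers of observations per cell), uses the common formulation in which each sum of squares computed on the ranks is divided by the total mean square of the ranks, and the function names and toy data are hypothetical. P-values are omitted; each H statistic would be compared with a chi-squared distribution with the same degrees of freedom as the corresponding ANOVA effect.

```python
from collections import defaultdict


def average_ranks(values):
    """Rank values from 1..N, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srh_statistics(observations):
    """H statistics for factor A, factor B, and their interaction.

    observations: list of (level_a, level_b, value); balanced design assumed.
    """
    ranks = average_ranks([value for _, _, value in observations])
    n_total = len(ranks)
    grand = sum(ranks) / n_total
    ms_total = sum((r - grand) ** 2 for r in ranks) / (n_total - 1)

    def ss_between(key):
        # Between-group sum of squares of the ranks for a given grouping.
        groups = defaultdict(list)
        for obs, r in zip(observations, ranks):
            groups[key(obs)].append(r)
        return sum(len(g) * (sum(g) / len(g) - grand) ** 2
                   for g in groups.values())

    ss_a = ss_between(lambda o: o[0])
    ss_b = ss_between(lambda o: o[1])
    ss_ab = ss_between(lambda o: (o[0], o[1])) - ss_a - ss_b
    return {"A": ss_a / ms_total, "B": ss_b / ms_total, "A:B": ss_ab / ms_total}


# Toy balanced 2x2 data set with two observations per cell (made up).
data = [("a1", "b1", 1), ("a1", "b1", 2), ("a1", "b2", 3), ("a1", "b2", 4),
        ("a2", "b1", 5), ("a2", "b1", 6), ("a2", "b2", 7), ("a2", "b2", 8)]
h = srh_statistics(data)
print(h)
```

The interaction sum of squares is obtained by subtraction, which is only valid for balanced designs; unbalanced data would need a different decomposition.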
+ + +== Alternative procedures == +The parametric alternative to the Scheirer–Ray–Hare test is multi-factorial ANOVA, which requires a normal distribution of data within the samples. The Kruskal–Wallis test, from which the Scheirer–Ray–Hare test is derived, serves in contrast to this to investigate the influence of exactly one factor on the measured variable. A non-parametric test comparing exactly two unpaired samples is the Wilcoxon–Mann–Whitney test. + + +== References == + + +== Literature == +Robert R. Sokal, F. James Rohlf: Biometry: The Principles And Practice of Statistics In Biological Research. Third edition. Freeman, New York 1995, ISBN 0-71-672411-1, pp. 445–447 \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-0.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-0.md index 3d8427edc..0d0be2c21 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-0.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-0.md @@ -4,7 +4,7 @@ chunk: 1/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-1.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-1.md index 77787a300..b56f20eb8 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-1.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-1.md @@ -4,7 +4,7 @@ chunk: 2/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-10.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-10.md index 3a6384463..d56466164 100644 --- 
a/data/en.wikipedia.org/wiki/Scholarly_peer_review-10.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-10.md @@ -4,7 +4,7 @@ chunk: 11/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-11.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-11.md index 1a85bae50..a3a9b2041 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-11.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-11.md @@ -4,7 +4,7 @@ chunk: 12/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-2.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-2.md index cfe582cce..48dc939c0 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-2.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-2.md @@ -4,7 +4,7 @@ chunk: 3/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-3.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-3.md index 0cf752ad8..82102bbdd 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-3.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-3.md @@ -4,7 +4,7 @@ chunk: 4/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" 
--- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-4.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-4.md index 8ff288b74..c5146caf5 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-4.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-4.md @@ -4,7 +4,7 @@ chunk: 5/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-5.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-5.md index d68056c29..367e6cd41 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-5.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-5.md @@ -4,7 +4,7 @@ chunk: 6/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-6.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-6.md index 4473f1a23..c2a9be099 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-6.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-6.md @@ -4,7 +4,7 @@ chunk: 7/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-7.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-7.md index 025692d7d..2f8f65abd 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-7.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-7.md @@ -4,7 +4,7 @@ chunk: 8/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: 
"reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-8.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-8.md index 114b133f1..b39d17ff8 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-8.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-8.md @@ -4,7 +4,7 @@ chunk: 9/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scholarly_peer_review-9.md b/data/en.wikipedia.org/wiki/Scholarly_peer_review-9.md index 9b75199ed..44edace2a 100644 --- a/data/en.wikipedia.org/wiki/Scholarly_peer_review-9.md +++ b/data/en.wikipedia.org/wiki/Scholarly_peer_review-9.md @@ -4,7 +4,7 @@ chunk: 10/12 source: "https://en.wikipedia.org/wiki/Scholarly_peer_review" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:28:11.658490+00:00" +date_saved: "2026-05-05T09:53:18.276870+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Scientific_protocol-0.md b/data/en.wikipedia.org/wiki/Scientific_protocol-0.md index 847ff7ef1..c324843f6 100644 --- a/data/en.wikipedia.org/wiki/Scientific_protocol-0.md +++ b/data/en.wikipedia.org/wiki/Scientific_protocol-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Scientific_protocol" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:29:00.933176+00:00" +date_saved: "2026-05-05T09:51:55.594780+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Sealedenvelope.com-0.md b/data/en.wikipedia.org/wiki/Sealedenvelope.com-0.md new file mode 100644 index 000000000..75f864373 --- /dev/null +++ 
b/data/en.wikipedia.org/wiki/Sealedenvelope.com-0.md @@ -0,0 +1,18 @@ +--- +title: "Sealedenvelope.com" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Sealedenvelope.com" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:56.773434+00:00" +instance: "kb-cron" +--- + +Sealedenvelope.com is a British collaboration that provides support services for clinical trials. They provide services such as randomization, allocation concealment, code-break services, and case report management through a web-based design. They also perform certain calculations such as power calculations. + + +== References == + + +== External links == +Official website \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Self-selection_bias-0.md b/data/en.wikipedia.org/wiki/Self-selection_bias-0.md new file mode 100644 index 000000000..6e10bc3ae --- /dev/null +++ b/data/en.wikipedia.org/wiki/Self-selection_bias-0.md @@ -0,0 +1,36 @@ +--- +title: "Self-selection bias" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Self-selection_bias" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:57.984111+00:00" +instance: "kb-cron" +--- + +In statistics, self-selection bias arises in any situation in which individuals select themselves into a group, causing a biased sample with nonprobability sampling. It is commonly used to describe situations where the characteristics of the people that cause them to select themselves into the group create abnormal or undesirable conditions in the group. It is closely related to non-response bias, which describes the situation in which the group of people responding has different responses than the group of people not responding. +Self-selection bias is a major problem in research in sociology, psychology, economics and many other social sciences. In such fields, a poll suffering from such bias is termed a self-selected listener opinion poll or "SLOP". 
+The term is also used in criminology to describe the process by which specific predispositions may lead an offender to choose a criminal career and lifestyle. +While the effects of self-selection bias are closely related to those of selection bias, the problem arises for rather different reasons; thus there may be a purposeful intent on the part of respondents leading to self-selection bias whereas other types of selection bias may arise more inadvertently, possibly as the result of mistakes by those designing any given study. + + +== Explanation == +Self-selection makes determination of causation more difficult. For example, when attempting to assess the effect of a test preparation course on participants' test scores, significantly higher test scores might be observed among students who choose to participate in the preparation course itself. Due to self-selection, there may be a number of differences between the people who choose to take the course and those who choose not to, such as motivation, socioeconomic status, or prior test-taking experience. Due to self-selection according to such factors, a significant difference in mean test scores could be observed between the two populations independent of any ability of the course to affect test scores. An outcome might be that those who elect to do the preparation course would have achieved higher scores in the actual test anyway. If the study measures an improvement in absolute test scores due to participation in the preparation course, the results may be skewed to show a larger effect. A relative measure of 'improvement' might improve the reliability of the study somewhat, but only partially. +Self-selection bias causes problems for research about programs or products. In particular, self-selection affects evaluation of whether or not a given program has some effect, and complicates interpretation of market research. 
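The test preparation example can be illustrated with a small simulation. The following is a minimal sketch, not an empirical model: the function name `simulate_prep_course`, the score scale, the enrolment probabilities and the strength of the "motivation" confounder are all assumptions chosen for illustration. The course is given a true effect of zero, yet the naive comparison of mean scores still favours the students who enrolled.

```python
import random
import statistics


def simulate_prep_course(n_students=20_000, true_effect=0.0, seed=0):
    """Simulate test scores where motivated students self-select into a course.

    All numbers are illustrative assumptions, not empirical estimates.
    Returns (naive_difference_in_means, true_effect).
    """
    rng = random.Random(seed)
    enrolled_scores = []
    other_scores = []
    for _ in range(n_students):
        motivation = rng.gauss(0.0, 1.0)  # hidden confounder
        # More motivated students are more likely to enrol (self-selection).
        enrols = rng.random() < (0.8 if motivation > 0 else 0.2)
        # The score depends on motivation, whether or not the student enrols.
        score = 500 + 50 * motivation + rng.gauss(0.0, 30.0)
        if enrols:
            enrolled_scores.append(score + true_effect)
        else:
            other_scores.append(score)
    naive_difference = (
        statistics.mean(enrolled_scores) - statistics.mean(other_scores)
    )
    return naive_difference, true_effect


naive, actual = simulate_prep_course()
print(f"naive difference in means: {naive:.1f}")
print(f"true course effect:        {actual:.1f}")
```

Because enrolment and scores both depend on motivation, the difference in means reflects who chose the course rather than what the course did. Assigning students to the course at random, independently of motivation, would remove this gap in the simulation.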
+The Roy model provides one of the earliest academic illustrations of the self-selection problem. + + +== See also == +Convenience sampling +Sampling bias +Selection bias + + +== References == + +Jacobs, B., Hartog, J., Vijverberg, W. (2009) "Self-selection bias in estimated wage premiums for earnings risk", Empirical Economics, 37 (2), 271–286. doi:10.1007/s00181-008-0231-0 + + +== External links == +Self-selection bias at Moneyterms +Self-selection bias at the Skeptic's Dictionary \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Sequential_analysis-0.md b/data/en.wikipedia.org/wiki/Sequential_analysis-0.md new file mode 100644 index 000000000..7929fcc4b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Sequential_analysis-0.md @@ -0,0 +1,68 @@ +--- +title: "Sequential analysis" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Sequential_analysis" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:51:59.108535+00:00" +instance: "kb-cron" +--- + +In statistics, sequential analysis or sequential hypothesis testing is statistical analysis where the sample size is not fixed in advance. Instead data is evaluated as it is collected, and further sampling is stopped in accordance with a pre-defined stopping rule as soon as significant results are observed. Thus a conclusion may sometimes be reached at a much earlier stage than would be possible with more classical hypothesis testing or estimation, at consequently lower financial and/or human cost. + + +== History == +The method of sequential analysis is first attributed to Abraham Wald with Jacob Wolfowitz, W. Allen Wallis, and Milton Friedman while at Columbia University's Statistical Research Group as a tool for more efficient industrial quality control during World War II. Its value to the war effort was immediately recognised, and led to its receiving a "restricted" classification. 
At the same time, George Barnard led a group working on optimal stopping in Great Britain. Another early contribution to the method was made by K.J. Arrow with D. Blackwell and M.A. Girshick. +A similar approach was independently developed from first principles at about the same time by Alan Turing, as part of the Banburismus technique used at Bletchley Park, to test hypotheses about whether different messages coded by German Enigma machines should be connected and analysed together. This work remained secret until the early 1980s. +Peter Armitage introduced the use of sequential analysis in medical research, especially in the area of clinical trials. Sequential methods became increasingly popular in medicine following Stuart Pocock's work that provided clear recommendations on how to control Type 1 error rates in sequential designs. + + +== Alpha spending functions == +When researchers repeatedly analyze data as more observations are added, the probability of a Type 1 error increases. Therefore, it is important to adjust the alpha level at each interim analysis, such that the overall Type 1 error rate remains at the desired level. This is conceptually similar to using the Bonferroni correction, but because the repeated looks at the data are dependent, more efficient corrections for the alpha level can be used. Among the earliest proposals is the Pocock boundary. Alternative ways to control the Type 1 error rate exist, such as the Haybittle–Peto bounds, and additional work on determining the boundaries for interim analyses has been done by O'Brien & Fleming and Wang & Tsiatis. +A limitation of corrections such as the Pocock boundary is that the number of looks at the data must be determined before the data is collected, and that the looks at the data should be equally spaced (e.g., after 50, 100, 150, and 200 patients). 
The alpha spending function approach developed by DeMets & Lan does not have these restrictions and, depending on the parameters chosen for the spending function, can be very similar to Pocock boundaries or the corrections proposed by O'Brien and Fleming. Another approach that has no such restrictions at all is based on e-values and e-processes. + + +== Applications of sequential analysis == + + +=== Clinical trials === +In a randomized trial with two treatment groups, group sequential testing may for example be conducted in the following manner: After n subjects in each group are available, an interim analysis is conducted. A statistical test is performed to compare the two groups, and if the null hypothesis is rejected the trial is terminated; otherwise, the trial continues, another n subjects per group are recruited, and the statistical test is performed again, including all subjects. If the null is rejected, the trial is terminated; otherwise it continues with periodic evaluations until a maximum number of interim analyses have been performed, at which point the last statistical test is conducted and the trial is discontinued. + + +=== Other applications === +Sequential analysis also has a connection to the problem of gambler's ruin that has been studied by, among others, Huygens in 1657. +Step detection is the process of finding abrupt changes in the mean level of a time series or signal. It is usually considered as a special kind of statistical method known as change point detection. Often, the step is small and the time series is corrupted by some kind of noise, and this makes the problem challenging because the step may be hidden by the noise. Therefore, statistical and/or signal processing algorithms are often required. When the algorithms are run online as the data is coming in, especially with the aim of producing an alert, this is an application of sequential analysis. 
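The group sequential procedure described under "Clinical trials" can be sketched as a simulation. This is an illustrative sketch under stated assumptions rather than a reference implementation: the helper names `run_trial` and `type1_error_rate` are hypothetical, outcomes are assumed normal with known unit variance, the null hypothesis is true, and there are five equally spaced looks of 20 subjects per group. The value 2.413 is the commonly tabulated Pocock critical value for five looks at a two-sided overall alpha of 0.05 and should be checked against a reference table before any real use.

```python
import math
import random


def run_trial(rng, n_per_look, max_looks, critical_z, effect=0.0):
    """Run one two-arm trial with interim analyses.

    Returns the look at which the null was rejected, or None if it never was.
    """
    sum_treat = 0.0
    sum_ctrl = 0.0
    n = 0
    for look in range(1, max_looks + 1):
        # Recruit another n_per_look subjects in each group.
        for _ in range(n_per_look):
            sum_treat += rng.gauss(effect, 1.0)
            sum_ctrl += rng.gauss(0.0, 1.0)
        n += n_per_look
        # z-test on all subjects so far (outcome variance assumed known, = 1).
        z = (sum_treat / n - sum_ctrl / n) / math.sqrt(2.0 / n)
        if abs(z) >= critical_z:
            return look  # null rejected: the trial stops early
    return None


def type1_error_rate(critical_z, n_trials=4000, seed=1):
    """Fraction of null trials that reject at any of the five looks."""
    rng = random.Random(seed)
    rejections = sum(
        run_trial(rng, n_per_look=20, max_looks=5, critical_z=critical_z)
        is not None
        for _ in range(n_trials)
    )
    return rejections / n_trials


naive_rate = type1_error_rate(1.96)    # unadjusted threshold at every look
pocock_rate = type1_error_rate(2.413)  # assumed Pocock constant for 5 looks
print(f"unadjusted: {naive_rate:.3f}, Pocock: {pocock_rate:.3f}")
```

With the unadjusted threshold, the simulated Type 1 error rate is well above the nominal 0.05, while the stricter constant at each look brings it back close to that level. Both runs use the same seed, so the two thresholds are compared on identical simulated data.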
+ + +== Bias == +Trials that are terminated early because they reject the null hypothesis typically overestimate the true effect size. This is because in small samples, only large effect size estimates will lead to a significant effect, and the subsequent termination of a trial. Methods to correct effect size estimates in single trials have been proposed. Note that this bias is mainly problematic when interpreting single studies. In meta-analyses, overestimated effect sizes due to early stopping are balanced by underestimation in trials that stop late, leading Schou & Marschner to conclude that "early stopping of clinical trials is not a substantive source of bias in meta-analyses". +The meaning of p-values in sequential analyses also changes: because more than one analysis is performed, the usual definition of a p-value as the probability of data “at least as extreme” as those observed needs to be redefined. One solution is to order the p-values of a series of sequential tests based on the time of stopping and how high the test statistic was at a given look, which is known as stagewise ordering, first proposed by Armitage. + + +== See also == +Optimal stopping +Sequential estimation +Sequential probability ratio test +CUSUM + + +== Notes == + + +== References == +Wald, Abraham (1947). Sequential Analysis. New York: John Wiley and Sons. +Bartroff, J., Lai, T.L., and Shih, M.-C. (2013) Sequential Experimentation in Clinical Trials: Design and Analysis. Springer. +Ghosh, Bhaskar Kumar (1970). Sequential Tests of Statistical Hypotheses. Reading: Addison-Wesley. +Chernoff, Herman (1972). Sequential Analysis and Optimal Design. SIAM. +Siegmund, David (1985). Sequential Analysis. Springer Series in Statistics. New York: Springer-Verlag. ISBN 978-0-387-96134-7. +Bakeman, R., Gottman, J.M., (1997) Observing Interaction: An Introduction to Sequential Analysis, Cambridge: Cambridge University Press +Jennison, C. 
and Turnbull, B.W (2000) Group Sequential Methods With Applications to Clinical Trials. Chapman & Hall/CRC. +Whitehead, J. (1997). The Design and Analysis of Sequential Clinical Trials, 2nd Edition. John Wiley & Sons. + + +== External links == +R Package: Wald's Sequential Probability Ratio Test by OnlineMarketr.com +Software for conducting sequential analysis and applications of sequential analysis in the study of group interaction in computer-mediated communication by Dr. Allan Jeong at Florida State University +SAMBO Optimization – a Python framework for sequential, model-based optimization. +Commercial +PASS Sample Size Software includes features for the setup of group sequential designs. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Seriousness_check-0.md b/data/en.wikipedia.org/wiki/Seriousness_check-0.md index 8e07775d0..22a66a1e4 100644 --- a/data/en.wikipedia.org/wiki/Seriousness_check-0.md +++ b/data/en.wikipedia.org/wiki/Seriousness_check-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Seriousness_check" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:39:46.199712+00:00" +date_saved: "2026-05-05T09:52:00.364244+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Set_balancing-0.md b/data/en.wikipedia.org/wiki/Set_balancing-0.md new file mode 100644 index 000000000..04ff364b4 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Set_balancing-0.md @@ -0,0 +1,916 @@ +--- +title: "Set balancing" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Set_balancing" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:01.568529+00:00" +instance: "kb-cron" +--- + +The set balancing problem in mathematics is the problem of dividing a set to two subsets that have roughly the same characteristics. It arises naturally in design of experiments. +There is a group of subjects. Each subject has several features, which are considered binary. 
For example: each subject can be either young or old; either black or white; either tall or short; etc. The goal is to divide the subjects to two sub-groups: treatment group (T) and control group (C), such that for each feature, the number of subjects that have this feature in T is roughly equal to the number of subjects that have this feature in C. E.g., both groups should have roughly the same number of young people, the same number of black people, the same number of tall people, etc. + + +== Matrix representation == +Formally, the set balancing problem can be described as follows. + +$m$ is the number of subjects in the general population. +$n$ is the number of potential features. +The subjects are described by $A$, an $n \times m$ matrix with entries in $\{0,1\}$. Each column represents a subject and each row represents a feature. $a_{i,j}=1$ if subject $j$ has feature $i$, and $a_{i,j}=0$ if subject $j$ does not have feature $i$. +The partition to groups is described by $b$, an $m \times 1$ vector with entries in $\{-1,1\}$. $b_{j}=1$ if subject $j$ is in the treatment group T and $b_{j}=-1$ if subject $j$ is in the control group C. 
+The balance of features is described by $c = A \cdot b$. This is an $n \times 1$ vector. The numeric value of $c_{i}$ is the imbalance in feature $i$: if $c_{i} > 0$ then there are more subjects with $i$ in T and if $c_{i} < 0$ then there are more subjects with $i$ in C. +The imbalance of a given partition is defined as: + +$$I(b) = \|A \cdot b\|_{\infty} = \max_{i \in \{1,\dots,n\}} |c_{i}|$$ + +The set balancing problem is to find a vector $b$ which minimizes the imbalance $I(b)$. + + +== Randomized algorithm == +An approximate solution can be found with the following very simple randomized algorithm: + +Send each subject to the treatment group with probability 1/2. +In matrix formulation: + +Choose the elements of $b$ randomly with probability 1/2 to each value in $\{1,-1\}$. +Surprisingly, although this algorithm completely ignores the matrix $A$, it achieves a small imbalance with high probability when there are many features. 
Formally, for a random vector $b$: + +$$\mathrm{Prob}\left[I(b) \geq \sqrt{4m\ln n}\right] \leq \frac{2}{n}$$ + +PROOF: +Let $k_{i}$ be the total number of subjects that have feature $i$ (equivalently, the number of ones in the $i$-th row of the matrix $A$). Consider the following two cases: +Easy case: $k_{i} \leq \sqrt{4m\ln n}$. Then, with probability 1, the imbalance in feature $i$ (that we marked by $c_{i}$) is at most $\sqrt{4m\ln n}$. +Hard case: $k_{i} > \sqrt{4m\ln n}$. For every $j$, let $X_{j} = a_{i,j} b_{j}$. Each such $X_{j}$ is a random variable that can be either 1 or -1 with probability 1/2. The imbalance in feature $i$ is: $c_{i} = \sum_{j=1}^{m} X_{j}$. 
Since the $X_{j}$ are independent random variables, by the Chernoff bound, for every $a > 0$: + +$$\mathrm{Prob}\left[|c_{i}| \geq a\right] \leq 2\exp(-a^{2}/2m)$$ + +Select: $a = \sqrt{4m\ln n}$ and get: + +$$\mathrm{Prob}\left[|c_{i}| \geq \sqrt{4m\ln n}\right] \leq \frac{2}{n^{2}}$$ + +By the union bound, + +$$\mathrm{Prob}\left[\exists i : |c_{i}| \geq \sqrt{4m\ln n}\right] \leq \frac{2}{n}$$ +. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Sham_peer_review-0.md b/data/en.wikipedia.org/wiki/Sham_peer_review-0.md new file mode 100644 index 000000000..0cdc8a27b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Sham_peer_review-0.md @@ -0,0 +1,32 @@ +--- +title: "Sham peer review" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Sham_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:19.440577+00:00" +instance: "kb-cron" +--- + +Sham peer review or malicious peer review is a name given to the abuse of a medical peer review process to attack a doctor for personal or other non-medical reasons. 
The American Medical Association conducted an investigation of medical peer review in 2007 and concluded that while it is easy to allege misconduct and 15% of surveyed physicians indicated that they were aware of peer review misuse or abuse, cases of malicious peer review able to be proven through the legal system are rare. + +== Legal basis == +Those who maintain that sham peer review is a pervasive problem suggest that the Healthcare Quality Improvement Act (HCQIA) of 1986 allows sham reviews by granting significant immunity from liability to doctors and others who participate in peer reviews. This immunity extends to investigative activities as well as to any associated peer review hearing, whether or not it leads to a disciplinary (or other) action. +The definition of a peer review body can be broad, including not only individuals but also (for example, in Oregon), "tissue committees, governing bodies or committees including medical staff committees of a [licensed] health care facility...or any other medical group in connection with bona fide medical research, quality assurance, utilization review, credentialing, education, training, supervision or discipline of physicians or other health care providers." +The California legislature framed its statutes so as to allow "aggrieved physicians the opportunity to prove that the peer review to which they were subject was in fact carried out for improper purposes, i.e., for purposes unrelated to assuring quality care or patient safety". These statutes allow that a peer review can be found in court to have been improper due to bad faith or malice, in which case the peer reviewers' immunities from civil liability "fall by the wayside". +Those who practice sham peer review could draw out the process by legal maneuvering, and the fairness of a peer review that has been unduly delayed has been called into question. 
Many medical staff laws specify guidelines for the timeliness of peer review, in compliance with JCAHO standards. + +== Medical peer review process == + +The medical peer review system is a quasi-judicial one. It is modeled in some ways on the grand jury / petit jury system. After a complainant asks for an investigation, a review body is assembled for fact-finding. This fact-finding body, called an ad hoc committee, is appointed by the medical Chief of Staff and is composed of other physician staff members chosen at the Chief of Staff's discretion. This ad hoc committee then conducts an investigation in the manner it feels is appropriate. This may include a review of the literature or the use of an outside expert. Thus, there is no standard for impartiality and specifically no standard for due process in the "peer-review 'process'." +Physicians who are indicted (and sanctioned) have the right to request a hearing. At the hearing, counsel is allowed. A second independent panel of physicians is chosen as the petit jury, and a hearing officer is chosen. The accused physician has the option to demonstrate conflicts of interest and attempt to disqualify jurors based on reasonable suspicions of bias or conflicts of interest in a voir dire process. +Although some medical staff bodies utilize the hospital attorney and accept hospital funds to try peer review cases, the California Medical Association discourages this practice. California has enacted legislation formally requiring the separation of the hospital and medical staff. + +== Alleged cases == +Some physicians allege that sham peer review is often conducted in retaliation for whistleblowing, although one study in 2007 suggested that such events were rare. + +=== Khajavi v. Feather River Anesthesiology Medical Group === +Those who disagree with the AMA point to the case of Nosrat Khajavi. 
In 1996, Khajavi, an anesthesiologist in Yuba City, California, disagreed with a surgeon over the appropriateness of cataract surgery for a patient and refused to attend during the procedure. Khajavi was subsequently terminated from his anesthesia group. He sued for wrongful termination under California Business & Professions Code Section 2053, and the suit was allowed by the California Court of Appeals. In 2000, the court held that Khajavi was not protected from termination on the basis of advocating for what he felt was medically appropriate care. The court did not rule on the merits of the dispute. + +=== Mileikowsky v. Tenet === +A doctor was allegedly subject to multiple hearings for the same charges, and his rights to an expedited hearing were allegedly denied while a suspension was in place. On May 15, 2001, the California Medical Association filed an amicus curiae brief to emphasize legal protections meant to prevent physicians from being arbitrarily excluded from access to healthcare facilities based on mechanisms such as summary suspension without a speedy hearing. This case was decided on April 18, 2005. The court ruled that the hearing officer in the case could indeed terminate the physician's peer review hearing on the grounds that the physician refused to cooperate on procedural and other matters necessary for the good conduct of the proceedings. Thus, the physician lost his membership and privileges at the hospital. Ironically, the same physician was brought into a peer review hearing at another facility a short time later. The hearing officer in that case also terminated the proceedings, this time due to the physician's failure to turn over certain evidence for use in the hearing. The physician challenged the termination through the court system, arguing, contrary to the Tenet appellate court ruling, that California's peer review statutes never intended the hearing officer in peer review hearings to have such powers of termination. 
The California Supreme Court reviewed the case and agreed in April 2009. The High Court ruled, among other things, that peer review hearing officers must defer the question of termination to the panel of physicians who sit in judgment of each peer review hearing. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Sham_peer_review-1.md b/data/en.wikipedia.org/wiki/Sham_peer_review-1.md new file mode 100644 index 000000000..85d3799f6 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Sham_peer_review-1.md @@ -0,0 +1,38 @@ +--- +title: "Sham peer review" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Sham_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:19.440577+00:00" +instance: "kb-cron" +--- + +=== Roland Chalifoux === +Roland Chalifoux, a member of an advocacy organisation called the Semmelweis Society, had his medical license revoked in Texas in 2004 after numerous incidents including the death of a patient. The Texas State Board of Medical Examiners stated that Chalifoux's practices "constitute such a deviation from the standard of care that revocation of his license is the only sanction that will adequately protect the public". Chalifoux subsequently secured permission to practice in West Virginia, and alleges that the Texas board's actions constitute sham peer review. + +=== Charles Williams, MD === +Six years after Charles Williams, MD, an anesthesiologist, was summarily suspended by University Medical Center of Southern Nevada, a federal jury in Las Vegas awarded Dr. Williams $8.8 million as compensation for the due process violations he experienced in his sham peer review. Before the trial, which began May 16, U.S. District Judge Philip Pro made a finding that Ellerton and UMC's medical staff had violated Williams' due process rights. That left only the question of damages for the jury. 
This case appears to be the highest jury verdict in the nation for sham peer review which has not been overturned. + +=== Richard Chudacoff, MD === +On May 28, 2008, without any notice or opportunity to be heard, the Medical Staff of UMC suspended Dr. Chudacoff's clinical privileges. As a result of this, UMC filed a report against Dr. Chudacoff with the National Practitioner Data Bank claiming that Dr. Chudacoff was a risk to patient safety and had inadequate skills. This led to the virtual destruction of Dr. Chudacoff's career. Dr. Chudacoff sued. U.S. District Court Judge Edward Reed opined that, in Nevada, a physician's hospital privileges are a constitutionally protected property right. The Ninth Circuit Court of Appeals then affirmed that Dr. Chudacoff's due process rights were violated by UMC. As well, the Medical Executive members lost their immunity under the HCQIA for failure to follow their bylaws. The case was settled out of court in favor of Dr. Chudacoff, under the cloak of confidentiality. + +== Development of the Patient Safety Organization (PSO) == +The Patient Safety and Quality Improvement Act of 2005 (Public Law 109-41) allows for the creation of Patient Safety Organizations, quality of care committees that can act in parallel with peer review boards. PSOs were authorized to gather information to be analyzed by hospital administrators, nurses, and physicians as a tool for systems failure analysis. They may be used by any healthcare entity except insurance companies, but must be registered with the AHRQ wing of the US Department of Health and Human Services. +In PSOs, root cause analysis and "near misses" are evaluated in an attempt to avert major errors. Participants in PSOs are immune from prosecution in civil, criminal, and administrative hearings. + +== See also == +False accusations +Subpoena duces tecum + +== References == + +== Further reading == +Steve Twedt (2003-10-26). "The Cost of Courage: How the tables turn on doctors". Post-Gazette. 
PG Publishing. +Magazine Staff (2006-08-01). "All Is NOT Calm on the Hospital-Medical Staff Front". Southern California Physician. LACMA Services Inc. Archived from the original on 2006-09-02. +Charles Bond (2005-11-15). "Editorial in Response to "What Is Sham Peer Review?"". Medscape General Medicine. 7 (4): 48. +Goldstein H. (2006). "Appraising the performance of performance appraisals". IEEE Spectrum. Archived from the original on October 13, 2007. +Philip L. Merkel (Winter 2004). "Physicians Policing Physicians: the Development of Medical Staff Peer Review Law at California Hospitals" (PDF). University of San Francisco Law Review. 38: 301. +Bryan G. Hall (Fall 2003). "The Health Care Quality Improvement Act of 1986 and Physician Peer Reviews: Success or Failure?". University of South Dakota Elder Law Forum. Archived from the original on June 14, 2009. +"Health Policy in the Courts – California Medical Association Lawsuits – 2006 – January 2007" (PDF). California Medical Association. January 2007. +"A Monster Chase", Marion A. Stahl, 2012. 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Single-arm_study_design-0.md b/data/en.wikipedia.org/wiki/Single-arm_study_design-0.md index beab5ca2c..2650c89a4 100644 --- a/data/en.wikipedia.org/wiki/Single-arm_study_design-0.md +++ b/data/en.wikipedia.org/wiki/Single-arm_study_design-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Single-arm_study_design" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:16.528685+00:00" +date_saved: "2026-05-05T09:52:02.810781+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Single-subject_design-0.md b/data/en.wikipedia.org/wiki/Single-subject_design-0.md new file mode 100644 index 000000000..4c3b83051 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Single-subject_design-0.md @@ -0,0 +1,81 @@ +--- +title: "Single-subject design" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Single-subject_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:03.952869+00:00" +instance: "kb-cron" +--- + +In design of experiments, single-subject curriculum or single-case research design is a research design most often used in applied fields of psychology, education, and human behaviour in which the subject serves as his/her own control, rather than using another individual/group. Researchers use single-subject designs because they are sensitive to differences between individual organisms, whereas group designs are sensitive to group averages. The logic behind single-subject designs is 1) Prediction, 2) Verification, and 3) Replication. The baseline data predicts behaviour by affirming the consequent. Verification refers to demonstrating that the baseline responding would have continued had no intervention been implemented. Replication occurs when a previously observed behaviour change is reproduced. 
A research study using single-subject design can include large numbers of subjects; because each subject serves as their own control, it is still a single-subject design. These designs are used primarily to evaluate the effect of a variety of interventions in applied research. + + +== Design standards == + + +=== Effect size === +Although there are no standards on the specific statistics required for effect size calculation, it is best practice to include an effect size estimate. + + +== Reporting standards == +When reporting on findings obtained through single-subject designs, specific guidelines are used for standardization and to ensure completeness and transparency: + + +== Types of single-subject designs == + + +=== Reversal design === +Reversal design involves repeated measurement of behaviour in a given setting during three consecutive phases (ABA): baseline, intervention, and return to baseline. Variations include extending the ABA design with repeated reversals (ABAB) and including multiple treatments (ABCABC). AB designs, or reversal designs with no return to baseline, are not considered experimental. Functional control cannot be determined in AB designs because there is no replication. + + +=== Alternating treatments design === +Alternating treatments design (ATD) compares the effects of two or more independent variables on the dependent variable. Variations include a no-treatment control condition and a final best-treatment verification phase. + + +=== Multiple baseline design === +In a multiple baseline design, baseline measurement begins simultaneously on two or more behaviours, settings, or participants. The independent variable (IV) is implemented on one behaviour, setting, or participant, while baseline continues for all others. Variations include the multiple probe design and delayed multiple baseline design. 
+ + +=== Changing criterion design === +Changing criterion designs are used to evaluate the effects of an IV on the gradual improvement of a behavior already in the participant's repertoire. + + +== Interpretation of data == +In order to determine the effect of the independent variable on the dependent variable, the researcher will graph the data collected and visually inspect the differences between phases. If there is a clear distinction between baseline and intervention, and then the data returns to the same trends/level during reversal, a functional relation between the variables is inferred. Sometimes, visual inspection of the data demonstrates results that statistical tests fail to find. Features assessed during visual analysis include: + +Level. The overall average (mean) of the outcome measures within a phase. +Trend. The slope of the best-fitting straight line for the outcome measures within a phase. +Variability. The range, variance, or standard deviation of the outcome measures about the best-fitting line. +Immediacy of Effect. The change in level between the last three data points in one phase and the first three data points of the next. +Overlap. The proportion of data from one phase that overlaps with data from the previous phase. +Consistency of Data Patterns. The extent to which there is consistency in the data patterns from phases with the same conditions. + + +== Limitations == +Research designs are traditionally preplanned so that most of the details about to whom and when the intervention will be introduced are decided prior to the beginning of the study. However, in single-subject designs, these decisions are often made as the data are collected. In addition, there are no widely agreed-upon rules for altering phases, so conflicting ideas could emerge as to how a research experiment should be conducted in single-subject design. 
+The major criticisms of single-subject designs are: + +Carry-over effects: Results from the previous phase carry over into the next phase. +Order effects: The ordering (sequence) of the intervention or treatment affects the results. +Irreversibility: In some withdrawal designs, once a change in the independent variable occurs, the dependent variable is affected. This cannot be undone by simply removing the independent variable. +Ethical problems: Withdrawal of treatment in the withdrawal design can at times present ethical and feasibility problems. + + +== History == +Historically, single-subject designs have been closely tied to the experimental analysis of behavior and applied behavior analysis. + + +== See also == +N of 1 trial +Single-subject research +Segmented regression +Meta-analysis + + +== References == + + +== Further reading == +Kazdin, Alan (1982). Single-Case Research Designs. New York: Oxford University Press. ISBN 0-19-503021-4. +Ledford, Jennifer R. & Gast, David L. (2018). Single subject research methodology in behavioral sciences: applications in special education and behavioral sciences. Routledge, 2009. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Single-subject_research-0.md b/data/en.wikipedia.org/wiki/Single-subject_research-0.md new file mode 100644 index 000000000..890981090 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Single-subject_research-0.md @@ -0,0 +1,82 @@ +--- +title: "Single-subject research" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Single-subject_research" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:05.128250+00:00" +instance: "kb-cron" +--- + +Single-subject research is a group of research methods that are used extensively in the experimental analysis of behavior and applied behavior analysis with both human and non-human participants. 
This research strategy focuses on one participant and tracks their progress in the research topic over a period of time. Single-subject research allows researchers to track changes in an individual over a large stretch of time instead of observing different people at different stages. This type of research can provide critical data in several fields, specifically psychology. It is most commonly used in experimental and applied analysis of behaviors. This research has been heavily debated over the years. Some believe that this research method is not effective at all while others praise the data that can be collected from it. Principal methods in this type of research are: A-B-A-B designs, Multi-element designs, Multiple Baseline designs, Repeated acquisition designs, Brief experimental designs and Combined designs. +These methods form the heart of the data collection and analytic code of behavior analysis. Behavior analysis is data driven, inductive, and disinclined to hypothetico-deductive methods. + + +== Experimental questions == +Experimental questions are decisive in determining the nature of the experimental design to be selected. There are four basic types of experimental questions: demonstration, comparison, parametric, and component. A demonstration is "Does A cause or influence B?". A comparison is "Does A1 or A2 cause or influence B more?". A parametric question is "How much of A will cause how much change or influence on B?". A component question is "Which part of A{1,2,3} - A1 or A2 or A3... - causes or influences B?" where A is composed of parts that can be separated and tested. +The A-B-A-B design is useful for demonstration questions. + + +== A-B-A-B == + + +=== A-B-A-B === +A-B-A-B designs begin with establishing a baseline (A #1) then introduce a new behavior or treatment (B #1). Then there is a return to the baseline (A #2) by removing B #1. B #2 is a return of the new behavior or treatment. 
+ + +=== A-B === +An AB design is a two-part or phase design composed of a baseline ("A" phase) with no changes and a treatment or intervention ("B") phase. If there is a change then the treatment may be said to have had an effect. However, it is subject to many possible competing hypotheses, making strong conclusions difficult. Variants on the AB design introduce ways to control for the competing hypotheses to allow for stronger conclusions. + + +=== Reversal or A-B-A === +The reversal design is the most powerful of the single-subject research designs showing a strong reversal from baseline ("A") to treatment ("B") and back again. If the variable returns to baseline measure without a treatment then resumes its effects when reapplied, the researcher can have greater confidence in the efficacy of that treatment. However, many interventions cannot be reversed, some for ethical reasons (e.g., involving self-injurious behavior, smoking) and some for practical reasons (they cannot be unlearned, like a skill). +Further ethics notes: It may be unethical to end an experiment on a baseline measure if the treatment is self-sustaining and highly beneficial and/or related to health. Control condition participants may also deserve the benefits of research once all data has been collected. It is a researcher's ethical duty to maximize benefits and to ensure that all participants have access to those benefits when possible. + + +=== A-B-C === +The A-B-C design is a variant that allows for the extension of research questions around component, parametric and comparative questions. + + +== Multielement == +Multi-element designs sometimes referred to as alternating-treatment designs are used in order to ascertain the comparative effect of two treatments. Two treatments are alternated in rapid succession and correlated changes are plotted on a graph to facilitate comparison. 
Multi-element designs are typically used in single-subject research to accurately test multiple independent variables at once. + + +== Multiple baseline == +The multiple baseline design was first reported in 1960 as used in basic operant research. It was applied in the late 1960s to human experiments in response to practical and ethical issues that arose in withdrawing apparently successful treatments from human subjects. In this design, two or more (often three) behaviors, people or settings are plotted in a staggered graph where a change is made to one, but not the other two, and then to the second, but not the third behavior, person or setting. Differential changes that occur to each behavior, person or in each setting help to strengthen what is essentially an AB design with its problematic competing hypotheses. +Multiple baseline tests are used to determine the helpfulness of an intervention. By focusing daily data collection on one participant, researchers can prepare to expand their research. This research method yields a large amount of data that can be analyzed by researchers. This data can then be used to support a researcher's hypothesis and/or give insight before moving on to a group research project. + + +== Repeated Acquisition == +In addition to multiple baseline designs, a way to deal with problematic reversibility is the use of repeated acquisitions. + + +== Brief == +Brief designs, favored by researchers in applied settings where logistical challenges, time and other limits make research difficult, are variants of multi-element and A-B-A-B type designs. + + +== Combined == +Combined single-subject research is used to gain added knowledge on the research question and to make group research run better. The combined design has arisen from a need to obtain answers to more complex research questions. Combining two or more single-case designs, such as A-B-A-B and multiple baseline, may produce such answers. 
+ + +== Multipleprobe == +Popular in Verbal Behavior research, the multipleprobe research design has elements of the other research designs. + + +== Changing-Criterion == +In a changing-criterion research design a criterion for reinforcement is changed across the experiment to demonstrate a functional relation between the reinforcement and the behavior. + + +== See also == +Single-subject design +N of 1 trial +Applied behavior analysis +Verbal behavior +B.F. Skinner + + +== References == + + +== Sources == +Kennedy, Craig H. (2005). Single-case designs for educational research. Boston: Pearson/A & B. ISBN 0-205-34023-7. +HAINS, ANN HIGGINS. “Multi-Element Designs for Early Intervention Research.” Journal of Early Intervention, vol. 15, no. 2, 1991, pp. 185–192., https://doi.org/10.1177/105381519101500207. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Software_peer_review-0.md b/data/en.wikipedia.org/wiki/Software_peer_review-0.md new file mode 100644 index 000000000..a06ee1773 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Software_peer_review-0.md @@ -0,0 +1,35 @@ +--- +title: "Software peer review" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Software_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:20.598380+00:00" +instance: "kb-cron" +--- + +In software development, peer review is a type of software review in which a work product (document, code, or other) is examined by the author's colleagues, in order to evaluate the work product's technical content and quality. + + +== Purpose == +The purpose of a peer review is to provide "a disciplined engineering practice for detecting and correcting defects in software artifacts, and preventing their leakage into field operations" according to the Capability Maturity Model. +When performed as part of each Software development process activity, peer reviews identify problems that can be fixed early in the lifecycle. 
That is to say, a requirements problem identified by a peer review during the Requirements analysis activity is cheaper and easier to fix than one identified during the Software architecture or Software testing activities. +The National Software Quality Experiment, evaluating the effectiveness of peer reviews, finds, "a favorable return on investment for software inspections; savings exceeds costs by 4 to 1". To state it another way, it is four times more costly, on average, to identify and fix a software problem later. + + +== Distinction from other types of software review == +Peer reviews are distinct from management reviews, which are conducted by management representatives rather than by colleagues, and for management and control purposes rather than for technical evaluation. They are also distinct from software audit reviews, which are conducted by personnel external to the project, to evaluate compliance with specifications, standards, contractual agreements, or other criteria. + + +== Review processes == + +Peer review processes exist across a spectrum of formality, with relatively unstructured activities such as "buddy checking" towards one end of the spectrum, and more formal approaches such as walkthroughs, technical peer reviews, and software inspections, at the other. The IEEE defines formal structures, roles, and processes for each of the last three. +Management representatives are typically not involved in the conduct of a peer review except when included because of specific technical expertise or when the work product under review is a management-level document. This is especially true of line managers of other participants in the review. +Processes for formal peer reviews, such as software inspections, define specific roles for each participant, quantify stages with entry/exit criteria, and capture software metrics on the peer review process. 
+ + +== "Open source" reviews == +In the free / open source community, something like peer review has taken place in the engineering and evaluation of computer software. In this context, the rationale for peer review has its equivalent in Linus's law, often phrased: "Given enough eyeballs, all bugs are shallow", meaning "If there are enough reviewers, all problems are easy to solve." Eric S. Raymond has written influentially about peer review in software development. + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Solomon_four-group_design-0.md b/data/en.wikipedia.org/wiki/Solomon_four-group_design-0.md index a9852a599..ac066cec0 100644 --- a/data/en.wikipedia.org/wiki/Solomon_four-group_design-0.md +++ b/data/en.wikipedia.org/wiki/Solomon_four-group_design-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/Solomon_four-group_design" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T06:29:13.508479+00:00" +date_saved: "2026-05-05T09:52:06.354013+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Sparsity-of-effects_principle-0.md b/data/en.wikipedia.org/wiki/Sparsity-of-effects_principle-0.md new file mode 100644 index 000000000..9777e96ee --- /dev/null +++ b/data/en.wikipedia.org/wiki/Sparsity-of-effects_principle-0.md @@ -0,0 +1,20 @@ +--- +title: "Sparsity-of-effects principle" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Sparsity-of-effects_principle" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:07.517443+00:00" +instance: "kb-cron" +--- + +In the statistical analysis of the results from factorial experiments, the sparsity-of-effects principle states that a system is usually dominated by main effects and low-order interactions. Thus it is most likely that main (single factor) effects and two-factor interactions are the most significant responses in a factorial experiment. 
In other words, higher order interactions such as three-factor interactions are very rare. This is sometimes referred to as the hierarchical ordering principle. The sparsity-of-effects principle actually refers to the idea that only a few effects in a factorial experiment will be statistically significant. +This principle is only valid on the assumption of a factor space far from a stationary point. + + +== See also == +Occam's Razor +Pareto principle + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Spectrum_bias-0.md b/data/en.wikipedia.org/wiki/Spectrum_bias-0.md new file mode 100644 index 000000000..b36a717b8 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Spectrum_bias-0.md @@ -0,0 +1,25 @@ +--- +title: "Spectrum bias" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Spectrum_bias" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:08.680162+00:00" +instance: "kb-cron" +--- + +In biostatistics, spectrum bias refers to the phenomenon that the performance of a diagnostic test may vary in different clinical settings because each setting has a different mix of patients. Because the performance may be dependent on the mix of patients, performance at one clinic may not be predictive of performance at another clinic. These differences are interpreted as a kind of bias. Mathematically, the spectrum bias is a sampling bias and not a traditional statistical bias; this has led some authors to refer to the phenomenon as spectrum effects, whilst others maintain it is a bias if the true performance of the test differs from that which is 'expected'. Usually the performance of a diagnostic test is measured in terms of its sensitivity and specificity and it is changes in these that are considered when referring to spectrum bias. However, other performance measures such as the likelihood ratios may also be affected by spectrum bias. +Generally spectrum bias is considered to have three causes. 
The first is due to a change in the case-mix of those patients with the target disorder (disease) and this affects the sensitivity. The second is due to a change in the case-mix of those without the target disorder (disease-free) and this affects the specificity. The third is due to a change in the prevalence, and this affects both the sensitivity and specificity. This final cause is not widely appreciated, but there is mounting empirical evidence as well as theoretical arguments which suggest that it does indeed affect a test's performance. +Examples where the sensitivity and specificity change between different sub-groups of patients may be found with the carcinoembryonic antigen test and urinary dipstick tests. +Diagnostic test performances reported by some studies may be artificially overestimated if it is a case-control design where a healthy population ('fittest of the fit') is compared with a population with advanced disease ('sickest of the sick'); that is two extreme populations are compared, rather than typical healthy and diseased populations. +If properly analyzed, recognition of heterogeneity of subgroups can lead to insights about the test's performance in varying populations. 
+ + +== See also == +Simpson's paradox +Biased sample +Reporting bias +Reference class problem + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Spherical_design-0.md b/data/en.wikipedia.org/wiki/Spherical_design-0.md new file mode 100644 index 000000000..369d9badf --- /dev/null +++ b/data/en.wikipedia.org/wiki/Spherical_design-0.md @@ -0,0 +1,80 @@ +--- +title: "Spherical design" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Spherical_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:09.899202+00:00" +instance: "kb-cron" +--- + +A spherical design, part of combinatorial design theory in mathematics, is a finite set of N points on the d-dimensional unit d-sphere Sd such that the average value of any polynomial f of degree t or less on the set equals the average value of f on the whole sphere (that is, the integral of f over Sd divided by the area or measure of Sd). Such a set is often called a spherical t-design to indicate the value of t, which is a fundamental parameter. The concept of a spherical design is due to Delsarte, Goethals, and Seidel, although these objects were understood as particular examples of cubature formulas earlier. +Spherical designs can be of value in approximation theory, in statistics for experimental design, in combinatorics, and in geometry. The main problem is to find examples, given d and t, that are not too large; however, such examples may be hard to come by. +Spherical t-designs have also recently been appropriated in quantum mechanics in the form of quantum t-designs with various applications to quantum information theory and quantum computing. + + +== Existence of spherical designs == +The existence and structure of spherical designs on the circle were studied in depth by Hong. 
Shortly thereafter, Seymour and Zaslavsky proved that such designs exist of all sufficiently large sizes; that is, given positive integers d and t, there is a number N(d,t) such that for every N ≥ N(d,t) there exists a spherical t-design of N points in dimension d. However, their proof gave no idea of how big N(d,t) is. +Mimura constructively found conditions in terms of the number of points and the dimension which characterize exactly when spherical 2-designs exist. Maximally sized collections of equiangular lines (up to identification of lines as antipodal points on the sphere) are examples of minimal sized spherical 5-designs. There are many sporadic small spherical designs; many of them are related to finite group actions on the sphere. +In 2013, Bondarenko, Radchenko, and Viazovska obtained the asymptotic upper bound + +$$N(d,t) < C_d t^d$$ + +[…] + +$$\mathrm{pval} = \frac{1}{L+1}\left[1 + \sum_{l} 1\left(T^{(l)} > T^{obs}\right)\right].$$ + + +To explain this procedure, in Step 1, we define the sub-populations of interest: $I_1$ is the set of students who are in control but their roommate is treated, +and $I_0$ are the students in control with their roommates also in control. These are known as "focal units". +In Step 2, we define an estimate of the spillover effect as $\bar{Y}_{0,1} - \bar{Y}_{0,0}$, the difference in outcomes between populations $I_1, I_0$. +Crucially, in randomization inference, we don't need to derive the sampling distribution of this estimator. 
The validity of the procedure stems from Step 3 where we resample treatment according to the true experimental variation (here, simply permuting the "exposures" 01 and 00) while keeping the outcomes fixed under the null. Finally, in Step 4 we calculate the randomization p-value. The $1/(L+1)$ term is a finite-sample correction to avoid issues with repeated test statistic values. +As mentioned before, the randomization p-value is valid for any finite sample size and does not rely on correct model specification. +This randomization procedure can be extended to arbitrary designs and more general definitions of spillover effects, although care must be taken to properly account for the interference structure between all pairs of units. +The above procedure can also be used to obtain an interval estimate of a constant spillover effect through test inversion. Moreover, the same procedure could be modified for testing whether the "average" spillover effect is zero by using an appropriately studentized test statistic in Step 2. + +==== Exposure mappings ==== + +The next step after redefining the causal estimand of interest is to characterize the probability of spillover exposure for each subject in the analysis, given some vector of treatment assignment. Aronow and Samii (2017) present a method for obtaining a matrix of exposure probabilities for each unit in the analysis. +First, define a diagonal matrix with a vector of treatment assignment probabilities + +$$\mathbf{P} = \operatorname{diag}\left(p_{\mathbf{z}_1}, p_{\mathbf{z}_2}, \dots, p_{\mathbf{z}_{|\Omega|}}\right).$$
+ +Second, define an indicator matrix $\mathbf{I}$ of whether the unit is exposed to spillover or not. This is done by using an adjacency matrix as shown on the right, where information regarding a network can be transformed into an indicator matrix. This resulting indicator matrix will contain values of $d_k$, the realized values of a random binary variable $D_i = f\left(\mathbf{Z}, \theta_i\right)$, indicating whether that unit has been exposed to spillover or not. +Third, obtain the sandwich product $\mathbf{I}_k \mathbf{P} \mathbf{I}_k^{\prime}$, an N × N matrix which contains two elements: the individual probability of exposure $\pi_i\left(d_k\right)$ on the diagonal, and the joint exposure probabilities $\pi_{ij}\left(d_k\right)$ on the off diagonals: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Spillover_(experiment)-2.md b/data/en.wikipedia.org/wiki/Spillover_(experiment)-2.md new file mode 100644 index 000000000..e6db1bc69 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Spillover_(experiment)-2.md @@ -0,0 +1,595 @@ +--- +title: "Spillover (experiment)" +chunk: 3/3 +source: "https://en.wikipedia.org/wiki/Spillover_(experiment)" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:11.116942+00:00" +instance: "kb-cron" +--- +
+$$\mathbf{I}_k \mathbf{P} \mathbf{I}_k^{\prime} = \left[\begin{array}{cccc} \pi_1(d_k) & \pi_{12}(d_k) & \cdots & \pi_{1N}(d_k) \\ \pi_{21}(d_k) & \pi_2(d_k) & \cdots & \pi_{2N}(d_k) \\ \vdots & \vdots & \ddots & \\ \pi_{N1}(d_k) & \pi_{N2}(d_k) & & \pi_N(d_k) \end{array}\right]$$ + +In a similar fashion, the joint probability of exposure of i being in exposure condition $d_k$ and j being in a different exposure condition $d_l$ can be obtained by calculating $\mathbf{I}_k \mathbf{P} \mathbf{I}_l^{\prime}$: + +$$\mathbf{I}_k \mathbf{P} \mathbf{I}_l^{\prime} = \left[\begin{array}{cccc} 0 & \pi_{12}\left(d_k, d_l\right) & \dots & \pi_{1N}\left(d_k, d_l\right) \\ \pi_{21}\left(d_k, d_l\right) & 0 & \ldots & \pi_{2N}\left(d_k, d_l\right) \\ \vdots & \vdots & \ddots & \\ \pi_{N1}(d_k, d_l) & \pi_{N2}(d_k, d_l) & & 0 \end{array}\right]$$
+ +Notice that the diagonals on the second matrix are 0 because a subject cannot be simultaneously exposed to two different exposure conditions at once, in the same way that a subject cannot reveal two different potential outcomes at once. +The obtained exposure probabilities $\pi$ then can be used for inverse probability weighting (IPW, described below), in an estimator such as the Horvitz–Thompson estimator. +One important caveat is that this procedure excludes all units whose probability of exposure is zero (ex. a unit that is not connected to any other units), since these numbers end up in the denominator of the IPW regression. + +=== Need for inverse probability weights === + +Estimating spillover effects requires additional care: although treatment is directly assigned, spillover status is indirectly assigned and can lead to differential probabilities of spillover assignment for units. For example, a subject with 10 friend connections is more likely to be indirectly exposed to treatment as opposed to a subject with just one friend connection. Not accounting for varying probabilities of spillover exposure can bias estimates of the average spillover effect. +Figure 1 displays an example where units have varying probabilities of being assigned to the spillover condition. Subfigure A displays a network of 25 nodes where the units in green are eligible to receive treatment. Spillovers are defined as sharing at least one edge with a treated unit. For example, if node 16 is treated, nodes 11, 17, and 21 would be classified as spillover units. 
Suppose three of these six green units are selected randomly to be treated, so that $\binom{6}{3}=20$ different sets of treatment assignments are possible. In this case, subfigure B displays each node's probability of being assigned to the spillover condition. Node 3 is assigned to spillover in 95% of the randomizations because it shares edges with three of the six treatment-eligible units. This node will only be a control node in 5% of randomizations (1 of the 20): that is, when the three treated nodes are 14, 16, and 18. Meanwhile, node 15 is assigned to spillover only 50% of the time: if node 14 is not directly treated, node 15 will not be assigned to spillover. + +==== Using inverse probability weights ==== +When analyzing experiments with varying probabilities of assignment, special precautions should be taken. These differences in assignment probabilities may be neutralized by inverse-probability-weighted (IPW) regression, where each observation is weighted by the inverse of its probability of being assigned to the condition in which it was observed, as in the Horvitz–Thompson estimator. This approach addresses the bias that might arise if potential outcomes were systematically related to assignment probabilities. The downside of this estimator is that it may be fraught with sampling variability if some observations are accorded a high amount of weight (i.e., a unit with a low probability of being spillover is assigned to the spillover condition by chance). + +==== Regression approaches ==== +In non-experimental settings, estimating the variability of a spillover effect creates additional difficulty. When the research study has a fixed unit of clustering, such as a school or household, researchers can use traditional standard error adjustment tools like cluster-robust standard errors, which allow for correlations in error terms within clusters but not across them.
In other settings, however, there is no fixed unit of clustering. In order to conduct hypothesis testing in these settings, the use of randomization inference is recommended. This technique allows one to generate p-values and confidence intervals even when spillovers do not adhere to a fixed unit of clustering but nearby units tend to be assigned to similar spillover conditions, as in the case of fuzzy clustering. + +== See also == +Social multiplier effect + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Standard_treatment-0.md b/data/en.wikipedia.org/wiki/Standard_treatment-0.md new file mode 100644 index 000000000..4c5e81290 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Standard_treatment-0.md @@ -0,0 +1,26 @@ +--- +title: "Standard treatment" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Standard_treatment" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:12.282392+00:00" +instance: "kb-cron" +--- + +The standard treatment, also known as the standard of care, is the medical treatment that is normally provided to people with a given condition. In many scientific studies, the control group receives the standard treatment rather than a placebo while a treatment group receives the experimental treatment. After the clinical trial, researchers compare the outcomes of the two groups to see if the experimental treatment is better than, as good as or not as beneficial as the standard treatment. + + +== Active (positive) concurrent control == +In an active control or positive control trial, subjects are randomly assigned to the test treatment or to an active control treatment. Such clinical trials are usually double-blind, but this is not always possible; many oncology trials, for example, are considered difficult or impossible to blind because of different regimens, different routes of administration, and different toxicities. 
Active control trials can have two distinct objectives with respect to showing efficacy: (1) to show efficacy of the test treatment by showing it is as good as a known effective treatment or (2) to show efficacy by showing superiority of the test treatment to the active control. They may also be used with the primary objective of comparing the efficacy and/or safety of the two treatments. Whether the purpose of the trial is to show efficacy of the new treatment or to compare two treatments, the question of whether the trial would be capable of distinguishing effective from less effective or ineffective treatments is critical. + + +== See also == +Drug development +New drug application +U.S. Food and Drug Administration (FDA) +European Medicines Agency +International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-0.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-0.md new file mode 100644 index 000000000..e3a1bdc49 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-0.md @@ -0,0 +1,31 @@ +--- +title: "Statistical hypothesis test" +chunk: 1/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical value or equivalently by evaluating a p-value computed from the test statistic. Roughly 100 specialized statistical tests are in use and noteworthy. 
+ +== History == + +While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to John Arbuthnot (1710), followed by Pierre-Simon Laplace (1770s), in analyzing the human sex ratio at birth; see § Human sex ratio. + +=== Choice of null hypothesis === +Paul Meehl has argued that the epistemological importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment. An examination of the origins of the latter practice may therefore be useful: +1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus, the null hypothesis in this case is that the birthrates of boys and girls should be equal given "conventional wisdom". +1900: Karl Pearson develops the chi-squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the Weldon dice throw data. +1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox). The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".
+ +== Modern origins and early controversy == +Modern significance testing is largely the product of Karl Pearson (p-value, Pearson's chi-squared test), William Sealy Gosset (Student's t-distribution), and Ronald Fisher ("null hypothesis", analysis of variance, "significance test"), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the principle of indifference when determining prior probabilities), and sought to provide a more "objective" approach to inductive inference. +Fisher emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century. +Fisher popularized the "significance test". He required a null-hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-hypothesis or not. Significance testing did not utilize an alternative hypothesis so there was no concept of a Type II error (false negative). +The p-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's faith in the null hypothesis. Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's p-value, also meant to determine researcher behaviour, but without requiring any inductive inference by the researcher. 
+Neyman & Pearson considered a different problem from Fisher's (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities. +Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing (the defining paper was abstract; mathematicians have generalized and refined the theory for decades). Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data is collected was incompatible with this common scenario faced by scientists, and that attempts to apply this method to scientific research would lead to mass confusion. +The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference. +Events intervened: Neyman accepted a position at the University of California, Berkeley in 1938, breaking his partnership with Pearson and separating the disputants (who had occupied the same building). World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy. Some of Neyman's later publications reported p-values and significance levels.
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-1.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-1.md new file mode 100644 index 000000000..0039717f7 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-1.md @@ -0,0 +1,41 @@ +--- +title: "Statistical hypothesis test" +chunk: 2/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +=== Null hypothesis significance testing (NHST) === +The modern version of hypothesis testing is generally called null hypothesis significance testing (NHST) and is a hybrid of the Fisher approach with the Neyman–Pearson approach. In 2000, Raymond S. Nickerson wrote an article stating that NHST was (at the time) "arguably the most widely used method of analysis of data collected in psychological experiments and has been so for about 70 years" and that it was at the same time "very controversial". +This fusion resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s (but signal detection, for example, still uses the Neyman/Pearson formulation). Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs. +Sometime around 1940, authors of statistical textbooks began combining the two approaches by using the p-value in place of the test statistic (or data) to test against the Neyman–Pearson "significance level". + +== Philosophy == +Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability.
Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the philosophy of science. +Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical. +Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments. +Hypothesis testing is of continuing interest to philosophers. + +== Education == + +Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught. Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly. An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). 
Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues. +An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors. While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing. Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject. +Raymond S. Nickerson commented: + +The debate about NHST has its roots in unresolved disagreements among major contributors to the development of theories of inferential statistics on which modern approaches are based. Gigerenzer et al. (1989) have reviewed in considerable detail the controversy between R. A. Fisher on the one hand and Jerzy Neyman and Egon Pearson on the other as well as the disagreements between both of these views and those of the followers of Thomas Bayes. They noted the remarkable fact that little hint of the historical and ongoing controversy is to be found in most textbooks that are used to teach NHST to its potential users. The resulting lack of an accurate historical perspective and understanding of the complexity and sometimes controversial philosophical foundations of various approaches to statistical inference may go a long way toward explaining the apparent ease with which statistical tests are misused and misinterpreted.
+ +== Performing a frequentist hypothesis test in practice == +The typical steps involved in performing a frequentist hypothesis test in practice are: + +Define a hypothesis (claim which is testable using data). +Select a relevant statistical test with associated test statistic T. +Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution with known degrees of freedom, or a normal distribution with known mean and variance. +Select a significance level (α), the maximum acceptable false positive rate. Common values are 5% and 1%. +Compute from the observations the observed value tobs of the test statistic T. +Decide to either reject the null hypothesis in favor of the alternative or not reject it. The Neyman-Pearson decision rule is to reject the null hypothesis H0 if the observed value tobs is in the critical region, and not to reject the null hypothesis otherwise. + +=== Practical example === +The difference in the two processes applied to the radioactive suitcase example (below): \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-2.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-2.md new file mode 100644 index 000000000..ac9ccefcf --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-2.md @@ -0,0 +1,71 @@ +--- +title: "Statistical hypothesis test" +chunk: 3/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +"The Geiger-counter reading is 10. The limit is 9. Check the suitcase." +"The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase." 
+The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked. +Not rejecting the null hypothesis does not mean the null hypothesis is "accepted" per se (though Neyman and Pearson used that word in their original writings; see the Interpretation section). +The processes described here are perfectly adequate for computation. They seriously neglect design-of-experiments considerations. +It is particularly critical that appropriate sample sizes be estimated before conducting the experiment. +The phrase "test of significance" was coined by statistician Ronald Fisher. + +=== Interpretation === +When the null hypothesis is true and statistical assumptions are met, the probability that the p-value will be less than or equal to the significance level $\alpha$ is at most $\alpha$. This ensures that the hypothesis test maintains its specified false positive rate (provided that statistical assumptions are met). +The p-value is the probability that a test statistic which is at least as extreme as the one obtained would occur under the null hypothesis. At a significance level of 0.05, a test of a fair coin would be expected to (incorrectly) reject the null hypothesis (that the coin is fair) in 1 out of 20 tests on average. The p-value does not provide the probability that either the null hypothesis or its opposite is correct (a common source of confusion). +If the p-value is less than the chosen significance threshold (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the chosen level of significance. If the p-value is not less than the chosen significance threshold (equivalently, if the observed test statistic is outside the critical region), then the null hypothesis is not rejected at the chosen level of significance.
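The claim that the p-value falls at or below the significance level with probability at most α under the null hypothesis can be checked exactly for a small discrete case. The sketch below is illustrative only: the choice of 20 coin flips, the two-sided "at most as likely as the observed count" definition of extremeness, and the function names are assumptions made for this example, not part of the source text.

```python
from math import comb


def two_sided_p_value(k: int, n: int) -> float:
    """Exact two-sided p-value for k heads in n flips of a fair coin.

    An outcome counts as "at least as extreme" when it is no more
    probable than the observed one (one common convention).
    """
    counts = [comb(n, j) for j in range(n + 1)]  # equally weighted sequences
    extreme = sum(c for c in counts if c <= counts[k])
    return extreme / 2**n


def rejection_probability(n: int, alpha: float) -> float:
    """Probability, under the null, that the p-value is <= alpha."""
    rejected = sum(
        comb(n, k) for k in range(n + 1) if two_sided_p_value(k, n) <= alpha
    )
    return rejected / 2**n


size = rejection_probability(20, 0.05)
print(f"{size:.4f}")  # 0.0414, which does not exceed alpha = 0.05
```

Because the binomial distribution is discrete, the attained false positive rate here is strictly below 0.05, which is why the statement is "at most α" rather than "exactly α".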
+In the "lady tasting tea" example (below), Fisher required the lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. His test revealed that if the lady was effectively guessing at random (the null hypothesis), there was a 1.4% chance that the observed results (perfectly ordered tea) would occur. + +=== Use and importance === +Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing which can justify conclusions even when no scientific theory exists. In the Lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious". +Real world applications of hypothesis testing include: + +Testing whether more men than women suffer from nightmares +Establishing authorship of documents +Evaluating the effect of the full moon on behavior +Determining the range at which a bat can detect an insect by echo +Deciding whether hospital carpeting results in more infections +Selecting the best means to stop smoking +Checking whether bumper stickers reflect car owner behavior +Testing the claims of handwriting analysts +Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future". +Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s). Other fields have favored the estimation of parameters (e.g. effect size). 
Significance testing is used as a substitute for the traditional comparison of predicted value and experimental result at the core of the scientific method. When theory is only capable of predicting the sign of a relationship, a directional (one-sided) hypothesis test can be configured so that only a statistically significant result supports theory. This form of theory appraisal is the most heavily criticized application of hypothesis testing. + +=== Cautions === +"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed." This caution applies to hypothesis tests and alternatives to them. +The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong. +The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including: + +The clever Hans effect. A horse appeared to be capable of doing simple arithmetic. +The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse. +The placebo effect. Pills with no medically active ingredients were remarkably effective. +A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy. +Publication bias: Statistically nonsignificant results may be less likely to be published, which can bias the literature. +Multiple testing: When multiple true null hypothesis tests are conducted at once without adjustment, the overall probability of Type I error is higher than the nominal alpha level. 
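The multiple-testing caution can be made concrete with a short calculation. This is a minimal sketch that assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical, and the Bonferroni adjustment shown at the end is one common correction that the text above does not itself name.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive among m independent tests
    of true null hypotheses, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 5, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
# 1 0.05
# 5 0.226
# 20 0.642

# Assumed remedy for illustration: Bonferroni runs each test at alpha / m.
bonferroni = familywise_error_rate(0.05 / 20, 20)
print(bonferroni < 0.05)  # True
```

With 20 unadjusted tests, the chance of at least one spurious rejection is roughly 64%, far above the nominal 5%.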
+Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous). + +== Definition of terms == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-3.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-3.md new file mode 100644 index 000000000..307e3c334 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-3.md @@ -0,0 +1,49 @@ +--- +title: "Statistical hypothesis test" +chunk: 4/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +The following definitions are mainly based on the exposition in the book by Lehmann and Romano: + +Statistical hypothesis: A statement about the parameters describing a population (not a sample). +Test statistic: A value calculated from a sample without any unknown parameters, often to summarize the sample for comparison purposes. +Simple hypothesis: Any hypothesis which specifies the population distribution completely. +Composite hypothesis: Any hypothesis which does not specify the population distribution completely. +Null hypothesis (H0) +Positive data: Data that enable the investigator to reject a null hypothesis. +Alternative hypothesis (H1) + +Critical values of a statistical test are the boundaries of the acceptance region of the test. The acceptance region is the set of values of the test statistic for which the null hypothesis is not rejected. Depending on the shape of the acceptance region, there can be one or more than one critical value. 
+Region of rejection / Critical region: The set of values of the test statistic for which the null hypothesis is rejected. +Power of a test (1 − β) +Size: For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis, i.e. the false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and type I and type II errors for exhaustive definitions. +Significance level of a test (α) +p-value +Statistical significance test: A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used for the modern version which is now part of statistical hypothesis testing. +Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level. +Exact test +A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality.
For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality: + +Most powerful test: For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis. +Uniformly most powerful test (UMP) + +== Nonparametric bootstrap hypothesis testing == + +Bootstrap-based resampling methods can be used for null hypothesis testing. A bootstrap creates numerous simulated samples by randomly resampling (with replacement) the original, combined sample data, assuming the null hypothesis is correct. The bootstrap is very versatile as it is distribution-free and it does not rely on restrictive parametric assumptions, but rather on empirical approximate methods with asymptotic guarantees. Traditional parametric hypothesis tests are more computationally efficient but make stronger structural assumptions. In situations where computing the probability of the test statistic under the null hypothesis is hard or impossible (due to perhaps inconvenience or lack of knowledge of the underlying distribution), the bootstrap offers a viable method for statistical inference. + +== Examples == + +=== Human sex ratio === + +The earliest use of statistical hypothesis testing is generally credited to the question of whether male and female births are equally likely (null hypothesis), which was addressed in the 1700s by John Arbuthnot (1710), and later by Pierre-Simon Laplace (1770s). +Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the sign test, a simple non-parametric test. In every year, the number of males born in London exceeded the number of females. 
Considering more male or more female births as equally likely, the probability of the observed outcome is 0.5^82, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the p-value. Arbuthnot concluded that this is too small to be due to chance and must instead be due to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the p = 1/2^82 significance level. +Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect. + +=== Lady tasting tea === + +In a famous example of hypothesis testing, known as the Lady tasting tea, Dr. Muriel Bristol, a colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the number she got correct, but just by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the four cups. The critical region was the single case of 4 successes of 4 possible based on a conventional probability criterion (< 5%). A pattern of 4 successes corresponds to 1 out of 70 possible combinations (p ≈ 1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup, which would be considered a statistically significant result.
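The two p-values quoted in the examples above can be checked with exact arithmetic. The following is a minimal sketch in Python, not part of the source article; the function names are illustrative assumptions. It assumes the sign-test reading of Arbuthnot's data (82 years, each treated as a fair coin flip under the null hypothesis) and the counting argument for the tea experiment (one correct selection among all ways of choosing 4 cups out of 8).

```python
from fractions import Fraction
from math import comb


def arbuthnot_p_value(years: int = 82) -> Fraction:
    """Probability that male births exceed female births in every one of
    `years` years, if each year is an independent fair coin flip."""
    return Fraction(1, 2) ** years


def tea_tasting_p_value(cups: int = 8, chosen: int = 4) -> Fraction:
    """Probability of selecting all `chosen` cups correctly by guessing:
    one favourable selection out of C(cups, chosen) equally likely ones."""
    return Fraction(1, comb(cups, chosen))


if __name__ == "__main__":
    p_births = arbuthnot_p_value()
    p_tea = tea_tasting_p_value()
    # Print the exact odds and a decimal approximation for each example.
    print(f"Arbuthnot: 1 in {p_births.denominator:,}")
    print(f"Lady tasting tea: {p_tea} (about {float(p_tea):.1%})")
```

Exact fractions are used here so that the very small sign-test probability is not affected by floating-point rounding.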
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-4.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-4.md new file mode 100644 index 000000000..650003790 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-4.md @@ -0,0 +1,430 @@ +--- +title: "Statistical hypothesis test" +chunk: 5/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +=== Courtroom trial === +A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough evidence for the prosecution is the defendant convicted. +At the start of the procedure, there are two hypotheses, $H_{0}$: "the defendant is not guilty", and $H_{1}$: "the defendant is guilty". The first one, $H_{0}$, is called the null hypothesis. The second one, $H_{1}$, is called the alternative hypothesis. It is the alternative hypothesis that one hopes to support. +The hypothesis of innocence is rejected only when an error is very unlikely, because one does not want to convict an innocent defendant. Such an error is called error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an error of the second kind (acquitting a person who committed the crime), is more common. + +A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt").
In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence. + +=== Clairvoyant card game === +A person (the subject) is tested for clairvoyance. They are shown the back face of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X. +As we try to find evidence of their clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is: the person is (more or less) clairvoyant. +If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly p. The hypotheses, then, are: + +null hypothesis: $H_{0}:p={\tfrac {1}{4}}$ (just guessing) +and + +alternative hypothesis: $H_{1}:p>{\tfrac {1}{4}}$ (true clairvoyant). +When the test subject correctly predicts all 25 cards, we will consider them clairvoyant, and reject the null hypothesis. The same holds for 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider them so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? With the choice c=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c=10.
In the first case almost no test subjects will be recognized as clairvoyant; in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a false positive, or Type I error. With c = 25 the probability of such an error is: + +$$P({\text{reject }}H_{0}\mid H_{0}{\text{ is valid}})=P\left(X=25\mid p={\frac {1}{4}}\right)=\left({\frac {1}{4}}\right)^{25}\approx 10^{-15},$$ + +which is very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times. +Being less critical, with c = 10, gives: + +$$P({\text{reject }}H_{0}\mid H_{0}{\text{ is valid}})=P\left(X\geq 10\mid p={\frac {1}{4}}\right)=\sum _{k=10}^{25}P\left(X=k\mid p={\frac {1}{4}}\right)=\sum _{k=10}^{25}{\binom {25}{k}}\left(1-{\frac {1}{4}}\right)^{25-k}\left({\frac {1}{4}}\right)^{k}\approx 0.0713.$$ + +Thus, c = 10 yields a much greater probability of a false positive. +Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.)
Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus: + +$$P({\text{reject }}H_{0}\mid H_{0}{\text{ is valid}})=P\left(X\geq c\mid p={\frac {1}{4}}\right)\leq 0.01.$$ + +From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select $c=13$. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-5.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-5.md new file mode 100644 index 000000000..ef3ea09fd --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-5.md @@ -0,0 +1,27 @@ +--- +title: "Statistical hypothesis test" +chunk: 6/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +== Variations and sub-classes == +Statistical hypothesis testing is a key technique of both frequentist inference and Bayesian inference, although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. This probability of making an incorrect decision is not the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true.
This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis. +One naïve Bayesian approach to hypothesis testing is to base decisions on the posterior probability, but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the power of tests, i.e. the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of sample size determination prior to the collection of data. + +== Neyman–Pearson hypothesis testing == +An example of Neyman–Pearson hypothesis testing (or null hypothesis statistical significance testing) can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The Neyman–Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source. 
Notice also that usually there are problems for proving a negative. Null hypotheses should be at least falsifiable. +Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions. The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses. +The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman–Pearson test is more like multiple choice. In the view of Tukey the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments lead to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0,1,2,3... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933 also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) t-test, "there can be no better test for the hypothesis under consideration" (p 321). Neyman–Pearson theory was proving the optimality of Fisherian methods from its inception. +Fisher's significance testing has proven a popular flexible statistical tool in application with little mathematical growth potential. Neyman–Pearson hypothesis testing is claimed as a pillar of mathematical statistics, creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory and game theory. 
Both formulations have been successful, but the successes have been of a different character. +The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman–Pearson theory in graduate school. Mathematicians are proud of uniting the formulations. Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible or complementary. The dispute has become more complex since Bayesian inference has achieved respectability. +The terminology is inconsistent. Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion. +Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control, however, he strongly disagreed that hypothesis testing could be useful for scientists. +Hypothesis testing provides a means of finding test statistics used in significance testing. The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct. They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent. While the existing merger of Fisher and Neyman–Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered. 
+ +== Criticism == + +Many of the criticisms of statistical hypothesis testing can be summarized by the following issues: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-6.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-6.md new file mode 100644 index 000000000..b2965b29a --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-6.md @@ -0,0 +1,25 @@ +--- +title: "Statistical hypothesis test" +chunk: 7/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +The interpretation of a p-value depends on the stopping rule and on the definition of multiple comparison. The former often changes during the course of a study and the latter is unavoidably ambiguous (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't"). +Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson, which are conceptually distinct. +Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments. +Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias. Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused. +When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g. increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%.
However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.) because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd. +Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts. If the decisions are based on convention they are termed arbitrary or mindless while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples so "...it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis." "Statistically significant findings are often misleading" in psychology. Statistical significance does not imply practical significance, and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis. +"[I]t does not tell us what we want to know". Lists of dozens of complaints are available. +Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis. The continuing controversy concerns the selection of the best statistical practices for the near-term future given the existing practices. However, adequate research design can minimize this issue. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change. 
+Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review, medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias, and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively. Textbooks have added some cautions, and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Few major organizations have abandoned use of significance tests although some have discussed doing so. For instance, in 2023, the editors of the Journal of Physiology "strongly recommend the use of estimation methods for those publishing in The Journal" (meaning the magnitude of the effect size (to allow readers to judge whether a finding has practical, physiological, or clinical relevance) and confidence intervals to convey the precision of that estimate), saying "Ultimately, it is the physiological importance of the data that those publishing in The Journal of Physiology should be most concerned with, rather than the statistical significance." +P-values are random variables. Therefore, the decision of a statistical test is a random variable; to understand its stability, approaches including the following have been proposed: + +bootstrapping the "reproducibility probability". 
+Bootstrapping the sampling distribution of the p-values + +== Alternatives == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-7.md b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-7.md new file mode 100644 index 000000000..4021a458d --- /dev/null +++ b/data/en.wikipedia.org/wiki/Statistical_hypothesis_test-7.md @@ -0,0 +1,33 @@ +--- +title: "Statistical hypothesis test" +chunk: 8/8 +source: "https://en.wikipedia.org/wiki/Statistical_hypothesis_test" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:13.457599+00:00" +instance: "kb-cron" +--- + +A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision, but to an estimated value with an interval estimate; this data-analysis philosophy is broadly referred to as estimation statistics. Estimation statistics can be accomplished with either frequentist or Bayesian methods. +Critics of significance testing have advocated basing inference less on p-values and more on confidence intervals for effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. But none of these suggested alternatives inherently produces a decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals: "The distinction between the ... approaches is largely one of reporting and interpretation." +Bayesian inference is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960)). For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. Psychologist John K.
Kruschke has suggested Bayesian estimation as an alternative for the t-test and has also contrasted Bayesian estimation for assessing null values with Bayesian model comparison for hypothesis testing. Two competing models/hypotheses can be compared using Bayes factors. Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used. Neither the prior probabilities nor the probability distribution of the test statistic under the alternative hypothesis are often available in the social sciences. +Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected. Neither Fisher's significance testing, nor Neyman–Pearson hypothesis testing can provide this information, and do not claim to. The probability a hypothesis is true can only be derived from use of Bayes' Theorem, which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of subjectivity in the form of the prior probability. Fisher's strategy is to sidestep this with the p-value (an objective index based on the data alone) followed by inductive inference, while Neyman–Pearson devised their approach of inductive behaviour. + +== See also == + +== References == + +== Further reading == +Lehmann E.L. (1992) "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: Breakthroughs in Statistics, Volume 1, (Eds Kotz, S., Johnson, N.L.), Springer-Verlag. ISBN 0-387-94037-5 (followed by reprinting of the paper) +Neyman, J.; Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Philosophical Transactions of the Royal Society A. 231 (694–706): 289–337. Bibcode:1933RSPTA.231..289N. doi:10.1098/rsta.1933.0009. 
+ +== External links == + +"Statistical hypotheses, verification of", Encyclopedia of Mathematics, EMS Press, 2001 [1994] +Bayesian critique of classical hypothesis testing +Critique of classical hypothesis testing highlighting long-standing qualms of statisticians +Statistical Tests Overview: How to choose the correct statistical test +[1] Statistical Analysis based Hypothesis Testing Method in Biological Knowledge Discovery; Md. Naseef-Ur-Rahman Chowdhury, Suvankar Paul, Kazi Zakia Sultana + +=== Online calculators === +Some p-value and hypothesis test calculators. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Steiner_system-0.md b/data/en.wikipedia.org/wiki/Steiner_system-0.md new file mode 100644 index 000000000..8da961fe4 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Steiner_system-0.md @@ -0,0 +1,63 @@ +--- +title: "Steiner system" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Steiner_system" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:14.665238+00:00" +instance: "kb-cron" +--- + +In combinatorial mathematics, a Steiner system (named after Jakob Steiner) is a type of block design, specifically a t-design with λ = 1 and t = 2 or (recently) t ≥ 2. +A Steiner system with parameters t, k, n, written S(t,k,n), is an n-element set S together with a set of k-element subsets of S (called blocks) with the property that each t-element subset of S is contained in exactly one block. In an alternative notation for block designs, an S(t,k,n) would be a t-(n,k,1) design. +This definition is relatively new. The classical definition of Steiner systems also required that k = t + 1. An S(2,3,n) was (and still is) called a Steiner triple (or triad) system, while an S(3,4,n) is called a Steiner quadruple system, and so on. With the generalization of the definition, this naming system is no longer strictly adhered to. 
+Long-standing problems in design theory were whether there exist any nontrivial Steiner systems (nontrivial meaning t < k < n) with t ≥ 6; also whether infinitely many have t = 4 or 5. Both existences were proved by Peter Keevash in 2014. His proof is non-constructive and, as of 2019, no actual Steiner systems are known for large values of t. + +== Types of Steiner systems == +A finite projective plane of order q, with the lines as blocks, is an S(2, q + 1, q^2 + q + 1), since it has q^2 + q + 1 points, each line passes through q + 1 points, and each pair of distinct points lies on exactly one line. +A finite affine plane of order q, with the lines as blocks, is an S(2, q, q^2). An affine plane of order q can be obtained from a projective plane of the same order by removing one block and all of the points in that block from the projective plane. Choosing different blocks to remove in this way can lead to non-isomorphic affine planes. +An S(3,4,n) is called a Steiner quadruple system. A necessary and sufficient condition for the existence of an S(3,4,n) is that n ≡ 2 or 4 (mod 6). The abbreviation SQS(n) is often used for these systems. Up to isomorphism, SQS(8) and SQS(10) are unique, there are 4 SQS(14)s and 1,054,163 SQS(16)s. +An S(4,5,n) is called a Steiner quintuple system. A necessary condition for the existence of such a system is that n ≡ 3 or 5 (mod 6), which comes from considerations that apply to all the classical Steiner systems. An additional necessary condition is that n ≢ 4 (mod 5), which comes from the fact that the number of blocks must be an integer. Sufficient conditions are not known. There is a unique Steiner quintuple system of order 11, but none of order 15 or order 17. Systems are known for orders 23, 35, 47, 71, 83, 107, 131, 167 and 243. The smallest order for which the existence is not known (as of 2011) is 21.
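The defining property of an S(t,k,n), that every t-element subset of the point set lies in exactly one block, can be checked by brute force for small systems. The sketch below is illustrative only and is not taken from the source: the function name `is_steiner_system` and the particular labelling of the seven lines are assumptions. It tests the projective plane of order q = 2, which by the formula above should be an S(2, 3, 7).

```python
from itertools import combinations


def is_steiner_system(t, k, points, blocks):
    """Return True if `blocks` forms an S(t, k, n) on `points`, n = len(points)."""
    point_set = set(points)
    block_sets = [frozenset(b) for b in blocks]
    # Every block must be a k-element subset of the point set.
    if any(len(b) != k or not b <= point_set for b in block_sets):
        return False
    # Count how many blocks contain each t-element subset of the points.
    cover = {frozenset(s): 0 for s in combinations(sorted(point_set), t)}
    for b in block_sets:
        for s in combinations(sorted(b), t):
            cover[frozenset(s)] += 1
    # Steiner property: each t-subset is covered exactly once.
    return all(count == 1 for count in cover.values())


# One common labelling of the seven lines of the projective plane of order 2
# (an assumption for illustration; any isomorphic relabelling would do).
FANO_LINES = [
    (1, 2, 3), (1, 4, 5), (1, 6, 7),
    (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6),
]

if __name__ == "__main__":
    print(is_steiner_system(2, 3, range(1, 8), FANO_LINES))
```

The check enumerates all t-subsets, so it is only practical for small n; it is meant to make the definition concrete, not to search for new systems.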
+ +=== Steiner triple systems === +An S(2,3,n) is called a Steiner triple system, and its blocks are called triples. It is common to see the abbreviation STS(n) for a Steiner triple system of order n. The total number of pairs is n(n-1)/2, of which three appear in a triple, and so the total number of triples is n(n−1)/6. This shows that n must be of the form 6k+1 or 6k + 3 for some k. The fact that this condition on n is sufficient for the existence of an S(2,3,n) was proved by Raj Chandra Bose and T. Skolem. The projective plane of order 2 (the Fano plane) is an STS(7) and the affine plane of order 3 is an STS(9). Up to isomorphism, the STS(7) and STS(9) are unique, there are two STS(13)s, 80 STS(15)s, and 11,084,874,829 STS(19)s. +We can define a multiplication on the set S using the Steiner triple system by setting aa = a for all a in S, and ab = c if {a,b,c} is a triple. This makes S an idempotent, commutative quasigroup. It has the additional property that ab = c implies bc = a and ca = b. Conversely, any (finite) quasigroup with these properties arises from a Steiner triple system. Commutative idempotent quasigroups satisfying this additional property are called Steiner quasigroups. + +=== Resolvable Steiner systems === +Some of the S(2,3,n) systems can have their triples partitioned into (n-1)/2 sets each having (n/3) pairwise disjoint triples. This is called resolvable and such systems are called Kirkman triple systems after Thomas Kirkman, who studied such resolvable systems before Steiner. Dale Mesner, Earl Kramer, and others investigated collections of Steiner triple systems that are mutually disjoint (i.e., no two Steiner systems in such a collection share a common triplet). 
It is known (Bays 1917, Kramer & Mesner 1974) that seven different S(2,3,9) systems can be generated to together cover all 84 triplets on a 9-set; it was also known by them that there are 15360 different ways to find such 7-sets of solutions, which reduce to two non-isomorphic solutions under relabeling, with multiplicities 6720 and 8640 respectively. +The corresponding question for finding thirteen different disjoint S(2,3,15) systems was asked by James Sylvester in 1860 as an extension of the Kirkman's schoolgirl problem, namely whether Kirkman's schoolgirls could march for an entire term of 13 weeks with no triplet of girls being repeated over the whole term. The question was solved by RHF Denniston in 1974, who constructed Week 1 as follows: + +Day 1 ABJ CEM FKL HIN DGO +Day 2 ACH DEI FGM JLN BKO +Day 3 ADL BHM GIK CFN EJO +Day 4 AEG BIL CJK DMN FHO +Day 5 AFI BCD GHJ EKN LMO +Day 6 AKM DFJ EHL BGN CIO +Day 7 BEF CGL DHK IJM ANO \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Steiner_system-1.md b/data/en.wikipedia.org/wiki/Steiner_system-1.md new file mode 100644 index 000000000..811f9b70c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Steiner_system-1.md @@ -0,0 +1,543 @@ +--- +title: "Steiner system" +chunk: 2/4 +source: "https://en.wikipedia.org/wiki/Steiner_system" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:14.665238+00:00" +instance: "kb-cron" +--- + +for girls labeled A to O, and constructed each subsequent week's solution from its immediate predecessor by changing A to B, B to C, ... L to M and M back to A, all while leaving N and O unchanged. The Week 13 solution, upon undergoing that relabeling, returns to the Week 1 solution. Denniston reported in his paper that the search he employed took 7 hours on an Elliott 4130 computer at the University of Leicester, and he immediately ended the search on finding the solution above, not looking to establish uniqueness. 
The number of non-isomorphic solutions to Sylvester's problem remains unknown as of 2021. + +== Properties == +It is clear from the definition of S(t, k, n) that $1<t<k<n$ [...] $t>T_{0}$. We aim to estimate $(\alpha _{1T_{0}+1},\ldots ,\alpha _{1T})$. +Imposing some structure + +$$Y_{it}^{N}=\delta _{t}+\theta _{t}Z_{i}+\lambda _{t}\mu _{i}+\varepsilon _{it}$$ + +and assuming there exist some optimal weights $w_{2},\ldots ,w_{J}$ such that + +$$Y_{1t}=\sum _{j=2}^{J}w_{j}Y_{jt}$$ + +for $t\leqslant T_{0}$, the synthetic controls approach suggests using these weights to estimate the counterfactual + +$$Y_{1t}^{N}=\sum _{j=2}^{J}w_{j}Y_{jt}$$ + +for $t>T_{0}$. So under some regularity conditions, such weights would provide estimators for the treatment effects of interest. In essence, the method uses the idea of matching and using the training data pre-intervention to set up the weights and hence a relevant control post-intervention.
+Synthetic controls have been used in a number of empirical applications, ranging from studies examining natural catastrophes and growth, or civil conflicts and growth, studies that examine the effect of vaccine mandates on childhood immunization, and studies linking political murders to house prices. Recently, the synthetic control method is actively used in new drug development when evaluating the causal impact of a treatment or intervention, especially in situations where randomized controlled trials (RCTs) are not feasible. + + +== See also == +Difference in difference +Regression discontinuity +Instrumental variables estimation + + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Taguchi_methods-0.md b/data/en.wikipedia.org/wiki/Taguchi_methods-0.md new file mode 100644 index 000000000..98a0d3de1 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Taguchi_methods-0.md @@ -0,0 +1,61 @@ +--- +title: "Taguchi methods" +chunk: 1/2 +source: "https://en.wikipedia.org/wiki/Taguchi_methods" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:21.761240+00:00" +instance: "kb-cron" +--- + +Taguchi methods (Japanese: タグチメソッド) are statistical methods, sometimes called robust design methods, developed by Genichi Taguchi to improve the quality of manufactured goods, and more recently also applied to engineering, biotechnology, marketing and advertising. Professional statisticians have welcomed the goals and improvements brought about by Taguchi methods, particularly by Taguchi's development of designs for studying variation, but have criticized the inefficiency of some of Taguchi's proposals. +Taguchi's work includes three principal contributions to statistics: + +A specific loss function +The philosophy of off-line quality control; and +Innovations in the design of experiments. 
+ +== Loss functions == + +=== Loss functions in the statistical theory === +Traditionally, statistical methods have relied on mean-unbiased estimators of treatment effects: Under the conditions of the Gauss–Markov theorem, least squares estimators have minimum variance among all mean-unbiased linear estimators. The emphasis on comparisons of means also draws (limiting) comfort from the law of large numbers, according to which the sample means converge to the true mean. Fisher's textbook on the design of experiments emphasized comparisons of treatment means. +However, loss functions were avoided by Ronald A. Fisher. + +=== Taguchi's use of loss functions === +Taguchi knew statistical theory mainly from the followers of Ronald A. Fisher, who also avoided loss functions. +Reacting to Fisher's methods in the design of experiments, Taguchi interpreted Fisher's methods as being adapted for seeking to improve the mean outcome of a process. Indeed, Fisher's work had been largely motivated by programmes to compare agricultural yields under different treatments and blocks, and such experiments were done as part of a long-term programme to improve harvests. +However, Taguchi realised that in much industrial production, there is a need to produce an outcome on target, for example, to machine a hole to a specified diameter, or to manufacture a cell to produce a given voltage. He also realised, as had Walter A. Shewhart and others before him, that excessive variation lay at the root of poor manufactured quality and that reacting to individual items inside and outside specification was counterproductive. +He therefore argued that quality engineering should start with an understanding of quality costs in various situations. In much conventional industrial engineering, the quality costs are simply represented by the number of items outside specification multiplied by the cost of rework or scrap. 
However, Taguchi insisted that manufacturers broaden their horizons to consider cost to society. Though the short-term costs may simply be those of non-conformance, any item manufactured away from nominal would result in some loss to the customer or the wider community through early wear-out; difficulties in interfacing with other parts, themselves probably wide of nominal; or the need to build in safety margins. These losses are externalities and are usually ignored by manufacturers, which are more interested in their private costs than social costs. Such externalities prevent markets from operating efficiently, according to analyses of public economics. Taguchi argued that such losses would inevitably find their way back to the originating corporation (in an effect similar to the tragedy of the commons), and that by working to minimise them, manufacturers would enhance brand reputation, win markets and generate profits. +Such losses are, of course, very small when an item is near to negligible. Donald J. Wheeler characterised the region within specification limits as where we deny that losses exist. As we diverge from nominal, losses grow until the point where losses are too great to deny and the specification limit is drawn. All these losses are, as W. Edwards Deming would describe them, unknown and unknowable, but Taguchi wanted to find a useful way of representing them statistically. Taguchi specified three situations: + +Larger the better (for example, agricultural yield); +Smaller the better (for example, carbon dioxide emissions); and +On-target, minimum-variation (for example, a mating part in an assembly). +The first two cases are represented by simple monotonic loss functions. In the third case, Taguchi adopted a squared-error loss function for several reasons: + +It is the first "symmetric" term in the Taylor series expansion of real analytic loss-functions. +Total loss is measured by the variance. 
For uncorrelated random variables, as variance is additive the total loss is an additive measurement of cost. +The squared-error loss function is widely used in statistics, following Gauss's use of the squared-error loss function in justifying the method of least squares. + +=== Reception of Taguchi's ideas by statisticians === +Though many of Taguchi's concerns and conclusions are welcomed by statisticians and economists, some ideas have been especially criticized. For example, Taguchi's recommendation that industrial experiments maximise some signal-to-noise ratio (representing the magnitude of the mean of a process compared to its variation) has been criticized. + +== Off-line quality control == + +=== Taguchi's rule for manufacturing === +Taguchi realized that the best opportunity to eliminate variation of the final product quality is during the design of a product and its manufacturing process. Consequently, he developed a strategy for quality engineering that can be used in both contexts. The process has three stages: + +System design +Parameter (measure) design +Tolerance design + +==== System design ==== +This is design at the conceptual level, involving creativity and innovation. + +==== Parameter design ==== +Once the concept is established, the nominal values of the various dimensions and design parameters need to be set, the detail design phase of conventional engineering. Taguchi's radical insight was that the exact choice of values required is under-specified by the performance requirements of the system. In many circumstances, this allows the parameters to be chosen so as to minimize the effects on performance arising from variation in manufacture, environment and cumulative damage. This is sometimes called robustification. +Robust parameter designs consider controllable and uncontrollable noise variables; they seek to exploit relationships and optimize settings that minimize the effects of the noise variables. 
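The squared-error loss adopted for the on-target case, and the parameter-design aim of reducing the effect of variation, can be tied together with a small numerical sketch. The cost coefficient `k`, the target value and both samples are hypothetical; the point is only that the average quadratic loss splits into a variance term plus a squared off-target term, so items that all sit inside a specification band can still carry loss.

```python
from statistics import fmean, pvariance


def quadratic_loss(y, target, k=1.0):
    """Squared-error loss for a single item; k is an assumed cost coefficient."""
    return k * (y - target) ** 2


def average_loss(values, target, k=1.0):
    """Average loss computed item by item."""
    return fmean([quadratic_loss(y, target, k) for y in values])


def loss_from_moments(values, target, k=1.0):
    """Same quantity written as k * (variance + (mean - target)^2)."""
    mu = fmean(values)
    return k * (pvariance(values, mu) + (mu - target) ** 2)


TARGET = 10.0          # hypothetical nominal hole diameter (mm)
SPEC = (9.5, 10.5)     # hypothetical specification limits (mm)

centered = [9.8, 10.1, 10.0, 10.2, 9.9]        # on target, more spread
off_target = [10.3, 10.4, 10.3, 10.4, 10.35]   # tight, but away from nominal

for name, sample in (("centered", centered), ("off_target", off_target)):
    in_spec = all(SPEC[0] <= y <= SPEC[1] for y in sample)
    print(name, "all in spec:", in_spec,
          "average loss:", round(average_loss(sample, TARGET), 4))
```

Both hypothetical samples lie entirely within the specification limits, yet the off-target sample has the larger average loss, which is the kind of cost that counting out-of-specification items alone would not show.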
+ +==== Tolerance design ==== + +With a successfully completed parameter design, and an understanding of the effect that the various parameters have on performance, resources can be focused on reducing and controlling variation in the critical few dimensions. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Taguchi_methods-1.md b/data/en.wikipedia.org/wiki/Taguchi_methods-1.md new file mode 100644 index 000000000..c4502bc42 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Taguchi_methods-1.md @@ -0,0 +1,46 @@ +--- +title: "Taguchi methods" +chunk: 2/2 +source: "https://en.wikipedia.org/wiki/Taguchi_methods" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:21.761240+00:00" +instance: "kb-cron" +--- + +== Design of experiments == +Taguchi developed his experimental theories independently. Taguchi read works following R. A. Fisher only in 1954. + +=== Outer arrays === +Taguchi's designs aimed to allow greater understanding of variation than did many of the traditional designs from the analysis of variance (following Fisher). Taguchi contended that conventional sampling is inadequate here as there is no way of obtaining a random sample of future conditions. In Fisher's design of experiments and analysis of variance, experiments aim to reduce the influence of nuisance factors to allow comparisons of the mean treatment-effects. Variation becomes even more central in Taguchi's thinking. +Taguchi proposed extending each experiment with an "outer array" (possibly an orthogonal array); the "outer array" should simulate the random environment in which the product would function. This is an example of judgmental sampling. Many quality specialists have been using "outer arrays". +Later innovations in outer arrays resulted in "compounded noise." This involves combining a few noise factors to create two levels in the outer array: First, noise factors that drive output lower, and second, noise factors that drive output higher. 
"Compounded noise" simulates the extremes of noise variation but uses fewer experimental runs than would previous Taguchi designs. + +=== Management of interactions === + +==== Interactions, as treated by Taguchi ==== +Many of the orthogonal arrays that Taguchi has advocated are saturated arrays, allowing no scope for estimation of interactions. This is a continuing topic of controversy. However, this is only true for "control factors" or factors in the "inner array". By combining an inner array of control factors with an outer array of "noise factors", Taguchi's approach provides "full information" on control-by-noise interactions, it is claimed. Taguchi argues that such interactions have the greatest importance in achieving a design that is robust to noise factor variation. The Taguchi approach provides more complete interaction information than typical fractional factorial designs, its adherents claim. + +Followers of Taguchi argue that the designs offer rapid results and that interactions can be eliminated by proper choice of quality characteristics. That notwithstanding, a "confirmation experiment" offers protection against any residual interactions. If the quality characteristic represents the energy transformation of the system, then the "likelihood" of control factor-by-control factor interactions is greatly reduced, since "energy" is "additive". + +==== Inefficiencies of Taguchi's designs ==== +Interactions are part of the real world. In Taguchi's arrays, interactions are confounded and difficult to resolve. +Statisticians in response surface methodology (RSM) advocate the "sequential assembly" of designs: In the RSM approach, a screening design is followed by a "follow-up design" that resolves only the confounded interactions judged worth resolution. 
A second follow-up design may be added (time and resources allowing) to explore possible high-order univariate effects of the remaining variables, as high-order univariate effects are less likely in variables already eliminated for having no linear effect. With the economy of screening designs and the flexibility of follow-up designs, sequential designs have great statistical efficiency. The sequential designs of response surface methodology require far fewer experimental runs than would a sequence of Taguchi's designs. + +== Assessment == +Genichi Taguchi has made valuable contributions to statistics and engineering. His emphasis on loss to society, techniques for investigating variation in experiments, and his overall strategy of system, parameter and tolerance design have been influential in improving manufactured quality worldwide. + +== See also == +Design of experiments – Design of tasks +Optimal design – Experimental design that is optimal with respect to some statistical criterionPages displaying short descriptions of redirect targets +Orthogonal array – Type of mathematical array +Quality management – Business process to aid consistent product fitness +Response surface methodology – Statistical approach +Sales process engineering – Systematic design of sales processes +Six Sigma – Business process improvement technique +Engineering tolerance – Permissible limit or limits of variation +Probabilistic design – Discipline within engineering design + +== References == + +== Bibliography == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Technical_peer_review-0.md b/data/en.wikipedia.org/wiki/Technical_peer_review-0.md new file mode 100644 index 000000000..a95076fe6 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Technical_peer_review-0.md @@ -0,0 +1,43 @@ +--- +title: "Technical peer review" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Technical_peer_review" +category: "reference" +tags: "science, encyclopedia" +date_saved: 
"2026-05-05T09:53:22.988147+00:00" +instance: "kb-cron" +--- + +In engineering, technical peer review is a well defined review process for finding and correcting defects conducted by a team of peers with assigned roles. Technical peer reviews are carried out by peers representing areas of life cycle affected by material being reviewed (usually limited to 6 or fewer people). Technical peer reviews are held within development phases, between milestone reviews, on completed products, or on completed portions of products. A technical peer review may also be called an engineering peer review, a product peer review, a peer review/inspection or an inspection. + + +== Overview == +The purpose of a technical peer review is to remove defects as early as possible in the development process. By removing defects at their origin (e.g., requirements and design documents, test plans and procedures, software code, etc.), technical peer reviews prevent defects from propagating through multiple phases and work products and reduce the overall amount of rework necessary on projects. Improved team efficiency is a side effect (e.g., by improving team communication, integrating the viewpoints of various engineering specialty disciplines, more quickly bringing new members up to speed, and educating project members about effective development practices). +In CMMI, peer reviews are used as a principal means of verification in the Verification process area and as an objective evaluation method in the Process and Product Quality Assurance process area. The results of technical peer reviews can be reported at milestone reviews. +Peer reviews are distinct from management reviews, which are conducted by management representatives rather than by colleagues and for management and control purposes rather than for technical evaluation. This is especially true of line managers of the author or other participants in the review. 
A policy of encouraging management to stay out of peer reviews encourages the peer review team to concentrate on the product being reviewed and not on the people or personalities involved. +They are also distinct from software audit reviews, which are conducted by personnel external to the project, to evaluate compliance with specifications, standards, contractual agreements, or other criteria. A software peer review is a type of technical peer review. The IEEE defines formal structures, roles, and processes for software peer reviews. + + +== Roles of participants == +Moderator +Responsible for conducting the technical peer review process and collecting inspection data. The moderator plays a key role in all stages of the process except rework and is typically required to perform several duties during a technical peer review in addition to inspectors' tasks. +Inspectors +Responsible for finding defects in work product from a general point of view, as well as defects that affect their area of expertise. +Author +Provides information about work product during all stages of process. The author is responsible for correcting all major defects and any minor and trivial defects that cost and schedule permit, as well as performing the duties of an inspector. +Reader +Guides team through work product during the technical peer review meeting. The reader reads or paraphrases work product in detail and also may perform the duties of an inspector. +Recorder +Accurately records each defect found during inspection meeting on the Inspection Defect List, and may also perform the duties of an inspector. + + +== Vested interest of reviewers == +There are two philosophies about the vested interest of the inspectors in the product under review. On one hand, project personnel who have a vested interest in the work product under review have the most knowledge of the product and are motivated to find and fix defects. 
On the other hand, personnel from outside the project who do not have a vested interest in the work product bring objectivity and a fresh viewpoint to the technical peer review team. +Each inspector is invited to disclose vested interests to the rest of the technical peer review panel so the moderator can exercise sound judgement in evaluating the inspector's inputs. + + +== References == + + +== External links == +Products Comparison \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Theory-driven_evaluation-0.md b/data/en.wikipedia.org/wiki/Theory-driven_evaluation-0.md index 9edde1f59..6032e915d 100644 --- a/data/en.wikipedia.org/wiki/Theory-driven_evaluation-0.md +++ b/data/en.wikipedia.org/wiki/Theory-driven_evaluation-0.md @@ -4,7 +4,7 @@ chunk: 1/2 source: "https://en.wikipedia.org/wiki/Theory-driven_evaluation" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:20.252360+00:00" +date_saved: "2026-05-05T09:52:22.992674+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Theory-driven_evaluation-1.md b/data/en.wikipedia.org/wiki/Theory-driven_evaluation-1.md index db4ad55e2..763af07af 100644 --- a/data/en.wikipedia.org/wiki/Theory-driven_evaluation-1.md +++ b/data/en.wikipedia.org/wiki/Theory-driven_evaluation-1.md @@ -4,7 +4,7 @@ chunk: 2/2 source: "https://en.wikipedia.org/wiki/Theory-driven_evaluation" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:01:20.252360+00:00" +date_saved: "2026-05-05T09:52:22.992674+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Treatment_and_control_groups-0.md b/data/en.wikipedia.org/wiki/Treatment_and_control_groups-0.md new file mode 100644 index 000000000..3fd55644c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Treatment_and_control_groups-0.md @@ -0,0 +1,31 @@ +--- +title: "Treatment and control groups" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/Treatment_and_control_groups" +category: 
"reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:24.158140+00:00" +instance: "kb-cron" +--- + +In the design of experiments, hypotheses are applied to experimental units in a treatment group. In comparative experiments, members of a control group receive a standard treatment, a placebo, or no treatment at all. There may be more than one treatment group, more than one control group, or both. +A placebo control group can be used to support a double-blind study, in which some subjects are given an ineffective treatment (in medical studies typically a sugar pill) to minimize differences in the experiences of subjects in the different groups; this is done in a way that ensures no participant in the experiment (subject or experimenter) knows to which group each subject belongs. In such cases, a third, non-treatment control group can be used to measure the placebo effect directly, as the difference between the responses of placebo subjects and untreated subjects, perhaps paired by age group or other factors (such as being twins). +For the conclusions drawn from the results of an experiment to have validity, it is essential that the items or patients assigned to treatment and control groups be representative of the same population. In some experiments, such as many in agriculture or psychology, this can be achieved by randomly assigning items from a common population to one of the treatment and control groups. In studies of twins involving just one treatment group and a control group, it is statistically efficient to do this random assignment separately for each pair of twins, so that one is in the treatment group and one in the control group. +In some medical studies, where it may be unethical not to treat patients who present with symptoms, controls may be given a standard treatment, rather than no treatment at all. 
An alternative is to select controls from a wider population, provided that this population is well-defined and that those presenting with symptoms at the clinic are representative of those in the wider population. Another method to reduce ethical concerns would be to test early-onset symptoms, with enough time later to offer real treatments to the control subjects, and let those subjects know the first treatments are "experimental" and might not be as effective as later treatments, again with the understanding there would be ample time to try other remedies.
+
+
+== Relevance ==
+A clinical control group can be a placebo arm, or it can involve an old method used to address a clinical outcome when testing a new idea. For example, in a study published by the British Medical Journal in 1995 on the effects of strict versus more relaxed blood pressure control in diabetic patients, the clinical control group consisted of the diabetic patients who did not receive tight blood pressure control. In order to qualify for the study, the patients had to meet the inclusion criteria and not match the exclusion criteria. Once the study population was determined, the patients were placed in either the experimental group (strict blood pressure control, <150/80 mmHg) or the control group (non-strict blood pressure control, <180/110 mmHg). There was a wide variety of end points for patients, such as death, myocardial infarction, and stroke. The study was stopped before completion because strict blood pressure control was so much superior to the relaxed blood pressure control received by the clinical control group. The study was no longer considered ethical because tight blood pressure control was so much more effective at preventing end points that the clinical control group had to be discontinued.
+The clinical control group is not always a placebo group. Sometimes the clinical control group can involve comparing a new drug to an older drug in a superiority trial.
In a superiority trial, the clinical control group is the older medication rather than the new medication. For example, in the ALLHAT trial, thiazide diuretics were demonstrated to be superior to calcium channel blockers or angiotensin-converting enzyme (ACE) inhibitors in reducing cardiovascular events in high-risk patients with hypertension. In the ALLHAT study, the clinical control group was not a placebo; it was ACE inhibitors or calcium channel blockers.
+Overall, clinical control groups can either be a placebo or an old standard of therapy.
+
+
+== See also ==
+Scientific control
+Wait list control group
+Blocking (statistics)
+Hawthorne effect
+Lessebo effect
+
+
+== References ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-0.md b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-0.md
new file mode 100644
index 000000000..6d98c4287
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-0.md
@@ -0,0 +1,113 @@
+---
+title: "Type I and type II errors"
+chunk: 1/5
+source: "https://en.wikipedia.org/wiki/Type_I_and_type_II_errors"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T09:52:25.345633+00:00"
+instance: "kb-cron"
+---
+
+A type I error, or a false positive, is the incorrect rejection of a true null hypothesis in statistical hypothesis testing. A type II error, or a false negative, is the incorrect failure to reject a false null hypothesis.
+An analysis commits a Type I error when some baseline assumption is incorrectly rejected because of new, misleading information. Meanwhile, a Type II error is made when such an assumption is maintained, due to flawed or insufficient data, when better measurements would have shown it to be untrue.
For example, in the context of medical testing, if we consider the null hypothesis to be "This patient does not have the disease," a diagnosis that the disease is present when it is not is a Type I error, while a diagnosis that the patient does not have the disease when it is present would be a Type II error. The manner in which a null hypothesis frames contextually default expectations influences the specific ways in which type I errors and type II errors manifest, and this varies by context and application. Generally, the risk of such errors cannot be entirely eliminated, only traded off between the two types, for example by changing the significance threshold.
+Knowledge of type I errors and type II errors is applied widely in the fields of medical science, biometrics and computer science. Minimising these errors is an object of study within statistical theory, though complete elimination of either is impossible when relevant outcomes are not determined by known, observable, causal processes.
+
+== Definition ==
+
+=== Statistical background ===
+In statistical test theory, the notion of a statistical error is an integral part of hypothesis testing. The test chooses between two competing propositions: the null hypothesis, denoted by $H_0$, and the alternative hypothesis, denoted by $H_1$. This is conceptually similar to the judgement in a court trial. The null hypothesis corresponds to the position of the defendant: just as he is presumed to be innocent until proven guilty, so is the null hypothesis presumed to be true until the data provide convincing evidence against it. The alternative hypothesis corresponds to the position against the defendant. Specifically, the null hypothesis also involves the absence of a difference or the absence of an association. Thus, the null hypothesis can never be that there is a difference or an association.
+If the result of the test corresponds with reality, then a correct decision has been made. However, if the result of the test does not correspond with reality, then an error has occurred. There are two situations in which the decision is wrong. The null hypothesis may be true, whereas we reject $H_0$. On the other hand, the alternative hypothesis $H_1$ may be true, whereas we do not reject $H_0$. Two types of error are distinguished: type I error and type II error.
+
+=== Type I error ===
+The first kind of error is the mistaken rejection of a true null hypothesis as the result of a test procedure. This kind of error is called a type I error (false positive) and is sometimes called an error of the first kind. In terms of the courtroom example, a type I error corresponds to convicting an innocent defendant.
+
+=== Type II error ===
+The second kind of error is the mistaken failure to reject a false null hypothesis as the result of a test procedure. This sort of error is called a type II error (false negative) and is also referred to as an error of the second kind. In terms of the courtroom example, a type II error corresponds to acquitting a criminal.
+
+=== Crossover error rate ===
+The crossover error rate (CER) is the point at which the rates of type I errors and type II errors are equal. A system with a lower CER value provides more accuracy than a system with a higher CER value. With all else being equal, having the rate of type I errors and type II errors being equal (i.e. the CER) will result in the lowest overall error rate.
+
+=== False positive and false negative ===
+
+In terms of false positives and false negatives, a positive result corresponds to rejecting the null hypothesis, while a negative result corresponds to failing to reject the null hypothesis; "false" means the conclusion drawn is incorrect.
Thus, a type I error is equivalent to a false positive, and a type II error is equivalent to a false negative. + +=== Table of error types === +Tabulated relations between truth/falseness of the null hypothesis and outcomes of the test: + +== Error rate == + +A perfect test would have zero false positives and zero false negatives. However, statistical methods are probabilistic, and it cannot be known for certain whether statistical conclusions are correct. Whenever there is uncertainty, there is the possibility of making an error. Considering this, all statistical hypothesis tests have a probability of making type I and type II errors. + +The type I error rate is the probability of rejecting the null hypothesis given that it is true. The test is designed to keep the type I error rate below a prespecified bound called the significance level, usually denoted by the Greek letter α (alpha) and is also called the alpha level. Usually, the significance level is set to 0.05 (5%), implying that it is acceptable to have a 5% probability of incorrectly rejecting the true null hypothesis. +The rate of the type II error is denoted by the Greek letter β (beta) and related to the power of a test, which equals 1−β. +These two types of error rates are traded off against each other: for any given sample set, the effort to reduce one type of error generally results in increasing the other type of error. 
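The trade-off between the two error rates can be made concrete with a small calculation. This is a sketch under stated assumptions, not part of the article: it uses a one-sided z-test with known standard deviation, and the effect size, standard deviation and sample size are hypothetical values picked for illustration. The function name `type_ii_rate` is likewise made up for this example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def type_ii_rate(alpha, effect, sigma, n):
    """beta for a one-sided z-test of H0: mu = mu0 against H1: mu = mu0 + effect.

    The test rejects H0 when the standardized sample mean exceeds the
    (1 - alpha) quantile of the standard normal distribution.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    shift = effect * sqrt(n) / sigma  # mean of the z statistic under H1
    return STD_NORMAL.cdf(z_crit - shift)


# Hypothetical setting: true effect 0.5, sigma 1.0, sample size 25.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_rate(alpha, effect=0.5, sigma=1.0, n=25)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

With the sample held fixed, each tightening of α in the loop raises β. Increasing `n` in the call lowers β at the same α.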
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-1.md b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-1.md new file mode 100644 index 000000000..3baa5fd7c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-1.md @@ -0,0 +1,243 @@ +--- +title: "Type I and type II errors" +chunk: 2/5 +source: "https://en.wikipedia.org/wiki/Type_I_and_type_II_errors" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:25.345633+00:00" +instance: "kb-cron" +--- + +=== The quality of hypothesis test === +The same idea can be expressed in terms of the rate of correct results and therefore used to minimize error rates and improve the quality of hypothesis test. To reduce the probability of committing a type I error, making the alpha value more stringent is both simple and efficient. For example, setting the alpha value at 0.01, instead of 0.05. To decrease the probability of committing a type II error, which is closely associated with analyses' power, either increasing the test's sample size or relaxing the alpha level, ex. setting the alpha level to 0.1 instead of 0.05, could increase the analyses' power. A test statistic is robust if the type I error rate is controlled. +Varying different threshold (cut-off) values could also be used to make the test either more specific or more sensitive, which in turn elevates the test quality. For example, imagine a medical test, in which an experimenter might measure the concentration of a certain protein in the blood sample. The experimenter could adjust the threshold (black vertical line in the figure) and people would be diagnosed as having diseases if any number is detected above this certain threshold. According to the image, changing the threshold would result in changes in false positives and false negatives, corresponding to movement on the curve. 
+ +== Example == +Since in a real experiment it is impossible to avoid all type I and type II errors, it is important to consider the amount of risk one is willing to take to falsely reject H0 or falsely accept H0. The solution to this question would be to report the p-value or significance level α of the statistic. For example, if the p-value of a test statistic is 0.0596, then, given that H0 is true, there is a 5.96% probability of observing a result at least this extreme, and hence of falsely rejecting H0 at that threshold. Alternatively, if the test is performed at a level α such as 0.05, then we accept a 5% probability of falsely rejecting H0. A significance level α of 0.05 is relatively common, but there is no general rule that fits all scenarios. + +=== Vehicle speed measuring === +The speed limit of a freeway in the United States is 120 kilometers per hour (75 mph). A device is set to measure the speed of passing vehicles. Suppose that the device conducts three measurements of the speed of a passing vehicle, recorded as a random sample X1, X2, X3. The traffic police will or will not fine the drivers depending on the average speed $\bar{X}$. That is to say, the test statistic is +$$T=\frac{X_{1}+X_{2}+X_{3}}{3}=\bar{X}$$ +In addition, we suppose that the measurements X1, X2, X3 are modeled as a normal distribution N(μ,2), where the second parameter is the standard deviation. Then T should follow N(μ,2/$\sqrt{3}$), and the parameter μ represents the true speed of the passing vehicle. In this experiment, the null hypothesis H0 and the alternative hypothesis H1 should be +H0: μ=120 against H1: μ>120. +If we perform the test at level α=0.05, then a critical value c should be calculated by solving +$$P\left(Z\geqslant\frac{c-120}{2/\sqrt{3}}\right)=0.05$$ +according to the change-of-units rule for the normal distribution. Referring to the Z-table, we get +$$\frac{c-120}{2/\sqrt{3}}=1.645\Rightarrow c=121.9$$ +The critical region is therefore T > 121.9. That is to say, if the recorded average speed of a vehicle is greater than the critical value 121.9, the driver will be fined. However, 5% of drivers travelling at exactly 120 would still be falsely fined, since their recorded average speed is greater than 121.9 although the true speed does not exceed 120; this is a type I error. +The type II error corresponds to the case that the true speed of a vehicle is over 120 kilometers per hour but the driver is not fined. For example, if the true speed of a vehicle is μ=125, the probability that the driver is not fined can be calculated as +$$P(T<121.9\mid\mu=125)=P\left(\frac{T-125}{2/\sqrt{3}}<\frac{121.9-125}{2/\sqrt{3}}\right)=\Phi(-2.68)=0.0036$$ +which means that, if the true speed of a vehicle is 125, the driver has a probability of 0.36% of avoiding the fine when the test is performed at level α=0.05, namely when the recorded average speed is lower than 121.9. If the true speed is closer to 121.9 than 125, then the probability of avoiding the fine will be higher. +The tradeoffs between type I error and type II error should also be considered. 
That is, in this case, if the traffic police do not want to falsely fine innocent drivers, the level α can be set to a smaller value, like 0.01. However, if that is the case, more drivers whose true speed is over 120 kilometers per hour, like 125, would be more likely to avoid the fine. + +== Etymology == +In 1928, Jerzy Neyman (1894–1981) and Egon Pearson (1895–1980), both eminent statisticians, discussed the problems associated with "deciding whether or not a particular sample may be judged as likely to have been randomly drawn from a certain population": and, as Florence Nightingale David remarked, "it is necessary to remember the adjective 'random' [in the term 'random sample'] should apply to the method of drawing the sample and not to the sample itself". +They identified "two sources of error", namely: + +In 1930, they elaborated on these two sources of error, remarking that \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-2.md b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-2.md new file mode 100644 index 000000000..703bdd5bf --- /dev/null +++ b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-2.md @@ -0,0 +1,46 @@ +--- +title: "Type I and type II errors" +chunk: 3/5 +source: "https://en.wikipedia.org/wiki/Type_I_and_type_II_errors" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:25.345633+00:00" +instance: "kb-cron" +--- + +in testing hypotheses two considerations must be kept in view, we must be able to reduce the chance of rejecting a true hypothesis to as low a value as desired; the test must be so devised that it will reject the hypothesis tested when it is likely to be false. +In 1933, they observed that these "problems are rarely presented in such a form that we can discriminate with certainty between the true and false hypothesis". 
They also noted that, in deciding whether to fail to reject, or reject a particular hypothesis amongst a "set of alternative hypotheses", H1, H2..., it was easy to make an error, + +[and] these errors will be of two kinds: + +In all of the papers co-written by Neyman and Pearson the expression H0 always signifies "the hypothesis to be tested". +In the same paper they call these two sources of error, errors of type I and errors of type II respectively. + +== Related terms == + +=== Null hypothesis === + +It is standard practice for statisticians to conduct tests in order to determine whether or not a "speculative hypothesis" concerning the observed phenomena of the world (or its inhabitants) can be supported. The results of such testing determine whether a particular set of results agrees reasonably (or does not agree) with the speculated hypothesis. +On the basis that it is always assumed, by statistical convention, that the speculated hypothesis is wrong, and the so-called "null hypothesis" that the observed phenomena simply occur by chance (and that, as a consequence, the speculated agent has no effect) – the test will determine whether this hypothesis is right or wrong. This is why the hypothesis under test is often called the null hypothesis (most likely, coined by Fisher (1935, p. 19)), because it is this hypothesis that is to be either nullified or not nullified by the test. When the null hypothesis is nullified, it is possible to conclude that data support the "alternative hypothesis" (which is the original speculated one). +The consistent application by statisticians of Neyman and Pearson's convention of representing "the hypothesis to be tested" (or "the hypothesis to be nullified") with the expression H0 has led to circumstances where many understand the term "the null hypothesis" as meaning "the nil hypothesis" – a statement that the results in question have arisen through chance. 
This is not necessarily the case – the key restriction, as per Fisher (1966), is that "the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution', of which the test of significance is the solution." As a consequence of this, in experimental science the null hypothesis is generally a statement that a particular treatment has no effect; in observational science, it is that there is no difference between the value of a particular measured variable, and that of an experimental prediction. + +=== Statistical significance === +If the probability of obtaining a result as extreme as the one obtained, supposing that the null hypothesis were true, is lower than a pre-specified cut-off probability (for example, 5%), then the result is said to be statistically significant and the null hypothesis is rejected. +British statistician Sir Ronald Aylmer Fisher (1890–1962) stressed that the null hypothesis + +is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis. + +=== Type S and M errors === +To address issues with null hypothesis testing, Andrew Gelman, John Carlin and others have suggested Type S and Type M errors to add to consideration of significant results. +Type S errors are errors of sign. The Type S error rate corresponds to the probability that if a significant result is obtained, the effect is estimated in the wrong direction of the true effect. This can occur often with low powered testing setups. +Type M errors are errors of magnitude. This is addressed through an "exaggeration factor", an assessment of the expected ratio of the absolute values of the estimate to the true value, conditional on a significant result being obtained. 
This is important because using a significance test to screen results in selection bias, which can lead to drastically overestimating effect sizes. + +== Application domains == + +=== Medicine === +In the practice of medicine, the differences between the applications of screening and testing are considerable. + +==== Medical screening ==== +Screening involves relatively cheap tests that are given to large populations, none of whom manifest any clinical indication of disease (e.g., Pap smears). +Testing involves far more expensive, often invasive, procedures that are given only to those who manifest some clinical indication of disease, and are most often applied to confirm a suspected diagnosis. +For example, most states in the US require newborns to be screened for phenylketonuria and hypothyroidism, among other congenital disorders. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-3.md b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-3.md new file mode 100644 index 000000000..199bd0e50 --- /dev/null +++ b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-3.md @@ -0,0 +1,42 @@ +--- +title: "Type I and type II errors" +chunk: 4/5 +source: "https://en.wikipedia.org/wiki/Type_I_and_type_II_errors" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:25.345633+00:00" +instance: "kb-cron" +--- + +Hypothesis: "The newborns have phenylketonuria and hypothyroidism". +Null hypothesis (H0): "The newborns do not have phenylketonuria and hypothyroidism". +Type I error (false positive): The true fact is that the newborns do not have phenylketonuria and hypothyroidism but we consider they have the disorders according to the data. +Type II error (false negative): The true fact is that the newborns have phenylketonuria and hypothyroidism but we consider they do not have the disorders according to the data. 
+Although they display a high rate of false positives, the screening tests are considered valuable because they greatly increase the likelihood of detecting these disorders at a far earlier stage. +The simple blood tests used to screen possible blood donors for HIV and hepatitis have a significant rate of false positives; however, physicians use much more expensive and far more precise tests to determine whether a person is actually infected with either of these viruses. +Perhaps the most widely discussed false positives in medical screening come from the breast cancer screening procedure mammography. The US rate of false positive mammograms is up to 15%, the highest in the world. One consequence of the high false positive rate in the US is that, in any 10-year period, half of the American women screened receive a false positive mammogram. False positive mammograms are costly, with over $100 million spent annually in the U.S. on follow-up testing and treatment. They also cause women unneeded anxiety. As a result of the high false positive rate in the US, as many as 90–95% of women who get a positive mammogram do not have the condition. The lowest rate in the world is in the Netherlands, 1%. The lowest rates are generally in Northern Europe where mammography films are read twice and a high threshold for additional testing is set (the high threshold decreases the power of the test). +The ideal population screening test would be cheap, easy to administer, and produce zero false negatives, if possible. Such tests usually produce more false positives, which can subsequently be sorted out by more sophisticated (and expensive) testing. + +==== Medical testing ==== +False negatives and false positives are significant issues in medical testing. + +Hypothesis: "The patients have the specific disease". +Null hypothesis (H0): "The patients do not have the specific disease". 
+Type I error (false positive): The true fact is that the patients do not have a specific disease but the physician judges the patient is ill according to the test reports. +Type II error (false negative): The true fact is that the disease is actually present but the test reports provide a falsely reassuring message to patients and physicians that the disease is absent. +False positives can also produce serious and counter-intuitive problems when the condition being searched for is rare, as in screening. If a test has a false positive rate of one in ten thousand, but only one in a million samples (or people) is a true positive, most of the positives detected by that test will be false. The probability that an observed positive result is a false positive may be calculated using Bayes' theorem. +False negatives produce serious and counter-intuitive problems, especially when the condition being searched for is common. If a test with a false negative rate of only 10% is used to test a population with a true occurrence rate of 70%, many of the negatives detected by the test will be false. +This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. A common example is relying on cardiac stress tests to detect coronary atherosclerosis, even though cardiac stress tests are known to only detect limitations of coronary artery blood flow due to advanced stenosis. + +=== Biometrics === +Biometric matching, such as for fingerprint recognition, facial recognition or iris recognition, is susceptible to type I and type II errors. + +Hypothesis: "The input does not identify someone in the searched list of people". +Null hypothesis: "The input does identify someone in the searched list of people". +Type I error (false reject rate): The true fact is that the person is someone in the searched list but the system concludes that the person is not according to the data. 
+Type II error (false match rate): The true fact is that the person is not someone in the searched list but the system concludes that the person is someone whom we are looking for according to the data. +The probability of type I errors is called the "false reject rate" (FRR) or false non-match rate (FNMR), while the probability of type II errors is called the "false accept rate" (FAR) or false match rate (FMR). +If the system is designed to rarely match suspects then the probability of type II errors can be called the "false alarm rate". On the other hand, if the system is used for validation (and acceptance is the norm) then the FAR is a measure of system security, while the FRR measures user inconvenience level. + +=== Law === +In criminal legal proceedings, there is enormous emphasis on ensuring that if any error is committed it will be Type II error (letting an otherwise-culpable criminal defendant go free) rather than type I error (punishing an innocent person for a crime he or she did not commit). This emphasis is the reason for high burdens of proof (guilty beyond a reasonable doubt), careful scrutiny of the prosecution's characterizations of inculpatory evidence or testimony, and skepticism or exclusion as to evidence that may be more prejudicial than probative (the Rule 403 balancing test). 
\ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-4.md b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-4.md new file mode 100644 index 000000000..86fd18dac --- /dev/null +++ b/data/en.wikipedia.org/wiki/Type_I_and_type_II_errors-4.md @@ -0,0 +1,67 @@ +--- +title: "Type I and type II errors" +chunk: 5/5 +source: "https://en.wikipedia.org/wiki/Type_I_and_type_II_errors" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:25.345633+00:00" +instance: "kb-cron" +--- + +Substantial scholarship dating back centuries discusses the grave consequences of miscarriages of justice in criminal proceedings, not only for the defendant, but for the perception of fairness of the entire judicial system and trust from community stakeholders that allegations of criminal activity will be heard seriously but also fairly. English jurist William Blackstone invented Blackstone's ratio of 10:1 to describe the concept that a fair system might let ten guilty defendants walk free rather than let more than one innocent person be jailed. +In recent years, legal scholarship and mainstream jurisprudence has adopted the Type I and Type II error taxonomy to have a more rigorous vocabulary with which to discuss judicial mistakes and wrongful convictions. The U.S. Supreme Court used the Type I vs. Type II taxonomy in its discussion of errors in Ballew v. Georgia and judges and law professors increasingly use this dichotomous designation rather than the more casual "wrongly convicted" or "erroneously acquitted," which were terms used often in earlier scholarship. +Recent scholarship and judicial concern often centers on the size and unanimity of juries as safeguards against Type I error (incorrectly convicting the innocent defendant). +Small juries of less than twelve have been criticized by the U.S. 
Supreme Court for being more likely to produce Type I errors; meanwhile, some judges have also criticized the empaneling of more than the typical twelve jurors as problematic (Judge Anderson of the Wisconsin Court of Appeals wrote a notable dissent on this topic in 1993: "Absent a legislative pronouncement of public policy permitting a defendant to acquiesce to be tried by a jury greater than twelve, it is plain error to permit more than twelve jurors to deliberate."). As to unanimity, on April 20, 2020, the U.S. Supreme Court ruled the Sixth Amendment requires a unanimous jury verdict to convict a defendant of a serious offense, citing concern around Type I error as among the primary reasons for requiring unanimous verdicts in serious criminal matters. + +=== Security screening === + +False positives are routinely found every day in airport security screening systems, which are ultimately visual inspection systems. The installed security alarms are intended to prevent weapons being brought onto aircraft; yet they are often set to such high sensitivity that they alarm many times a day for minor items, such as keys, belt buckles, loose change, mobile phones, and tacks in shoes. + +Hypothesis: "The item is a weapon". +Null hypothesis: "The item is not a weapon". +Type I error (false positive): The true fact is that the item is not a weapon but the system still sounds an alarm. +Type II error (false negative): The true fact is that the item is a weapon but the system keeps silent at this time. +The ratio of false positives (identifying an innocent traveler as a terrorist) to true positives (detecting a would-be terrorist) is, therefore, very high; and because almost every alarm is a false positive, the positive predictive value of these screening tests is very low. +The relative cost of false results determines the likelihood that test creators allow these events to occur. 
As the cost of a false negative in this scenario is extremely high (not detecting a bomb being brought onto a plane could result in hundreds of deaths) whilst the cost of a false positive is relatively low (a reasonably simple further inspection) the most appropriate test is one with a low statistical specificity but high statistical sensitivity (one that allows a high rate of false positives in return for minimal false negatives). + +=== Computers === +The notions of false positives and false negatives have a wide currency in the realm of computers and computer applications, including computer security, spam filtering, malware, optical character recognition, and many others. +For example, in the case of spam filtering: + +Hypothesis: "The message is spam". +Null hypothesis: "The message is not spam". +Type I error (false positive): Spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. +Type II error (false negative): Spam email is not detected as spam, but is classified as non-spam. +While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. A low number of false negatives is an indicator of the efficiency of spam filtering. 
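The two error types in the spam-filtering list above can be counted directly from labeled messages. The following is a minimal sketch, assuming each message has a true label and a filter decision encoded as booleans (`True` meaning spam); the function name `spam_error_rates` and the sample data are hypothetical.

```python
def spam_error_rates(is_spam, flagged):
    """Return (false_positive_rate, false_negative_rate) for a spam filter.

    is_spam[i] is the true label and flagged[i] the filter's decision,
    both booleans with True meaning "spam". With the null hypothesis
    "the message is not spam", a false positive is a type I error and
    a false negative is a type II error.
    """
    pairs = list(zip(is_spam, flagged))
    false_pos = sum(1 for truth, guess in pairs if not truth and guess)
    false_neg = sum(1 for truth, guess in pairs if truth and not guess)
    legit = sum(1 for truth, _ in pairs if not truth)
    spam = len(pairs) - legit
    # Guard against empty classes so the sketch never divides by zero.
    fpr = false_pos / legit if legit else 0.0
    fnr = false_neg / spam if spam else 0.0
    return fpr, fnr


if __name__ == "__main__":
    truth = [True, True, True, False, False, False, False, False]
    guess = [True, True, False, False, False, True, False, False]
    print(spam_error_rates(truth, guess))
```

In this made-up sample, one of five legitimate messages is blocked (type I) and one of three spam messages gets through (type II).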
+ +== See also == + +Binary classification – Dividing things between two categories +Detection theory – Means to measure signal processing ability +Ethics in mathematics – Emerging field of applied ethics +False discovery rate – Statistical method for handling multiple comparisons +False positive paradox – Logic error due to ignoring the base rate +Family-wise error rate – Probability of making type I errors when performing multiple hypotheses tests +Information retrieval performance measures – Finding information for an information need +Lemma (mathematics) – Theorem for proving more complex theorems +Jerzy Neyman – Polish American mathematician +Neyman–Pearson lemma – Theorem about the power of the likelihood ratio test +Null hypothesis – Position that there is no relationship between two phenomena +Probability of a hypothesis for Bayesian inference – Method of statistical inference +Egon Pearson – British statistician (1895–1980) +Precision and recall – Pattern-recognition performance metrics +Prosecutor's fallacy – Logic error due to ignoring the base rate +Prozone phenomenon – Immunologic phenomenon occurring in high antigen or antibody levels +Receiver operating characteristic – Diagnostic plot of binary classifier ability +Sensitivity and specificity – Statistical measure of a binary classification +Statisticians' and engineers' cross-reference of statistical terms – Terms used by electrical engineers in statistical signal processing studies +Testing hypotheses suggested by the data – Problem of circular reasoning in statistics +Type III error – Term in statistical hypothesis testing + +== References == + +== Bibliography == + +== External links == +Bias and Confounding – presentation by Nigel Paneth, Graduate School of Public Health, 
University of Pittsburgh +[2] – Web page by Prof. Ronny Gunnarsson with online calculators to adjust level of significance for multiple testing. \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/U.S._Government_peer_review_policies-0.md b/data/en.wikipedia.org/wiki/U.S._Government_peer_review_policies-0.md new file mode 100644 index 000000000..894752a78 --- /dev/null +++ b/data/en.wikipedia.org/wiki/U.S._Government_peer_review_policies-0.md @@ -0,0 +1,36 @@ +--- +title: "U.S. Government peer review policies" +chunk: 1/1 +source: "https://en.wikipedia.org/wiki/U.S._Government_peer_review_policies" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:53:24.170660+00:00" +instance: "kb-cron" +--- + +Most federal regulatory agencies in the United States government must comply with specific peer review requirements before the agencies publicly disseminate certain scientific information. These requirements were published in a Peer Review Bulletin issued by the White House Office of Management and Budget (OMB), which establishes "government-wide standards concerning when peer review is required and, if required, what type of peer review processes are appropriate." + + +== Bulletin == +OMB's peer review bulletin requires that US federal regulatory agencies submit all "influential scientific information" to peer review before the information is publicly disseminated. The Bulletin defines "scientific information" as: + +factual inputs, data, models, analyses, technical information, or scientific assessments related to such disciplines as the behavioral and social sciences, public health and medical sciences, life and earth sciences, engineering, or physical sciences. +This Bulletin defines "influential scientific information" as + +scientific information the agency reasonably can determine will have or does have a clear and substantial impact on important public policies or private sector decisions. 
In the term 'influential scientific information,' the term 'influential' should be interpreted consistently with OMB's government-wide information quality guidelines and the information quality guidelines of the agency. +As noted in the preceding quotation, the Bulletin must be read in conjunction with "OMB's government-wide information quality guidelines and the information quality guidelines of the agency." These guidelines govern the quality of all information disseminated by most US government regulatory agencies. These guidelines are required by a US statute enacted in 2001 called the Data Quality Act and also known as the Information Quality Act ("IQA"). OMB states that it prepared the peer review Bulletin pursuant to OMB's authority under the IQA. +The peer review Bulletin provides detailed guidelines for peer review of influential scientific information. The Bulletin applies more stringent peer review requirements to "highly influential scientific assessments," + +which are a subset of influential scientific information. A scientific assessment is an evaluation of a body of scientific or technical knowledge that typically synthesizes multiple factual inputs, data, models, assumptions, and/or applies best professional judgment to bridge uncertainties in the available information. + + +== Comparison == +The peer review Bulletin's specific guidelines differ in several respects from traditional peer review practices at most journals. For example, the Bulletin requires public disclosure of peer reviewers' identities when they are reviewing highly influential scientific assessments. 
The Bulletin's summary of some of these requirements is set forth below: + +In general, an agency conducting a peer review of a highly influential scientific assessment must ensure that the peer review process is transparent by making available to the public the written charge to the peer reviewers, the peer reviewers' names, the peer reviewers' report(s), and the agency's response to the peer reviewers' report(s). ... This Bulletin requires agencies to adopt or adapt the committee selection policies employed by the National Academy of Sciences (NAS). +The peer review Bulletin specifically addresses the effect of publication in a refereed scientific journal as well as the variations and limitations of peer review: + +Publication in a refereed scientific journal may mean that adequate peer review has been performed. However, the intensity of peer review is highly variable across journals. There will be cases in which an agency determines that a more rigorous or transparent review process is necessary. For instance, an agency may determine a particular journal review process did not address questions (e.g., the extent of uncertainty inherent in a finding) that the agency determines should be addressed before disseminating that information. As such, prior peer review and publication is not by itself sufficient grounds for determining that no further review is necessary. 
+ + +== References == \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/United_States_Office_of_Research_Integrity-0.md b/data/en.wikipedia.org/wiki/United_States_Office_of_Research_Integrity-0.md index 3bc560f43..cb781c52c 100644 --- a/data/en.wikipedia.org/wiki/United_States_Office_of_Research_Integrity-0.md +++ b/data/en.wikipedia.org/wiki/United_States_Office_of_Research_Integrity-0.md @@ -4,7 +4,7 @@ chunk: 1/1 source: "https://en.wikipedia.org/wiki/United_States_Office_of_Research_Integrity" category: "reference" tags: "science, encyclopedia" -date_saved: "2026-05-05T07:02:55.217424+00:00" +date_saved: "2026-05-05T09:53:25.437954+00:00" instance: "kb-cron" --- diff --git a/data/en.wikipedia.org/wiki/Up-and-down_design-0.md b/data/en.wikipedia.org/wiki/Up-and-down_design-0.md new file mode 100644 index 000000000..66860127b --- /dev/null +++ b/data/en.wikipedia.org/wiki/Up-and-down_design-0.md @@ -0,0 +1,364 @@ +--- +title: "Up-and-down design" +chunk: 1/4 +source: "https://en.wikipedia.org/wiki/Up-and-down_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:26.719636+00:00" +instance: "kb-cron" +--- + +Up-and-down designs (UDDs) are a family of statistical experiment designs used in dose-finding experiments in science, engineering, and medical research. Dose-finding experiments have binary responses: each individual outcome can be described as one of two possible values, such as success vs. failure or toxic vs. non-toxic. Mathematically the binary responses are coded as 1 and 0. The goal of dose-finding experiments is to estimate the strength of treatment (i.e., the "dose") that would trigger the "1" response a pre-specified proportion of the time. This dose can be envisioned as a percentile of the distribution of response thresholds. An example where dose-finding is used is in an experiment to estimate the LD50 of some toxic chemical with respect to mice. 
+Dose-finding designs are sequential and response-adaptive: the dose at a given point in the experiment depends upon previous outcomes, rather than being fixed a priori. Dose-finding designs are generally more efficient for this task than fixed designs, but their properties are harder to analyze, and some require specialized design software. UDDs use a discrete set of doses rather than varying the dose continuously. They are relatively simple to implement, and are also among the best understood dose-finding designs. Despite this simplicity, UDDs generate random walks with intricate properties. The original UDD aimed to find the median threshold by increasing the dose one level after a "0" response, and decreasing it one level after a "1" response. Hence the name "up-and-down". Other UDDs break this symmetry in order to estimate percentiles other than the median, or are able to treat groups of subjects rather than one at a time. +UDDs were developed in the 1940s by several research groups independently. The 1950s and 1960s saw rapid diversification with UDDs targeting percentiles other than the median, and expanding into numerous applied fields. The 1970s to early 1990s saw little UDD methods research, even as the design continued to be used extensively. A revival of UDD research since the 1990s has provided deeper understanding of UDDs and their properties, and new and better estimation methods. +UDDs are still used extensively in the two applications for which they were originally developed: psychophysics, where they are used to estimate sensory thresholds and are often known as fixed forced-choice staircase procedures, and explosive sensitivity testing, where the median-targeting UDD is often known as the Bruceton test. UDDs are also very popular in toxicity and anesthesiology research. They are also considered a viable choice for Phase I clinical trials. 
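The original median-targeting rule described above (one level up after a "0", one level down after a "1") can be simulated in a few lines. The following is a minimal sketch: the function name `simulate_classical_udd`, the response probabilities per dose level, and the choice to stay at the lowest or highest level when the rule would step outside the dose set are all assumptions made for illustration.

```python
import random


def simulate_classical_udd(n, response_probs, start=0, seed=0):
    """Simulate a classical up-and-down walk over dose-level indices.

    response_probs[k] is the assumed probability of a "1" response at
    dose level k (hypothetical values supplied by the caller).
    Returns the visited levels and the observed binary responses.
    """
    rng = random.Random(seed)
    top = len(response_probs) - 1
    level = start
    path, responses = [], []
    for _ in range(n):
        path.append(level)
        y = 1 if rng.random() < response_probs[level] else 0
        responses.append(y)
        # Up one level after a "0", down one level after a "1";
        # staying put at the boundaries is an assumption of this sketch.
        level = min(level + 1, top) if y == 0 else max(level - 1, 0)
    return path, responses


if __name__ == "__main__":
    probs = [0.1, 0.3, 0.5, 0.7, 0.9]  # median threshold at level index 2
    path, responses = simulate_classical_udd(30, probs)
    print(path)
```

With response probabilities that cross 0.5 inside the dose set, the walk tends to hover around the level where "0" and "1" responses are equally likely, which is why the classical rule targets the median.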
+ +== Mathematical description == + +=== Definition === +Let $n$ be the sample size of a UDD experiment, and assume for now that subjects are treated one at a time. Then the doses these subjects receive, denoted as random variables $X_{1},\ldots ,X_{n}$, are chosen from a discrete, finite set of $M$ increasing dose levels $\mathcal{X}=\left\{d_{1},\ldots ,d_{M}:\ d_{1}<\cdots <d_{M}\right\}$. +[…] $F^{*}>0.5$ and one for $F^{*}<0.5$, whose rules are shown below: \ No newline at end of file diff --git a/data/en.wikipedia.org/wiki/Up-and-down_design-2.md b/data/en.wikipedia.org/wiki/Up-and-down_design-2.md new file mode 100644 index 000000000..6c33b959c --- /dev/null +++ b/data/en.wikipedia.org/wiki/Up-and-down_design-2.md @@ -0,0 +1,1194 @@ +--- +title: "Up-and-down design" +chunk: 3/4 +source: "https://en.wikipedia.org/wiki/Up-and-down_design" +category: "reference" +tags: "science, encyclopedia" +date_saved: "2026-05-05T09:52:26.719636+00:00" +instance: "kb-cron" +--- +
+$$X_{i+1}={\begin{cases}d_{m+1}&{\textrm {if}}\ \ Y_{i}=0\ \ \&\ \ {\textrm {'heads'}};\\d_{m-1}&{\textrm {if}}\ \ Y_{i}=1;\\d_{m}&{\textrm {if}}\ \ Y_{i}=0\ \ \&\ \ {\textrm {'tails'}}.\end{cases}}$$ + +The heads probability $b$ can take any value in $[0,1]$. The balance point is + +$${\begin{array}{rcl}b\left(1-F^{*}\right)&=&F^{*}\\F^{*}&=&{\frac {b}{1+b}}\in [0,0.5].\end{array}}$$ + +The BCD balance point can be made identical to a target rate $F^{-1}(\Gamma )$ by setting the heads probability to $b=\Gamma /(1-\Gamma )$. For example, for $\Gamma =0.3$ set $b=3/7$. Setting $b=1$ makes this design identical to the classical UDD, and inverting the rules, by imposing the coin toss upon positive rather than negative outcomes, produces above-median balance points. Versions with two coins, one for each outcome, have also been published, but they do not seem to offer an advantage over the simpler single-coin BCD. + +=== Group (cohort) UDDs === +Some dose-finding experiments, such as phase I trials, require a waiting period of weeks before determining each individual outcome. It may then be preferable to treat several subjects at once or in rapid succession. With group UDDs, the transition rules apply to cohorts of fixed size $s$ rather than to individuals. 
+$X_{i}$ becomes the dose given to cohort $i$, and $Y_{i}$ is the number of positive responses in the $i$-th cohort, rather than a binary outcome. Given that the $i$-th cohort is treated at $X_{i}=d_{m}$ on the interior of $\mathcal{X}$, the $(i+1)$-th cohort is assigned to + +$$X_{i+1}={\begin{cases}d_{m+1}&{\textrm {if}}\ \ Y_{i}\leq l;\\d_{m-1}&{\textrm {if}}\ \ Y_{i}\geq u;\\d_{m}&{\textrm {if}}\ \ l<Y_{i}<u.\end{cases}}$$