diff --git a/_index.db b/_index.db
index 639ea03d7..8c5ba85f0 100644
Binary files a/_index.db and b/_index.db differ
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-0.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-0.md
new file mode 100644
index 000000000..135b540cd
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-0.md
@@ -0,0 +1,29 @@
+---
+title: "Sampling (statistics)"
+chunk: 1/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+In statistics, quality assurance, and survey methodology, sampling is the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. The subset, called a statistical sample (or sample, for short), is meant to reflect the whole population, and statisticians attempt to collect samples that are representative of the population. 
+Sampling costs less and collects data faster than a census, which records data from the entire population; in many cases, measuring the whole population is impossible, as with the sizes of all stars in the universe. Sampling can thus provide insights in cases where it is infeasible to measure an entire population.
+Each observation measures one or more properties (such as weight, location, colour or mass) of independent objects or individuals. In survey sampling, weights can be applied to the data to adjust for the sample design, particularly in stratified sampling. Results from probability theory and statistical theory are employed to guide the practice. In business and medical research, sampling is widely used for gathering information about a population. Acceptance sampling is used to determine if a production lot of material meets the governing specifications.
+
+== History ==
+Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786, Pierre Simon Laplace estimated the population of France by using a sample, along with a ratio estimator. He also computed probabilistic estimates of the error. These were not expressed as modern confidence intervals but as the sample size that would be needed to achieve a particular upper bound on the sampling error with probability 1000/1001. His estimates used Bayes' theorem with a uniform prior probability and assumed that his sample was random. Alexander Ivanovich Chuprov introduced sample surveys to Imperial Russia in the 1870s.
+In the US, the 1936 Literary Digest prediction of a Republican win in the presidential election went badly awry, due to severe bias [1]. More than two million people responded to the study with their names obtained through magazine subscription lists and telephone directories. It was not appreciated that these lists were heavily biased towards Republicans and the resulting sample, though very large, was deeply flawed.
+Elections in Singapore have used sampling, in the form of "sample counts", since the 2015 election. According to the Elections Department (ELD), the country's election commission, sample counts help reduce speculation and misinformation and help election officials check against the election result for each electoral division. While the reported sample counts yield a fairly accurate indicative result, with a 4% margin of error at a 95% confidence level, ELD reminded the public that sample counts are separate from official results, and only the returning officer will declare the official results once vote counting is complete.
+
+== Population definition ==
+Successful statistical practice is based on focused problem definition. In sampling, this includes defining the "population" from which our sample is drawn. A population can be defined as including all people or items with the characteristics one wishes to understand. Because there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population.
+Sometimes what defines a population is obvious. For example, a manufacturer needs to decide whether a batch of material from production is of high enough quality to be released to the customer or should be scrapped or reworked due to poor quality. In this case, the batch is the population.
+Although the population of interest often consists of physical objects, sometimes it is necessary to sample over time, space, or some combination of these dimensions. For instance, an investigation of supermarket staffing could examine checkout line length at various times, or a study on endangered penguins might aim to understand their usage of various hunting grounds over time. For the time dimension, the focus may be on periods or discrete occasions.
+In other cases, the examined 'population' may be even less tangible. For example, Joseph Jagger studied the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel (i.e. the probability distribution of its results over infinitely many trials), while his 'sample' was formed from observed results from that wheel. Similar considerations arise when taking repeated measurements of properties of materials such as the electrical conductivity of copper.
+This situation often arises when seeking knowledge about the cause system of which the observed population is an outcome. In such cases, sampling theory may treat the observed population as a sample from a larger 'superpopulation'. For example, a researcher might study the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict the effects of the program if it were made available nationwide. Here the superpopulation is "everybody in the country, given access to this treatment" – a group that does not yet exist since the program is not yet available to all.
+The population from which the sample is drawn may not be the same as the population from which information is desired. Often there is a large but not complete overlap between these two groups due to frame issues etc. (see below). Sometimes they may be entirely separate – for instance, one might study rats in order to get a better understanding of human health, or one might study records from people born in 2008 in order to make predictions about people born in 2009.
+Time spent in making the sampled population and population of concern precise is often well spent because it raises many issues, ambiguities, and questions that would otherwise have been overlooked at this stage.
+
+== Sampling frame ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-1.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-1.md
new file mode 100644
index 000000000..33b1b0bb8
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-1.md
@@ -0,0 +1,41 @@
+---
+title: "Sampling (statistics)"
+chunk: 2/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+In the most straightforward case, such as the sampling of a batch of material from production (acceptance sampling by lots), it would be most desirable to identify and measure every single item in the population and to include any one of them in our sample. However, in the more general case this is not usually possible or practical. There is no way to identify all rats in the set of all rats. Where voting is not compulsory, there is no way to identify in advance which people will vote at a forthcoming election. These imprecise populations are not amenable to sampling in any of the ways described below, to which statistical theory can be applied.
+As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any in our sample. The most straightforward type of frame is a list of elements of the population (preferably the entire population) with appropriate contact information. For example, in an opinion poll, possible sampling frames include an electoral register and a telephone directory.
+A probability sample is a sample in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection.
+
+Example: We want to estimate the total income of adults living in a given street. We visit each household in that street, identify all adults living there, and randomly select one adult from each household. (For example, we can allocate each person a random number, generated from a uniform distribution between 0 and 1, and select the person with the highest number in each household). We then interview the selected person and find their income.
+People living on their own are certain to be selected, so we simply add their income to our estimate of the total. But a person living in a household of two adults has only a one-in-two chance of selection. To reflect this, when we come to such a household, we would count the selected person's income twice towards the total. (The person who is selected from that household can be loosely viewed as also representing the person who isn't selected.)
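+
+The weighting in this example can be written as a worked formula. This is a sketch in the notation of the Horvitz-Thompson estimator, which the text above does not name; the symbols $y_i$, $\pi_i$, $m_h$, and $s$ are introduced here for illustration. If adult $i$ in the sample $s$ has income $y_i$ and selection probability $\pi_i$, the estimated street total is
+
+$$
+\hat{T} = \sum_{i \in s} \frac{y_i}{\pi_i}, \qquad \pi_i = \frac{1}{m_h},
+$$
+
+where $m_h$ is the number of adults in the household of person $i$. A selected person from a two-adult household has $\pi_i = 1/2$ and so contributes $2 y_i$, matching the "count twice" rule, while a person living alone has $\pi_i = 1$ and contributes $y_i$.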
+
+In the above example, not everybody has the same probability of selection; what makes it a probability sample is the fact that each person's probability is known. When every element in the population does have the same probability of selection, this is known as an 'equal probability of selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all sampled units are given the same weight.
+Probability sampling includes: simple random sampling, systematic sampling, stratified sampling, probability-proportional-to-size sampling, and cluster or multistage sampling. These various ways of probability sampling have two things in common:
+
+Every element has a known nonzero probability of being sampled, and
+the procedure involves random selection at some point.
+
+=== Nonprobability sampling ===
+
+Nonprobability sampling is any sampling method where some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'/'undercovered'), or where the probability of selection cannot be accurately determined. It involves the selection of elements based on assumptions regarding the population of interest, which forms the criteria for selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing limits on how much information a sample can provide about the population. Information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population.
+
+Example: We visit every household in a given street, and interview the first person to answer the door. In any household with more than one occupant, this is a nonprobability sample, because some people are more likely to answer the door (e.g., an unemployed person who spends most of their time at home is more likely to answer than an employed housemate who might be at work when the interviewer calls) and it's not practical to calculate these probabilities.
+
+Nonprobability sampling methods include convenience sampling, quota sampling, snowball sampling, and purposive sampling. In addition, nonresponse effects may turn any probability design into a nonprobability design if the characteristics of nonresponse are not well understood, since nonresponse effectively modifies each element's probability of being sampled.
+
+== Sampling methods ==
+Within any of the types of frames identified above, a variety of sampling methods can be employed individually or in combination. Factors commonly influencing the choice between these designs include:
+
+Nature and quality of the frame
+Availability of auxiliary information about units on the frame
+Accuracy requirements, and the need to measure accuracy
+Whether detailed analysis of the sample is expected
+Cost/operational concerns
+
+=== Simple random sampling ===
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-2.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-2.md
new file mode 100644
index 000000000..63889d083
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-2.md
@@ -0,0 +1,26 @@
+---
+title: "Sampling (statistics)"
+chunk: 3/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+In a simple random sample (SRS) of a given size, all subsets of the sampling frame of that size have an equal probability of being selected. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimizes bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results.
+Simple random sampling can be vulnerable to sampling error because the randomness of the selection may result in a sample that does not reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques attempt to overcome this problem by "using information about the population" to choose a more "representative" sample.
+Also, simple random sampling can be cumbersome and tedious when sampling from a large target population. In some cases, investigators are interested in research questions specific to subgroups of the population. For example, researchers might be interested in examining whether cognitive ability as a predictor of job performance is equally applicable across racial groups. Simple random sampling cannot accommodate the needs of researchers in this situation, because it does not provide subsamples of the population, and other sampling strategies, such as stratified sampling, can be used instead.
+
+=== Systematic sampling ===
+
+Systematic sampling (also known as interval sampling) relies on arranging the study population according to some ordering scheme, and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. In this case, k=(population size/sample size). It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list. A simple example would be to select every 10th name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').
+As long as the starting point is randomized, systematic sampling is a type of probability sampling. It is easy to implement and the stratification induced can make it efficient, if the variable by which the list is ordered is correlated with the variable of interest. 'Every 10th' sampling is especially useful for efficient sampling from databases.
+For example, suppose we wish to sample people from a long street that starts in a poor area (house No. 1) and ends in an expensive district (house No. 1000). A simple random selection of addresses from this street could easily end up with too many from the high end and too few from the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th street number along the street ensures that the sample is spread evenly along the length of the street, representing all of these districts. (If we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.)
+However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of the interval used, the sample is especially likely to be unrepresentative of the overall population, making the scheme less accurate than simple random sampling.
+For example, consider a street where the odd-numbered houses are all on the north (expensive) side of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling scheme given above, it is impossible to get a representative sample; either the houses sampled will all be from the odd-numbered, expensive side, or they will all be from the even-numbered, cheap side, unless the researcher has previous knowledge of this bias and avoids it by using a skip which ensures jumping between the two sides (any odd-numbered skip).
+Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of systematic sampling that are given above, much of the potential sampling error is due to variation between neighbouring houses – but because this method never selects two neighbouring houses, the sample will not give us any information on that variation.)
+As described above, systematic sampling is an EPS method, because all elements have the same probability of selection (in the example given, one in ten). It is not 'simple random sampling' because different subsets of the same size have different selection probabilities – e.g., the set {4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero probability of selection.
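+
+The selection rule and the EPS property can be stated compactly. The following is a sketch, assuming the population size $N$ is an exact multiple of the sample size $n$; the symbols $N$, $n$, $r$, and $S_r$ are notation introduced here rather than taken from the text:
+
+$$
+k = \frac{N}{n}, \qquad r \sim \mathrm{Uniform}\{1, \dots, k\}, \qquad S_r = \{\, r,\; r + k,\; r + 2k,\; \dots,\; r + (n-1)k \,\}.
+$$
+
+Each element belongs to exactly one of the $k$ possible samples $S_1, \dots, S_k$, so its selection probability is $1/k$. For the street of 1000 houses with a skip of 10, $k = 10$ and $n = 100$: a start of $r = 4$ gives $\{4, 14, 24, \dots, 994\}$, and only these 10 subsets can ever be drawn, which is why the design is not a simple random sample.
+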
+Systematic sampling can also be adapted to a non-EPS approach; for an example, see discussion of PPS samples below.
+
+=== Stratified sampling ===
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-3.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-3.md
new file mode 100644
index 000000000..1b6f3a141
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-3.md
@@ -0,0 +1,42 @@
+---
+title: "Sampling (statistics)"
+chunk: 4/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+When the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. The ratio of the size of this random selection (or sample) to the size of the population is called a sampling fraction. There are several potential benefits to stratified sampling.
+First, dividing the population into distinct, independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample.
+Second, utilizing a stratified sampling method can lead to more efficient statistical estimates (provided that strata are selected based upon relevance to the criterion in question, instead of availability of the samples). Even if a stratified sampling approach does not lead to increased statistical efficiency, such a tactic will not result in less efficiency than would simple random sampling, provided that the sample drawn from each stratum is proportional to that stratum's size in the population.
+Third, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata).
+Finally, since each stratum is treated as an independent population, different sampling approaches can be applied to different strata, potentially enabling researchers to use the approach best suited (or most cost-effective) for each identified subgroup within the population.
+There are, however, some potential drawbacks to using stratified sampling. First, identifying strata and implementing such an approach can increase the cost and complexity of sample selection, as well as leading to increased complexity of population estimates. Second, when examining multiple criteria, stratifying variables may be related to some, but not to others, further complicating the design, and potentially reducing the utility of the strata. Finally, in some cases (such as designs with a large number of strata, or those with a specified minimum sample size per group), stratified sampling can potentially require a larger sample than would other methods (although in most cases, the required sample size would be no larger than would be required for simple random sampling).
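+
+The efficiency claims above can be made concrete with the standard stratified estimator of a population mean. This is a sketch under the assumption of simple random sampling without replacement within each stratum; the symbols $N_h$, $n_h$, $W_h$, $\bar{y}_h$, and $S_h^2$ are notation introduced here:
+
+$$
+\bar{y}_{st} = \sum_{h=1}^{H} W_h \, \bar{y}_h, \qquad W_h = \frac{N_h}{N}, \qquad \operatorname{Var}(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}.
+$$
+
+Here $N_h$ and $n_h$ are the population and sample sizes in stratum $h$, $\bar{y}_h$ is the stratum sample mean, and $S_h^2$ is the within-stratum variance. Under proportional allocation, $n_h = n N_h / N$, every stratum has the same sampling fraction $n_h / N_h = n / N$. Only within-stratum variance enters the formula, which is why homogeneous strata give more precise estimates.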
+
+A stratified sampling approach is most effective when three conditions are met:
+
+Variability within strata is minimized.
+Variability between strata is maximized.
+The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
+Advantages over other sampling methods:
+Focuses on important subpopulations and ignores irrelevant ones.
+Allows use of different sampling techniques for different subpopulations.
+Improves the accuracy/efficiency of estimation.
+Permits greater balancing of statistical power of tests of differences between strata by sampling equal numbers from strata varying widely in size.
+Disadvantages:
+Requires selection of relevant stratification variables, which can be difficult.
+Is not useful when there are no homogeneous subgroups.
+Can be expensive to implement.
+Poststratification
+Stratification is sometimes introduced after the sampling phase in a process called "poststratification". This approach is typically implemented due to a lack of prior knowledge of an appropriate stratifying variable or when the experimenter lacks the necessary information to create a stratifying variable during the sampling phase. Although the method is susceptible to the pitfalls of post hoc approaches, it can provide several benefits in the right situation. Implementation usually follows a simple random sample. In addition to allowing for stratification on an ancillary variable, poststratification can be used to implement weighting, which can improve the precision of a sample's estimates.
+
+Oversampling
+Choice-based sampling or oversampling is one of the stratified sampling strategies. In choice-based sampling, the data are stratified on the target and a sample is taken from each stratum so that rarer target classes will be more represented in the sample. The model is then built on this biased sample. The effects of the input variables on the target are often estimated with more precision with the choice-based sample even when a smaller overall sample size is taken, compared to a random sample. The results usually must be adjusted to correct for the oversampling.
+
+=== Probability-proportional-to-size sampling ===
+
+In some cases the sample designer has access to an "auxiliary variable" or "size measure", believed to be correlated to the variable of interest, for each element in the population. These data can be used to improve accuracy in sample design. One option is to use the auxiliary variable as a basis for stratification, as discussed above.
+Another option is probability proportional to size ('PPS') sampling, in which the selection probability for each element is set to be proportional to its size measure, up to a maximum of 1. In a simple PPS design, these selection probabilities can then be used as the basis for Poisson sampling. However, this has the drawback of variable sample size, and different portions of the population may still be over- or under-represented due to chance variation in selections.
+Systematic sampling theory can be used to create a probability proportionate to size sample. This is done by treating each count within the size variable as a single sampling unit. Samples are then identified by selecting at even intervals among these counts within the size variable. This method is sometimes called PPS-sequential or monetary unit sampling in the case of audits or forensic sampling.
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-4.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-4.md
new file mode 100644
index 000000000..54c78c0af
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-4.md
@@ -0,0 +1,32 @@
+---
+title: "Sampling (statistics)"
+chunk: 5/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+Example: Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490 students respectively (total 1500 students), and we want to use student population as the basis for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150, the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3) and count through the school populations by multiples of 500. If our random start was 137, we would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first, fourth, and sixth schools.
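+
+The arithmetic of this example can be laid out explicitly. The table below is a sketch that only restates the numbers given above; the symbols $x_i$ (school size), $n$ (sample size), and $\pi_i$ (selection probability) are notation introduced here:
+
+$$
+\begin{array}{c|c|c|c}
+\text{School} & x_i & \text{Allocated numbers} & \pi_i = n x_i / \sum_j x_j \\ \hline
+1 & 150 & 1\text{--}150 & 0.30 \\
+2 & 180 & 151\text{--}330 & 0.36 \\
+3 & 200 & 331\text{--}530 & 0.40 \\
+4 & 220 & 531\text{--}750 & 0.44 \\
+5 & 260 & 751\text{--}1010 & 0.52 \\
+6 & 490 & 1011\text{--}1500 & 0.98
+\end{array}
+$$
+
+With $n = 3$ the interval is $1500 / 3 = 500$, so a start of 137 yields the selection points 137, 637, and 1137, which fall in the ranges of schools 1, 4, and 6. The selection probabilities sum to $n = 3$, and each is below 1 because no school is larger than the interval.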
+
+The PPS approach can improve accuracy for a given sample size by concentrating the sample on large elements that have the greatest impact on population estimates. PPS sampling is commonly used for surveys of businesses, where element size varies greatly and auxiliary information is often available – for instance, a survey attempting to measure the number of guest-nights spent in hotels might use each hotel's number of rooms as an auxiliary variable. In some cases, an older measurement of the variable of interest can be used as an auxiliary variable when attempting to produce more current estimates.
+
+=== Cluster sampling ===
+
+Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often clustered by geography or by time periods (nearly all samples are in some sense 'clustered' in time – although this is rarely taken into account in the analysis). For instance, if surveying households within a city, we might choose to select 100 city blocks and then interview every household within the selected blocks.
+Clustering can reduce travel and administrative costs. In the example above, an interviewer can make a single trip to visit several households in one block, rather than having to drive to a different block for each household.
+It also means that one does not need a sampling frame listing all elements in the target population. Instead, clusters can be chosen from a cluster-level frame, with an element-level frame created only for the selected clusters. In the example above, the sample only requires a block-level city map for initial selections, and then a household-level map of the 100 selected blocks, rather than a household-level map of the whole city.
+Cluster sampling (also known as clustered sampling) generally increases the variability of sample estimates above that of simple random sampling, depending on how the clusters differ between one another as compared to the within-cluster variation. For this reason, cluster sampling requires a larger sample than SRS to achieve the same level of accuracy – but cost savings from clustering might still make this a cheaper option.
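+
+The loss of precision is often summarized by the design effect. The approximation below is a standard textbook sketch, not a formula from the text above, and it assumes clusters of equal size $m$; $\rho$ denotes the intraclass correlation, i.e. how alike elements within the same cluster are:
+
+$$
+\mathrm{deff} = \frac{\operatorname{Var}_{\text{cluster}}(\bar{y})}{\operatorname{Var}_{\text{SRS}}(\bar{y})} \approx 1 + (m - 1)\rho, \qquad n_{\text{cluster}} \approx \mathrm{deff} \times n_{\text{SRS}}.
+$$
+
+For instance, with hypothetical values of $m = 20$ households per block and $\rho = 0.05$, $\mathrm{deff} \approx 1.95$, so roughly twice as many households would be needed as under SRS for the same accuracy.
+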
+Cluster sampling is commonly implemented as multistage sampling. This is a complex form of cluster sampling in which two or more levels of units are embedded one in the other. The first stage consists of constructing the clusters that will be used to sample from. In the second stage, a sample of primary units is randomly selected from each cluster (rather than using all units contained in all selected clusters). In following stages, in each of those selected clusters, additional samples of units are selected, and so on. All ultimate units (individuals, for instance) selected at the last step of this procedure are then surveyed. This technique, thus, is essentially the process of taking random subsamples of preceding random samples.
+Multistage sampling can substantially reduce sampling costs in cases where a complete population list would otherwise need to be constructed before other sampling methods could be applied. By eliminating the work involved in describing clusters that are not selected, multistage sampling can reduce the large costs associated with traditional cluster sampling. However, each sample may not be fully representative of the whole population.
+
+=== Quota sampling ===
+
+In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgement is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60.
+It is this second step which makes the technique one of nonprobability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and the merits of quota versus probability sampling have been a matter of controversy for several years.
+
+=== Minimax sampling ===
+In imbalanced datasets, where the sampling ratio does not follow the population statistics, one can resample the dataset in a conservative manner called minimax sampling. Minimax sampling has its origin in the Anderson minimax ratio, whose value is proved to be 0.5: in a binary classification, the class-sample sizes should be chosen equally. This ratio can be proved to be the minimax ratio only under the assumption of an LDA classifier with Gaussian distributions. The notion of minimax sampling was recently developed for a general class of classification rules, called class-wise smart classifiers. In this case, the sampling ratio of classes is selected so that the worst-case classifier error, over all possible population statistics for class prior probabilities, is as small as possible.
+
+=== Accidental sampling ===
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-5.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-5.md
new file mode 100644
index 000000000..111ed4d0a
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-5.md
@@ -0,0 +1,50 @@
+---
+title: "Sampling (statistics)"
+chunk: 6/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+Accidental sampling (sometimes known as grab, convenience, or opportunity sampling) is a type of nonprobability sampling in which the sample is drawn from the part of the population that is close to hand. That is, the sample is selected because it is readily available and convenient. People may be included when the researcher happens to meet them, or they may be found through technological means such as the internet or telephone. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough. For example, if the interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people available for interview would be limited to those present at that time, and their views would not represent those of other members of society in the area, as they might if the survey were conducted at different times of day and several times per week. This type of sampling is most useful for pilot testing. Several important considerations for researchers using convenience samples include:
+
+Are there controls within the research design or experiment which can serve to lessen the impact of a non-random convenience sample, thereby ensuring the results will be more representative of the population?
+Is there good reason to believe that a particular convenience sample would or should respond or behave differently than a random sample from the same population?
+Is the question being asked by the research one that can adequately be answered using a convenience sample?
+In social science research, snowball sampling is a similar technique, where existing study subjects are used to recruit more subjects into the sample. Some variants of snowball sampling, such as respondent-driven sampling, allow calculation of selection probabilities and are probability sampling methods under certain conditions.
+
+=== Voluntary sampling ===
+
+The voluntary sampling method is a type of nonprobability sampling. Volunteers choose to complete a survey.
+Volunteers may be invited through advertisements in social media. The target population for advertisements can be selected by characteristics like location, age, sex, income, occupation, education, or interests using tools provided by the social medium. The advertisement may include a message about the research and link to a survey. After following the link and completing the survey, the volunteer submits the data to be included in the sample population. This method can reach a global population but is limited by the campaign budget. Volunteers outside the invited population may also be included in the sample.
+It is difficult to make generalizations from this sample because it may not represent the total population. Often, volunteers have a strong interest in the main topic of the survey.
+
+=== Line-intercept sampling ===
+
+Line-intercept sampling is a method of sampling elements in a region whereby an element is sampled if a chosen line segment, called a "transect", intersects the element.
+
+=== Panel sampling ===
+Panel sampling is the method of first selecting a group of participants through a random sampling method and then asking that group for (potentially the same) information several times over a period of time. Each participant is therefore interviewed at two or more time points; each period of data collection is called a "wave". The method was developed by sociologist Paul Lazarsfeld in 1938 as a means of studying political campaigns. This longitudinal sampling method allows estimates of changes in the population, for example with regard to anything from chronic illness to job stress to weekly food expenditures. Panel sampling can also be used to inform researchers about within-person health changes due to age or to help explain changes in continuous dependent variables such as spousal interaction. Several methods of analyzing panel data have been proposed, including MANOVA, growth curves, and structural equation modeling with lagged effects.
+
+=== Snowball sampling ===
+
+Snowball sampling involves finding a small group of initial respondents and using them to recruit more respondents. It is particularly useful in cases where the population is hidden or difficult to enumerate.
+
+=== Theoretical sampling ===
+
+Theoretical sampling occurs when samples are selected on the basis of the results of the data collected so far, with the goal of developing a deeper understanding of the area or developing theories. An initial, general sample is first collected to investigate general trends; in further sampling, extreme or very specific cases might be selected in order to maximize the likelihood that a phenomenon will actually be observable.
+
+=== Active sampling ===
+In active sampling, the samples used for training a machine learning algorithm are actively selected; compare active learning (machine learning).
+
+=== Judgmental selection ===
+Judgement sampling, also known as expert or purposive sampling, is a type of non-random sampling where samples are selected based on the opinion of an expert, who can select participants based on how valuable the information they provide is.
+
+=== Haphazard sampling ===
+
+Haphazard sampling refers to the idea of using human judgement to simulate randomness. Although the samples are hand-picked, the goal is to ensure that no conscious bias exists in the choice of samples; the method often fails, however, due to selection bias. Haphazard sampling is generally chosen for its convenience, when the tools or capacity to perform other sampling methods may not exist.
+The major weakness of such samples is that they often do not represent the characteristics of the entire population, but just a segment of the population. Because of this unbalanced representation, results from haphazard sampling are often biased.
+
+== Replacement of selected units ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-6.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-6.md
new file mode 100644
index 000000000..3495bb9a6
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-6.md
@@ -0,0 +1,77 @@
+---
+title: "Sampling (statistics)"
+chunk: 7/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+Sampling schemes may be without replacement ('WOR' – no element can be selected more than once in the same sample) or with replacement ('WR' – an element may appear multiple times in the same sample). For example, if we catch fish, measure them, and immediately return them to the water before continuing with the sample, this is a WR design, because we might end up catching and measuring the same fish more than once. However, if we do not return the fish to the water, or if we tag and release each fish after catching it so that it can be recognized and not measured again, this becomes a WOR design.
+
+== Sample size determination ==
+
+Formulas, tables, and power function charts are well-known approaches to determining sample size.
+Steps for using sample size tables:
+
+Postulate the effect size of interest, α, and β.
+Check sample size table
+Select the table corresponding to the selected α
+Locate the row corresponding to the desired power
+Locate the column corresponding to the estimated effect size.
+The intersection of the column and row is the minimum sample size required.
+
+== Sampling and data collection ==
+Good data collection involves:
+
+Following the defined sampling process
+Keeping the data in time order
+Noting comments and other contextual events
+Recording non-responses
+
+== Applications of sampling ==
+Sampling enables the selection of the right data points from within a larger data set to estimate the characteristics of the whole population. For example, about 600 million tweets are produced every day. It is not necessary to look at all of them to determine the topics that are discussed during the day, nor is it necessary to look at all the tweets to determine the sentiment on each of the topics. A theoretical formulation for sampling Twitter data has been developed.
+In manufacturing, different types of sensor data such as acoustics, vibration, pressure, current, voltage, and controller data are available at short time intervals. To predict downtime, it may not be necessary to look at all the data; a sample may be sufficient.
+
+== Errors in sample surveys ==
+
+Survey results are typically subject to some error. Total errors can be classified into sampling errors and non-sampling errors. The term "error" here includes systematic biases as well as random errors.
+
+=== Sampling errors and biases ===
+Sampling errors and biases are induced by the sample design. They include:
+
+Selection bias: When the true selection probabilities differ from those assumed in calculating the results.
+Random sampling error: Random variation in the results due to the elements in the sample being selected at random.
+
+=== Non-sampling error ===
+
+Non-sampling errors are other errors which can impact final survey estimates, caused by problems in data collection, processing, or sample design. Such errors may include:
+
+Over-coverage: inclusion of data from outside of the population
+Under-coverage: sampling frame does not include elements in the population.
+Measurement error: e.g., when respondents misunderstand a question, or find it difficult to answer
+Processing error: mistakes in data coding
+Non-response or Participation bias: failure to obtain complete data from all selected individuals
+After sampling, a review is held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis.
+A particular problem involves non-response. Two major types of non-response exist:
+
+unit non-response (lack of completion of any part of the survey)
+item non-response (submission or participation in survey but failing to complete one or more components/questions of the survey)
+In survey sampling, many of the individuals identified as part of the sample may be unwilling to participate, may not have the time to participate (opportunity cost), or may be impossible for survey administrators to contact. In this case, there is a risk of differences between respondents and nonrespondents, leading to biased estimates of population parameters. This is often addressed by improving survey design, offering incentives, and conducting follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame. The effects can also be mitigated by weighting the data (when population benchmarks are available) or by imputing data based on answers to other questions. Nonresponse is particularly a problem in internet sampling. Reasons for this problem may include improperly designed surveys, over-surveying (or survey fatigue),
+and the fact that potential participants may have multiple e-mail addresses, some of which they no longer use or do not check regularly.
+
+== Survey weights ==
+In many situations, the sampling fraction may be varied by stratum, and the data will have to be weighted to correctly represent the population. For example, a simple random sample of individuals in the United Kingdom might not include some in remote Scottish islands who would be inordinately expensive to sample. A cheaper method would be to use a stratified sample with urban and rural strata. The rural stratum could be under-represented in the sample, but weighted up appropriately in the analysis to compensate.
+More generally, data should usually be weighted if the sample design does not give each individual an equal chance of being selected. For instance, when households have equal selection probabilities but one person is interviewed from within each household, this gives people from large households a smaller chance of being interviewed. This can be accounted for using survey weights. Similarly, households with more than one telephone line have a greater chance of being selected in a random digit dialing sample, and weights can adjust for this.
+Weights can also serve other purposes, such as helping to correct for non-response.
+
+== Methods of producing random samples ==
+Random number table
+Mathematical algorithms for pseudo-random number generators
+Physical randomization devices such as coins, playing cards or sophisticated devices such as ERNIE
+
+== See also ==
+
+== Notes ==
+The textbook by Groves et alia provides an overview of survey methodology, including recent literature on questionnaire development (informed by cognitive psychology):
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Sampling_(statistics)-7.md b/data/en.wikipedia.org/wiki/Sampling_(statistics)-7.md
new file mode 100644
index 000000000..634d01704
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Sampling_(statistics)-7.md
@@ -0,0 +1,68 @@
+---
+title: "Sampling (statistics)"
+chunk: 8/8
+source: "https://en.wikipedia.org/wiki/Sampling_(statistics)"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:18.009917+00:00"
+instance: "kb-cron"
+---
+
+Robert Groves, et alia. Survey methodology (2010 2nd ed. [2004]) ISBN 0-471-48348-6.
+The other books focus on the statistical theory of survey sampling and require some knowledge of basic statistics, as discussed in the following textbooks:
+
+David S. Moore and George P. McCabe (February 2005). "Introduction to the practice of statistics" (5th edition). W.H. Freeman & Company. ISBN 0-7167-6282-X.
+Freedman, David; Pisani, Robert; Purves, Roger (2007). Statistics (4th ed.). New York: Norton. ISBN 978-0-393-92972-0.
+The elementary book by Scheaffer et alia uses quadratic equations from high-school algebra:
+
+Scheaffer, Richard L., William Mendenhall and R. Lyman Ott. Elementary survey sampling, Fifth Edition. Belmont: Duxbury Press, 1996.
+More mathematical statistics is required for Lohr, for Särndal et alia, and for Cochran:
+
+Cochran, William G. (1977). Sampling techniques (Third ed.). Wiley. ISBN 978-0-471-16240-7.
+Lohr, Sharon L. (1999). Sampling: Design and analysis. Duxbury. ISBN 978-0-534-35361-2.
+Särndal, Carl-Erik; Swensson, Bengt; Wretman, Jan (1992). Model assisted survey sampling. Springer-Verlag. ISBN 978-0-387-40620-6.
+The historically important books by Deming and Kish remain valuable for insights for social scientists (particularly about the U.S. census and the Institute for Social Research at the University of Michigan):
+
+Deming, W. Edwards (1966). Some Theory of Sampling. Dover Publications. ISBN 978-0-486-64684-8. OCLC 166526.
+Kish, Leslie (1995) Survey Sampling, Wiley, ISBN 0-471-10949-5
+
+== References ==
+
+== Further reading ==
+Singh, G N, Jaiswal, A. K., and Pandey A. K. (2021), Improved Imputation Methods for Missing Data in Two-Occasion Successive Sampling, Communications in Statistics: Theory and Methods. DOI:10.1080/03610926.2021.1944211
+Chambers, R L, and Skinner, C J (editors) (2003), Analysis of Survey Data, Wiley, ISBN 0-471-89987-9
+Deming, W. Edwards (1975) On probability as a basis for action, The American Statistician, 29(4), pp. 146–152.
+Gy, P (2012) Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing, Elsevier Science, ISBN 978-0444556066
+Korn, E.L., and Graubard, B.I. (1999) Analysis of Health Surveys, Wiley, ISBN 0-471-13773-1
+Lucas, Samuel R. (2012). "Beyond the Existence Proof: Ontological Conditions, Epistemological Implications, and In-Depth Interview Research". Quality & Quantity. doi:10.1007/s11135-012-9775-3.
+Stuart, Alan (1962) Basic Ideas of Scientific Sampling, Hafner Publishing Company, New York 
+Smith, T. M. F. (1984). "Present Position and Potential Developments: Some Personal Views: Sample surveys". Journal of the Royal Statistical Society, Series A. 147 (The 150th Anniversary of the Royal Statistical Society, number 2): 208–221. doi:10.2307/2981677. JSTOR 2981677.
+Smith, T. M. F. (1993). "Populations and Selection: Limitations of Statistics (Presidential address)". Journal of the Royal Statistical Society, Series A. 156 (2): 144–166. doi:10.2307/2982726. JSTOR 2982726. (Portrait of T. M. F. Smith on page 144)
+Smith, T. M. F. (2001). "Centenary: Sample surveys". Biometrika. 88 (1): 167–243. doi:10.1093/biomet/88.1.167.
+Smith, T. M. F. (2001). "Biometrika centenary: Sample surveys". In D. M. Titterington and D. R. Cox (ed.). Biometrika: One Hundred Years. Oxford University Press. pp. 165–194. ISBN 978-0-19-850993-6.
+Whittle, P. (May 1954). "Optimum preventative sampling". Journal of the Operations Research Society of America. 2 (2): 197–203. doi:10.1287/opre.2.2.197. JSTOR 166605.
+
+== Standards ==
+
+=== ISO ===
+ISO 2859 series
+ISO 3951 series
+
+=== ASTM ===
+ASTM E105 Standard Practice for Probability Sampling Of Materials
+ASTM E122 Standard Practice for Calculating Sample Size to Estimate, With a Specified Tolerable Error, the Average for Characteristic of a Lot or Process
+ASTM E141 Standard Practice for Acceptance of Evidence Based on the Results of Probability Sampling
+ASTM E1402 Standard Terminology Relating to Sampling
+ASTM E1994 Standard Practice for Use of Process Oriented AOQL and LTPD Sampling Plans
+ASTM E2234 Standard Practice for Sampling a Stream of Product by Attributes Indexed by AQL
+
+=== ANSI, ASQ ===
+ANSI/ASQ Z1.4
+
+=== U.S. federal and military standards ===
+MIL-STD-105
+MIL-STD-1916
+
+== External links ==
+
+ Media related to Sampling (statistics) at Wikimedia Commons
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Science_of_team_science-0.md b/data/en.wikipedia.org/wiki/Science_of_team_science-0.md
new file mode 100644
index 000000000..620b3948a
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Science_of_team_science-0.md
@@ -0,0 +1,53 @@
+---
+title: "Science of team science"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Science_of_team_science"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:19.132158+00:00"
+instance: "kb-cron"
+---
+
+Science of Team Science (SciTS) is a field of methodology that focuses on understanding and improving cross-disciplinary collaboration in research. The field encompasses conceptual and methodological strategies to understand how teams of scientific researchers can be organized to work more effectively. SciTS initiatives involve understanding and managing factors that affect collaborative science and assessing its outcomes.
+
+
+== History ==
+Since the 1990s, interest in and large-scale funding for team-based research initiatives have increased, driven by efforts to tackle complex problems through cross-disciplinary collaboration. Some argue that this trend reflects the growing recognition that addressing multifaceted challenges, such as climate change and public health issues, benefits from partnerships among scientists and practitioners from diverse fields. One SciTS literature review highlighted team science as essential to interprofessional collaborative research. The report called for its integration into health professions education and clinical practice at the University of Minnesota's National Center for Interprofessional Practice and Education.
+The interdisciplinary nature of SciTS initially emerged from concerns raised by funding agencies, which sought to assess the performance of team science, understand its added value, evaluate the return on investment in large research initiatives, and inform science policy. The term "science of team science" was first introduced in October 2006 at a conference titled The Science of Team Science: Assessing the Value of Transdisciplinary Research, hosted by the National Cancer Institute in Bethesda, Maryland. The SciTS field was further treated in a supplement to the American Journal of Preventive Medicine published in July 2008. The First Annual International Science of Team Science (SciTS) Conference was held on April 22–24, 2010, in Chicago, Illinois, organized by the Northwestern University Clinical and Translational Sciences (NUCATS) Institute.
+In 2013, the National Academy of Sciences established a National Research Council Committee on the Science of Team Science to evaluate the current state of knowledge and practice in SciTS. A committee report was published in 2015.
+In 2023, Patrick Forscher and colleagues published a review identifying the benefits of big team science, noting that innovations facilitate the collection of larger samples and support efforts toward reproducibility and generalizability. However, concerns exist that team science could increasingly influence funding priorities, potentially shifting emphasis from applied science to more theoretical research areas, as well as leading to unsuccessful large-scale projects. Forscher's recommendations included creating an advisory board and structured by-laws, formalizing feedback mechanisms from contributors, engaging in mentoring, and separating idea generation from project implementation.
+In 2026, the discourse shifted towards the management of Big Team Science (BTS). Vaidis and colleagues synthesised work from global collaborations such as the Psychological Science Accelerator, ManyBabies, ManyManys and ManyPrimates. Although BTS can address problems of statistical power and cross-cultural generalizability, several bureaucratic challenges arise. These require non-traditional management so that informal collaborative agreements can function as a high-fidelity organisational structure, including forward and backward translation, the use of Coordinated Universal Time (UTC) for global synchronisation, and decision-making protocols that maintain ethical and epistemological integrity in teams that exceed hundreds of members and labs and that address the key questions in science.
+
+
+== Methods ==
+The definition of a successful team may vary depending on stakeholders. SciTS uses both qualitative and quantitative methods to evaluate the antecedent conditions, collaborative processes, and outcomes associated with team science. It also considers the organizational, social, and political context that influences team science.
+A 2018 review of literature on SciTS published between 2006 and 2016 identified 109 articles. It reported that 75% of these articles used pre-existing data (e.g., archival data), 62% used bibliometrics, over 40% used surveys, and over 10% used interview and observational data.
+
+
+== See also ==
+Integrative learning
+Interactional expertise
+Interdisciplinarity
+Multidisciplinary
+Multidimensional network
+Transdisciplinary
+Global brain
+
+
+== References ==
+
+
+== Further reading ==
+Azoulay P, Graff Zivin JS, Wang J (2010). "Superstar Extinction". The Quarterly Journal of Economics. 125 (2): 549–589.
+Bennett LM, Gadlin H, Levine-Finley S (2010). "Collaboration and team science: a field guide" (PDF). Bethesda, Maryland: National Institutes of Health. Accessed May 28, 2010.
+Börner, Katy; Dall'Asta, Luca; Ke, Weimao; Vespignani, Alessandro (2005). "Studying the emerging global brain: Analyzing and visualizing the impact of co-authorship teams" (PDF). Complexity. 10 (4): 57–67. arXiv:cond-mat/0502147. Bibcode:2005Cmplx..10d..57B. doi:10.1002/cplx.20078. ISSN 1076-2787. S2CID 2190589.
+Contractor, Noshir (2009). "The Emergence of Multidimensional Networks". Journal of Computer-Mediated Communication. 14 (3): 743–747. doi:10.1111/j.1083-6101.2009.01465.x. ISSN 1083-6101.
+Cummings JN. "A socio-technical framework for identifying team science collaborations that could benefit from cyberinfrastructure". VOSS: National Science Foundation; 2009.
+Patel MM, Moseley TW, Nia ES, Perez F, Kapoor MM, Whitman GJ. "Team Science: A Practical Approach to Starting Collaborative Projects." J Breast Imaging. 2021 Jun 16;3(6):721-726. doi: 10.1093/jbi/wbab034. PMID 34805982; PMCID: PMC8599160.
+Stokols D, Taylor B, Hall K, Moser R (2006). "The science of team science: an overview of the field" (PDF). Bethesda, Maryland: National Cancer Institute. Accessed May 28, 2010.
+Rhoten D (2007). The dawn of networked science. The Chronicle Review. 54. Accessed May 28, 2010.
+
+
+== External links ==
+Annual International Science of Team Science Conference
+National Cancer Institute Science of Team Science website
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Science_on_the_Verge-0.md b/data/en.wikipedia.org/wiki/Science_on_the_Verge-0.md
new file mode 100644
index 000000000..d59241e18
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Science_on_the_Verge-0.md
@@ -0,0 +1,62 @@
+---
+title: "Science on the Verge"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Science_on_the_Verge"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:20.444844+00:00"
+instance: "kb-cron"
+---
+
+Science on the Verge is a book written in 2016 by a group of eight scholars working in the tradition of post-normal science. The book analyzes the main features and possible causes of the present crisis in science.
+
+
+== Content ==
+Science on the Verge, written by Alice Benessia, Silvio Funtowicz, Mario Giampietro, Ângela Guimarães Pereira, Jerome Ravetz, Andrea Saltelli, Roger Strand, and Jeroen P. van der Sluijs, with a preface by Dan Sarewitz, follows different threads of the present crisis in science (lost reproducibility, collapsing peer review, perverse metrics, the dysfunctions of evidence-based policy, techno-science and its hubris, science as metaphysics ...) and presents a first systematic analysis of the main causes of the present predicaments.
+
+Chapter 1. The chapter entitled 'Who will solve the crisis in science?' tackles the following questions:
+Is there a crisis?
+What is being done 'from within'? Is this sufficient?
+What are the diagnoses for the crisis' root causes, and
+what are the solutions 'from without'?
+Chapter 2. This chapter, entitled 'The fallacy of evidence based policy', tackles:
+Quantification as hypocognition;
+Socially constructed ignorance & uncomfortable knowledge;
+Ancien régime syndrome;
+Quantitative story telling.
+Chapter 3. This chapter, entitled 'Never late, never lost, never unprepared', treats:
+Trajectories of innovation and modes of demarcation of science from society:
+'separation',
+'hybridization' and
+'substitution';
+What contradictions these trajectories generate.
+Chapter 4. This chapter entitled 'Institutions on the verge' discusses:
+Working at the science policy interface;
+The special case of the European Commission's in house science service;
+The Joint Research Centre as a boundary institution;
+Diagnosis, challenges and perspectives.
+Chapter 5. This chapter entitled 'Numbers running wild' deals with:
+Uses and abuses of quantification;
+The loss of 'craft skills' with numbers;
+The case of '7.9% of all species shall become extinct'.
+Chapter 6. This chapter entitled 'Doubt has been eliminated' tells about:
+Gro Harlem Brundtland's famous 2007 speech, after the Fourth IPCC report and the Stern review;
+When science becomes a 'life philosophy';
+Science as the metaphysics of modernity;
+The inquiry of the Norwegian Research Ethics Committee for Science and Technology.
+
+
+== Reviews ==
+From Joseph A. Tainter, Professor of Sustainability, Utah State University, "Scientists working in the policy arena are often naïve about the impact of their findings. Producing rational results, scientists expect their research to be rationally accepted. The fundamental problem is that humans are not rational. We are emotional thinkers. Science on the Verge exposes many of the fallacies in science applied to societal problems. No matter how good the science, public issues are always ideological and political."  
+From Judith Curry, climatologist and former chair of the School of Earth and Atmospheric Sciences at the Georgia Institute of Technology, writing on her blog: "This book is a very rich resource for grappling with the problems with science in the 21st century – the articles themselves, as well as the extensive references. I expect to be using this book as a resource for a number future blog posts. This book deserves a wide audience, and I hope this blog post will help increase its reach."
+A post on the book can be found at the blog of the International Social Science Council in Paris. See also this article in The Guardian.
+
+
+== References ==
+
+
+== External links ==
+Page on Amazon 
+Page at CSPO
+San Francisco Declaration on Research Assessment (http://www.ascb.org/dora/), drafted by publishers, with separate recommendations for institutions, publishers, organizations that supply metrics, and researchers.
+Additional Material
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_collection-0.md b/data/en.wikipedia.org/wiki/Scientific_collection-0.md
new file mode 100644
index 000000000..857035d10
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_collection-0.md
@@ -0,0 +1,42 @@
+---
+title: "Scientific collection"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Scientific_collection"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:21.722777+00:00"
+instance: "kb-cron"
+---
+
+A scientific collection is a collection of items that are preserved, catalogued, and managed for the purpose of scientific study.
+Scientific collections dealing specifically with organisms (plants, fungi, animals, insects) and their remains may also be called natural history collections or biological collections. The latter may contain either living stocks or preserved repositories of biodiversity specimens and materials.
+Scientific collections hold a tangible portion of the cumulative evidence base in such fields as biology (especially taxonomy and evolutionary biology), geology, and archaeology.  They may be stored and managed by governments, educational institutions (e.g. colleges and universities), private organizations (including museums), or individuals.
+Prominent uses of scientific collections include the systematic description and identification of biological species, the study and prediction of long-term historical trends (including impacts of climate change), the dating and analysis of historical objects (e.g. via wood samples and ice cores with annual rings), and the maintenance of teaching resources.
+
+== Indexing ==
+
+Collections were historically indexed using directories, catalogs, and index cards; today these are supplemented or replaced by databases holding information such as the scientific description (including a picture), name, location, find circumstances, age of the find, scientific analysis, phylogenetic relationships, DNA and isotope analysis results, analysis of pollutants, references, condition of the object, changes of ownership, and name changes.
+Many organizations support the indexing and handling of their collections with specialist libraries.
+
+== Institutions ==
+
+Research collections are held especially by museums, notably natural history museums, as well as by botanical gardens, universities and other research institutions. There are also independent research collections, such as the Zoological State Collection Munich with over 20 million stuffed animals for research purposes. Public authorities such as national geological agencies or police units also hold some research collections.
+The Natural History Museum in London - with one of the biggest collections worldwide - is home to life and earth science specimens comprising some 70 million items within five main collections: botany, entomology, mineralogy, palaeontology and zoology.
+The largest German natural history museum is the Museum für Naturkunde in Berlin, with over 30 million objects, including 9 million beetles and 275,000 jars of animals preserved in alcohol.
+
+== Geology / Earth Sciences collections ==
+Notable Earth Sciences collections:
+
+In Germany, the Technische Universität Bergakademie Freiberg has particularly rich geological research collections. These include 80,000 minerals in the Mineralogical Collection, 120,000 samples from deposits in the deposit collection, 16,000 rocks in the petrological collection, 114,000 macrofossils and almost a million microfossils in the fossil collection, 70,000 macrofossils, 12,000 microfossils, 15,000 lithostratigraphically or facies-relevant rock samples and about 14,000 specimens and sections in the Stratigraphic collection, 30,000 pieces of evidence and 30,000 preparations and sections in the Fuel Geological Collection, and 34,000 objects in the central body of evidence, the Lithothek.
+The IODP/ODP Kernlager / Bremen Core Repository (BCR) at the University of Bremen has a collection of 140 km of drill core from the Ocean Drilling Program with 190,000 individual pieces, which are stored in a 1,100 m² cold storage facility at 4 °C.
+The Musée de Minéralogie, a museum of mineralogy operated by the École nationale supérieure des mines de Paris (Mines ParisTech), contains some 100,000 samples including 80,000 minerals, 15,000 rocks, 4,000 ores, 400 meteorites, 700 gems, and 300 artificial minerals.
+The most important geological collections in Europe include the Museum of the Earth of the Polish Academy of Sciences in Warsaw, with more than 170,000 minerals, meteorites and fossils, and an amber collection.
+
+== Biological collections / Life Sciences collections ==
+Typical collection objects in biology are fossils of organisms and preserved samples of extant animals and plants (protected from decay by drying or preparation), but also live plants, animals, bacteria and active viruses.
+Plant collections are referred to as herbaria. Live plants are kept in botanical gardens, arboretums (trees), aquariums and, in part, seed banks; algae, for example, are kept in the Culture Collection of Algae in Göttingen. Live animals are kept in zoos and aquariums.
+The Old Botanical Garden of the University of Göttingen, for example, holds a collection of about 17,000 species.
+Particularly well known in Germany are the major research collections of the Naturmuseum Senckenberg of the Senckenberg Society for Nature Research in Frankfurt am Main, with over 22 million natural objects (including 1 million herbarium specimens). Senckenberg makes its collections accessible through the SESAM database.
+The Macaulay Library is the world's largest archive of animal sounds. It includes more than 175,000 audio recordings covering 75 percent of the world's bird species. It also holds an ever-increasing number of insect, fish, frog, and mammal recordings. The video archive includes over 50,000 clips, representing over 3,500 species.
+An example of a special collection is the holdings of the Deutsche Sammlung von Mikroorganismen und Zellkulturen (German Collection of Microorganisms and Cell Cultures).
+Notable large biological collections (more than 1,000,000 specimens) in Europe are:
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_collection-1.md b/data/en.wikipedia.org/wiki/Scientific_collection-1.md
new file mode 100644
index 000000000..f91fb6f2a
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_collection-1.md
@@ -0,0 +1,58 @@
+---
+title: "Scientific collection"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Scientific_collection"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:21.722777+00:00"
+instance: "kb-cron"
+---
+
+in France: Muséum National d'Histoire Naturelle, Paris contains 9,500,000 specimens; Université Montpellier contains 4,000,000 specimens; Université Claude Bernard Lyon 1 contains 4,000,000 specimens
+in Russia: Komarov Botanical Institute in St. Petersburg contains 7,160,000 specimens
+in Great Britain: Royal Botanic Gardens Kew contains 7,000,000 specimens; British Museum of Natural History contains 80,000,000 specimens; Royal Botanic Garden Edinburgh contains 2,000,000 specimens; University of Cambridge contains 1,000,000 specimens; University of Manchester contains 1,000,000 specimens
+in Switzerland: Conservatoire et Jardin botaniques de la Ville de Genève contains 6,000,000 specimens; Eidgenössische Technische Hochschule Zürich contains 1,500,000 specimens
+in Austria: Naturhistorisches Museum Wien contains 5,000,000 specimens; Universität Wien contains 1,400,000 specimens
+in Sweden: Swedish Museum of Natural History (Naturhistoriska riksmuseet) contains 4,400,000 specimens; Uppsala University contains 3,000,000 specimens; Botanical Museum, Lund contains 2,500,000 specimens; University of Gothenburg contains 1,600,000 specimens
+in the Netherlands: National Herbarium of the Netherlands (Nationaal Herbarium Nederland) contains 4,000,000 specimens
+in Italy: Museo di Storia Naturale dell'Università, Florence contains 3,650,000 specimens; Università degli Studi di Roma La Sapienza, Rome contains 1,120,000 specimens; Università degli Studi di Torino contains 1,000,000 specimens
+in Belgium: National Botanic Garden of Belgium contains 3,500,000 specimens
+in Germany: Botanischer Garten und Botanisches Museum Berlin-Dahlem, Zentraleinrichtung der Freien Universität Berlin contains 3,000,000 specimens; University of Jena contains 3,000,000 specimens; Botanische Staatssammlung München contains 2,500,000 specimens; Biozentrum Klein-Flottbek, Hamburg contains 1,400,000 specimens; Staatliches Museum für Naturkunde, Stuttgart contains 1,000,000 specimens
+in Finland: University of Helsinki contains 3,000,000 specimens
+in Denmark: University of Copenhagen contains 2,510,000 specimens
+in Norway: Botanical Museum, Oslo contains 1,800,000 specimens
+See also: List of herbaria in Europe
+Notable large biological collections (more than 1,000,000 specimens) in the Americas are:
+
+in the United States: New York Botanical Garden Herbarium contains 7,200,000 specimens; Missouri Botanical Garden Herbarium contains 6,231,000 specimens; Harvard University Herbaria contains 5,005,000 specimens; United States National Herbarium, Washington contains 4,340,000 specimens; Field Museum, Chicago contains 2,650,000 specimens; University and Jepson Herbaria of University of California, Berkeley contains 2,200,000 specimens; California Academy of Sciences Herbarium of California Academy of Sciences contains 1,850,000 specimens; University of Michigan Herbarium, Ann Arbor contains 1,700,000 specimens; Academy of Natural Sciences Herbarium of Academy of Natural Sciences, Philadelphia contains 1,430,000 specimens; Wisconsin State Herbarium of University of Wisconsin-Madison, Madison contains 1,100,000 specimens; Rancho Santa Ana Botanic Garden Herbarium of Rancho Santa Ana Botanic Garden, Claremont contains 1,084,000 specimens; Plant Resources Center of University of Texas at Austin contains 1,006,000 specimens; Botanical Research Institute of Texas, Fort Worth contains 1,000,000 specimens
+in Canada: Agriculture and Agri-Food Canada, Vascular Plant Herbarium contains 1,335,000 specimens
+in Mexico: Universidad Nacional Autónoma de México, Mexico City contains 1,120,000 specimens
+See also: List of herbaria in North America
+For notable large biological collections worldwide, see: List of herbaria
+
+== History / Human Heritage collections ==
+Dendrochronology lies on the border between biology and history. An annual ring table, or tree-ring calendar, is a time series of the tree rings of a tree species used in dendrochronology. Because of the specific growth of each tree species and regional differences in climate, such a table must always refer to a single species from the same region. Important tree chronologies are:
+
+Hohenheimer Jahrringkalender (Hohenheim tree-ring calendar), a complete record of 12,483 years, back to 10,480 BC in the Younger Dryas
+Aegean Dendrochronology Project, back to 1800 BC (Bronze Age)
+Belfast Chronology, back to 5474 BC
+English standard curve, back to 5012 BC
+Bristlecone Pine Chronology, extending back 8,500 years, for the bristlecone pine in the southwestern US (White Mountains of California)
+Sequoiadendron giganteum Chronology
+Notable history collections:
+
+The National Numismatic Collection is the national coin cabinet of the United States. The collection is part of the Smithsonian Institution's National Museum of American History. The collection contains over 1.6 million objects, including 450,000 coins, medals and decorations and 1.1 million pieces of paper money.
+The Norwegian folk music series is a scientific collection of traditional Norwegian dance music.
+
+== Literature ==
+Pamela Ebert Flattau (Project Leader), Margaret Boeckmann, Paul Lagassla, Nyema Mitchell, Darius Singpurwalla: Preliminary Findings from the NSF Survey of Object-Based Scientific Collections: 2008 (pdf)
+
+== See also ==
+Chemical library
+Collecting
+
+== References ==
+
+== Links ==
+http://wissenschaftliche-sammlungen.de/de/netzwerk/initiativen/
+http://www.plastercastcollection.org/en/index.php
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_control-0.md b/data/en.wikipedia.org/wiki/Scientific_control-0.md
new file mode 100644
index 000000000..2791e9708
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_control-0.md
@@ -0,0 +1,27 @@
+---
+title: "Scientific control"
+chunk: 1/4
+source: "https://en.wikipedia.org/wiki/Scientific_control"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:22.864522+00:00"
+instance: "kb-cron"
+---
+
+A scientific control is an element of an experiment or observation designed to minimize the influence of variables other than the independent variable under investigation, thereby reducing the risk of confounding.  
+The use of controls increases the reliability and validity of results by providing a baseline for comparison between experimental measurements and control measurements. In many designs, the control group does not receive the experimental treatment, allowing researchers to isolate the effect of the independent variable.  
+Scientific controls are a fundamental part of the scientific method, particularly in fields such as biology, chemistry, medicine, and psychology, where complex systems are subject to multiple interacting variables.
+
+== Controlled experiments ==
+
+Controls eliminate alternate explanations of experimental results, especially experimental errors and experimenter bias. Many controls are specific to the type of experiment being performed, as in the molecular markers used in SDS-PAGE experiments, and may simply have the purpose of ensuring that the equipment is working properly. The selection and use of proper controls to ensure that experimental results are valid (for example, absence of confounding variables) can be very difficult. Control measurements may also be used for other purposes: for example, a measurement of a microphone's background noise in the absence of a signal allows the noise to be subtracted from later measurements of the signal, thus producing a processed signal of higher quality.
+For example, if a researcher feeds an experimental artificial sweetener to sixty laboratory rats and observes that ten of them subsequently become sick, the underlying cause could be the sweetener itself or something unrelated. Other variables, which may not be readily obvious, may interfere with the experimental design. For instance, the artificial sweetener might be mixed with a dilutant, and it might be the dilutant that causes the effect. To control for the effect of the dilutant, the same test is run twice: once with the artificial sweetener in the dilutant, and once in exactly the same way but using the dilutant alone. Now the experiment is controlled for the dilutant, and the experimenter can distinguish between sweetener, dilutant, and non-treatment. Controls are most often necessary where a confounding factor cannot easily be separated from the primary treatments. For example, it may be necessary to use a tractor to spread fertilizer where there is no other practicable way to spread it. The simplest solution is to include a treatment in which a tractor is driven over plots without spreading fertilizer; in that way, the effects of tractor traffic are controlled.
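+The three-arm design in the sweetener example can be written as two contrasts. As a sketch on an additive scale, let $p_{SD}$, $p_{D}$ and $p_{0}$ (symbols introduced here for illustration, not taken from the source) denote the proportion of rats that become sick when given the sweetener in the dilutant, the dilutant alone, and no treatment:
+
+$$\text{effect of dilutant} = p_{D} - p_{0}, \qquad \text{effect of sweetener} = p_{SD} - p_{D}.$$
+
+Without the dilutant-only arm, only $p_{SD} - p_{0}$ could be observed, which is the sum of the two contrasts and cannot separate them.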
+The simplest types of control are negative and positive controls, and both are found in many different types of experiments. These two controls, when both are successful, are usually sufficient to eliminate most potential confounding variables: it means that the experiment produces a negative result when a negative result is expected, and a positive result when a positive result is expected. Other controls include vehicle controls, sham controls and comparative controls.
+
+== Confounding ==
+Confounding is a critical issue in observational studies because it can lead to biased or misleading conclusions about relationships between variables. A confounder is an extraneous variable that is related to both the independent variable (treatment or exposure) and the dependent variable (outcome), potentially distorting the true association. If confounding is not properly accounted for, researchers might incorrectly attribute an effect to the exposure when it is actually due to another factor. This can result in incorrect policy recommendations, ineffective interventions, or flawed scientific understanding. For example, in a study examining the relationship between physical activity and heart disease, failure to control for diet, a potential confounder, could lead to an overestimation or underestimation of the true effect of exercise.
+Falsification tests are a robustness-checking technique used in observational studies to assess whether observed associations are likely due to confounding, bias, or model misspecification rather than a true causal effect. These tests help validate findings by applying the same analytical approach to a scenario where no effect is expected. If an association still appears where none should exist, it raises concerns that the primary analysis may suffer from confounding or other biases.
+Negative controls are one type of falsification test. The need for negative controls usually arises in observational studies, when the study design can be questioned because of a potential confounding mechanism. A negative control test can reject a study design, but it cannot validate one, either because there might be another confounding mechanism or because of low statistical power. Negative controls are increasingly used in the epidemiology literature, and they also show promise in social science fields such as economics. Negative controls are divided into two main categories: Negative Control Exposures (NCEs) and Negative Control Outcomes (NCOs).
+Lousdal et al. examined the effect of screening participation on death from breast cancer. They hypothesized that screening participants are healthier than non-participants and, therefore, already at baseline have a lower risk of breast-cancer death. Therefore, they used proxies for better health as negative-control outcomes (NCOs) and proxies for healthier behavior as negative-control exposures (NCEs). Death from causes other than breast cancer was taken as the NCO, as it is an outcome of better health that is not affected by breast cancer screening. Dental care participation was taken as the NCE, as it is assumed to be a good proxy for health-attentive behavior.
+
+== Negative control ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_control-1.md b/data/en.wikipedia.org/wiki/Scientific_control-1.md
new file mode 100644
index 000000000..b05d33829
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_control-1.md
@@ -0,0 +1,294 @@
+---
+title: "Scientific control"
+chunk: 2/4
+source: "https://en.wikipedia.org/wiki/Scientific_control"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:22.864522+00:00"
+instance: "kb-cron"
+---
+
+Negative controls are variables that are meant to help when the study design is suspected to be invalid because of unmeasured confounders that are correlated with both the treatment and the outcome. Where there are only two possible outcomes, e.g. positive or negative, if the treatment group and the negative control (non-treatment group) both produce a negative result, it can be inferred that the treatment had no effect. If the treatment group and the negative control both produce a positive result, it can be inferred that a confounding variable is involved in the phenomenon under study, and the positive results are not solely due to the treatment.
+In other examples, outcomes might be measured as lengths, times, percentages, and so forth. In the drug testing example, we could measure the percentage of patients cured. In this case, the treatment is inferred to have no effect when the treatment group and the negative control produce the same results. Some improvement is expected in the placebo group due to the placebo effect, and this result sets the baseline upon which the treatment must improve. Even if the treatment group shows improvement, it needs to be compared to the placebo group. If the groups show the same effect, then the treatment was not responsible for the improvement (because the same number of patients were cured in the absence of the treatment). The treatment is only effective if the treatment group shows more improvement than the placebo group.
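+As a worked illustration with hypothetical numbers (not taken from any study, and ignoring sampling variability): if 60 of 100 patients are cured in the treatment group and 45 of 100 in the placebo group, the improvement attributable to the treatment is estimated by the difference in cure rates,
+
+$$\frac{60}{100} - \frac{45}{100} = 0.15,$$
+
+that is, 15 percentage points. The 45% cured under placebo is the baseline; had the treatment group also shown 45%, the difference would be zero and no treatment effect would be inferred.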
+
+=== Negative Control Exposure (NCE) ===
+
+An NCE is a variable that should not causally affect the outcome, but may suffer from the same confounding as the exposure-outcome relationship in question. A priori, there should be no statistical association between the NCE and the outcome. If an association is found, then it arises through the unmeasured confounder; and since the NCE and the treatment share the same confounding mechanism, there is an alternative path between the treatment and the outcome, apart from the direct causal path. In that case, the study design is invalid.
+For example, Yerushalmy used husband's smoking as an NCE. The exposure was maternal smoking; the outcomes were various birth factors, such as incidence of low birth weight, length of pregnancy, and neonatal mortality rates. It is assumed that the husband's smoking shares common confounders, such as the household's health lifestyle, with the pregnant woman's smoking, but does not causally affect fetal development. Nonetheless, Yerushalmy found a statistical association, which casts doubt on the proposition that cigarette smoking causally interferes with intrauterine development of the fetus.
+
+==== Differences Between Negative Control Exposures and Placebo ====
+The term negative control is used when the study is based on observations, while a placebo is used as the non-treatment in randomized controlled trials.
+
+=== Negative Control Outcome (NCO) ===
+
+Negative Control Outcomes are the more popular type of negative controls. An NCO is a variable that is not causally affected by the treatment, but is suspected to have a similar confounding mechanism to the treatment-outcome relationship. If the study design is valid, there should be no statistical association between the NCO and the treatment. Thus, an association between them suggests that the design is invalid.
+For example, Jackson et al. used mortality from all causes outside of influenza season as an NCO in a study examining the influenza vaccine's effect on influenza-related deaths. A possible confounding mechanism is health status and lifestyle: people who are healthier in general also tend to take the influenza vaccine. Jackson et al. found a preferential receipt of vaccine by relatively healthy seniors, and that differences in health status between vaccinated and unvaccinated groups lead to bias in estimates of influenza vaccine effectiveness. In a similar example, when discussing the impact of air pollutants on asthma hospital admissions, Sheppard et al. used non-elderly appendicitis hospital admissions as an NCO.
+
+==== Formal Conditions ====
+Given a treatment $A$ and an outcome $Y$, in the presence of a set of control variables $X$ and an unmeasured confounder $U$ for the $A$-$Y$ relationship, Shi et al. presented formal conditions for a negative control outcome $\tilde{Y}$:
+
+Stable Unit Treatment Value Assumption (SUTVA): For both $Y$ and $\tilde{Y}$ with regard to $A=a$.
+Latent Exchangeability: $Y^{A=a} \perp A \mid X, U$. Given $X$ and $U$, the potential outcome $Y^{A=a}$ is independent of the treatment.
+Irrelevancy: Ensures the irrelevancy of the treatment on the NCO.
+
+$\tilde{Y}^{A=a} = \tilde{Y}^{A=a'} = \tilde{Y} \mid U, X$: There is no causal effect of $A$ on $\tilde{Y}$ given $X$ and $U$.
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_control-2.md b/data/en.wikipedia.org/wiki/Scientific_control-2.md
new file mode 100644
index 000000000..88ccb2d96
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_control-2.md
@@ -0,0 +1,454 @@
+---
+title: "Scientific control"
+chunk: 3/4
+source: "https://en.wikipedia.org/wiki/Scientific_control"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:22.864522+00:00"
+instance: "kb-cron"
+---
+
+$\tilde{Y} \perp A \mid U, X$: There is no causal effect of $A$ on $\tilde{Y}$ given $X$ and $U$. The NCO is independent of the treatment given $X$ and $U$.
+U-Comparability: $\tilde{Y} \not\perp U \mid X$. The unmeasured confounders $U$ of the association between $A$ and $Y$ are the same for the association between $A$ and $\tilde{Y}$.
+Given assumptions 1 to 4 (SUTVA, Latent Exchangeability, Irrelevancy, and U-Comparability), a non-null association between $A$ and $\tilde{Y}$ can be explained by $U$, and not by another mechanism. A possible violation of Latent Exchangeability occurs when only the people who are influenced by a medicine take it, even if both $X$ and $U$ are the same. For example, we would expect that, given age and medical history ($X$) and general health awareness ($U$), the intake $A$ of influenza vaccine is independent of potential influenza-related deaths $\tilde{Y}^{A=a}$. Otherwise, the Latent Exchangeability assumption is violated, and no identification can be made.
+A violation of Irrelevancy occurs when there is a causal effect of $A$ on $\tilde{Y}$. For example, we would expect that, given $X$ and $U$, the influenza vaccine does not influence all-cause mortality. Suppose, however, that during the influenza vaccine medical visit the physician also performs a general physical examination, recommends good health habits, and prescribes vitamins and essential drugs. In this case, there is likely a causal effect of $A$ on $\tilde{Y}$ (conditional on $X$ and $U$). Therefore, $\tilde{Y}$ cannot be used as an NCO, as the test might fail even if the causal design is valid.
+U-Comparability is violated when $\tilde{Y} \perp U$; in that case, the lack of association between $A$ and $\tilde{Y}$ is uninformative about the validity of the design for $A$. This violation would occur when we choose a poor NCO that is uncorrelated, or only very weakly correlated, with the unmeasured confounders.
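+The logic of the NCO test can be sketched with the law of total probability. This is an illustrative derivation, not taken from the source, and it assumes for simplicity that $U$ is discrete. Under Irrelevancy, $\tilde{Y} \perp A \mid U, X$, so
+
+$$P(\tilde{Y}=\tilde{y} \mid A=a, X=x) = \sum_{u} P(\tilde{Y}=\tilde{y} \mid U=u, X=x)\, P(U=u \mid A=a, X=x).$$
+
+If $U$ did not confound the treatment, i.e. $A \perp U \mid X$, the last factor would equal $P(U=u \mid X=x)$ and the right-hand side would reduce to $P(\tilde{Y}=\tilde{y} \mid X=x)$, which does not depend on $a$. An observed dependence of $\tilde{Y}$ on $A$ given $X$ therefore implies $A \not\perp U \mid X$. Conversely, if $\tilde{Y} \perp U \mid X$, the first factor does not depend on $u$, the sum is again free of $a$, and the test cannot detect confounding.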
+
+== Positive control ==
+
+Positive controls are often used to assess test validity. For example, to assess a new test's ability to detect a disease (its sensitivity), we can compare it against a different test that is already known to work. The well-established test is a positive control since we already know that the answer to the question (whether the test works) is yes.
+Similarly, in an enzyme assay to measure the amount of an enzyme in a set of extracts, a positive control would be an assay containing a known quantity of the purified enzyme (while a negative control would contain no enzyme). The positive control should give a large amount of enzyme activity, while the negative control should give very low to no activity.
+If the positive control does not produce the expected result, there may be something wrong with the experimental procedure, and the experiment is repeated. For difficult or complicated experiments, the result from the positive control can also help in comparison to previous experimental results. For example, if the well-established disease test was determined to have the same effect as found by previous experimenters, this indicates that the experiment is being performed in the same way that the previous experimenters did.
+When possible, multiple positive controls may be used—if there is more than one disease test that is known to be effective, more than one might be tested. Multiple positive controls also allow finer comparisons of the results (calibration, or standardization) if the expected results from the positive controls have different sizes. For example, in the enzyme assay discussed above, a standard curve may be produced by making many different samples with different quantities of the enzyme.
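+A standard curve of this kind is often approximated by a straight line over the working range. As a sketch with hypothetical numbers (the linear form and the values are assumptions for illustration): suppose standards containing $q = 0, 2, 4$ units of enzyme give signals $s = 0.1, 0.5, 0.9$. These lie on the line
+
+$$s = 0.1 + 0.2\,q,$$
+
+so an extract with measured signal $s = 0.7$ is estimated to contain $q = (0.7 - 0.1)/0.2 = 3$ units.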
+
+== Randomization ==
+
+In randomization, the groups that receive different experimental treatments are determined randomly. While this does not ensure that there are no differences between the groups, it ensures that the differences are distributed equally, thus correcting for systematic errors.
+For example, in experiments where crop yield is affected by a factor other than the treatment (e.g. soil fertility), the experiment can be controlled by assigning the treatments to randomly selected plots of land. This mitigates the effect of variations in soil composition on the yield.
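+Why randomization removes this kind of bias can be seen in a simple linear sketch (the model and symbols are assumptions for illustration, not from the source). Let the yield of a plot be $Y = \tau T + \beta S + \varepsilon$, where $T \in \{0,1\}$ indicates treatment, $S$ is soil fertility, and $\varepsilon$ is mean-zero noise independent of $T$. Then
+
+$$E[Y \mid T=1] - E[Y \mid T=0] = \tau + \beta\,\bigl(E[S \mid T=1] - E[S \mid T=0]\bigr).$$
+
+Random assignment makes $T$ independent of $S$, so the term in parentheses is zero and the difference in mean yields equals the treatment effect $\tau$.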
+
+== Blind experiments ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_control-3.md b/data/en.wikipedia.org/wiki/Scientific_control-3.md
new file mode 100644
index 000000000..b858e4e3f
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_control-3.md
@@ -0,0 +1,26 @@
+---
+title: "Scientific control"
+chunk: 4/4
+source: "https://en.wikipedia.org/wiki/Scientific_control"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:22.864522+00:00"
+instance: "kb-cron"
+---
+
+Blinding is the practice of withholding information that may bias an experiment. For example, participants may not know who received an active treatment and who received a placebo. If this information were to become available to trial participants, patients could receive a larger placebo effect, researchers could influence the experiment to meet their expectations (the observer effect), and evaluators could be subject to confirmation bias. A blind can be imposed on any participant of an experiment, including subjects, researchers, technicians, data analysts, and evaluators. In some cases, sham surgery may be necessary to achieve blinding.
+During the course of an experiment, a participant becomes unblinded if they deduce or otherwise obtain information that has been masked to them. Unblinding that occurs before the conclusion of a study is a source of experimental error, as the bias that was eliminated by blinding is re-introduced. Unblinding is common in blind experiments and must be measured and reported. Meta-research has revealed high levels of unblinding in pharmacological trials. In particular, antidepressant trials are poorly blinded. Reporting guidelines recommend that all studies assess and report unblinding. In practice, very few studies assess unblinding.
+Blinding is an important tool of the scientific method, and is used in many fields of research. In some fields, such as medicine, it is considered essential. In clinical research, a trial that is not blinded is called an open trial.
+
+== See also ==
+False positives and false negatives
+Designed experiment
+Controlling for a variable
+James Lind
+Randomized controlled trial
+Wait list control group
+
+== References ==
+
+== External links ==
+"Control". Encyclopædia Britannica. Vol. 7 (11th ed.). 1911.
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_dissent-0.md b/data/en.wikipedia.org/wiki/Scientific_dissent-0.md
new file mode 100644
index 000000000..75453cf1c
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_dissent-0.md
@@ -0,0 +1,56 @@
+---
+title: "Scientific dissent"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Scientific_dissent"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:24.085084+00:00"
+instance: "kb-cron"
+---
+
+Scientific dissent is dissent from scientific consensus. Disagreements can be useful for finding problems in underlying assumptions, methodologies, and reasoning, as well as for generating and testing new ways of tackling the unknown. In modern times, with the increased role of science in society and the politicization of science, a new aspect has gained prominence: the effects of scientific dissent on public policy.
+Scientific dissent is distinct from denialism, which is a deliberate rejection of scientific consensus usually for commercial or ideological reasons.
+
+
+== Dissent as part of scientific inquiry ==
+
+Miriam Solomon, in her book Social Empiricism, argues that scientific dissent is the normal state of scientific inquiry, rather than a conflict that needs resolution. She argues that disagreements between individual scientists about the proper direction of research are not cause for concern, because scientific rationality must be assessed at the level of the scientific community. As long as all theories being pursued yield some unique empirical successes, Solomon argues that their pursuit is worthwhile and consistent with the common view that science aims at truth. In Solomon's view, competing scientific theories can even be inconsistent with one another while each containing some degree of truth. Empirical evidence may not be sufficient to distinguish between competing theories, and successful theories often have core assumptions that are incorrect.
+
+
+=== Historical scientific dissent ===
+A number of famous scientists have been sceptical of what were, or came to be, mainstream scientific positions. For example, Ernst Mach famously declared in 1897: "I don't believe that atoms exist!" Wilhelm Ostwald expressed a similar scepticism about atoms, but changed his mind in 1908.
+In the early 20th century, peptic ulcers were believed to be caused by stress and dietary factors. The physicians Robin Warren and Barry Marshall showed in 1982 that the bacterium Helicobacter pylori was responsible, but the medical community was slow to make appropriate changes in ulcer treatment.
+
+
+== Suppressed dissent ==
+Scientific debate is a healthy and necessary part of science, but it may collide with power dynamics within the academic world. Suppression of legitimate scientific debate can be considered a breach of academic integrity. Examples of suppression include journal editors rejecting a paper for political reasons prior to peer review, refusing access to data for research which might draw negative conclusions about the safety of some commercial product, and putting pressure on a university to fire a dissenting researcher.
+
+
+== False scientific dissent ==
+
+In modern times, proponents of science denialism, pseudoscience, and conspiracy theories often try to disguise their viewpoints as "scientific dissent" to take advantage of the benefit of the doubt. Such cases are typically recognized by the lack of crucial elements of the scientific approach: an insufficient evidence base, lack of rigor and control, etc.
+Lack of discussion of claims coming from fringe science may be presented as suppression by mainstream science. This has been described as "manufacturing dissent" and discussed in the context of neo-creationism.
+In the introduction to Creating Scientific Controversies, David Harker summarizes the history of how the tobacco industry worked towards manufacturing a controversy regarding the health effects of tobacco.
+In what is sometimes known as the "Galileo gambit", pseudoscientists will sometimes compare themselves to Galileo, arguing that opposition from established scientists is actually a point in favour of their ideas. Jean Paul Van Bendegem writes that "No doubt the most famous example of mistaken analogy is the abuse of Galileo Galilei's case resulting in his conviction by the Holy Inquisition. The basic strategy consists of equating Galileo with the poor astrologer or parapsychologist and equating the Inquisition with the scientific establishment."
+
+
+== Effect on modern public policies ==
+
+Views which disagree with scientific consensus may have an adverse effect on the perception of science by the general public and affect decision making in various policies. When prominently promoted without due proportion, dissenting views can create an impression of uncertainty among laypeople. Common examples of such situations include the global warming controversy and issues of public health and genetically modified organisms. Therefore, scientists treat scientific dissent as problematic when it may have a significant impact on the public and on policy-making, and try to mitigate it.
+Inmaculada de Melo-Martín and Kristen Intemann criticize three major strategies in battling allegedly dangerous scientific dissent: masking the dissent, silencing the dissent, and discrediting the dissenters. De Melo-Martín and Intemann argue that these strategies come from a misdiagnosis: the real problem is not dissent, but public scientific illiteracy. Rather than focusing on dissent, scientists must concentrate on educating the general public, so that people can form educated opinions and recognize false claims and invalid arguments. They further argue that silencing dissent rather than promoting literacy incurs the risk of undermining the public trust in science.
+Sheila Jasanoff, in the context of climate change, mentions a common argument that public opinion is poorly informed because the petroleum industry manufactures uncertainty and the media exaggerate the dissent, but argues that it is insufficient for understanding the problem. She writes that studies of scientific controversies show that the credibility of science depends not only on strong scientific consensus, but also on the persuasive power of those who speak for science, especially in situations of controversy.
+
+
+== See also ==
+Critical thinking
+Criticism of science
+Denialism
+Metascience
+Scientific skepticism
+Suppression of dissent
+
+
+== References ==
+
+
+== Further references ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_enterprise-0.md b/data/en.wikipedia.org/wiki/Scientific_enterprise-0.md
new file mode 100644
index 000000000..4f4eda614
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_enterprise-0.md
@@ -0,0 +1,34 @@
+---
+title: "Scientific enterprise"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Scientific_enterprise"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:25.275542+00:00"
+instance: "kb-cron"
+---
+
+A scientific enterprise is a science-based project developed by, or in cooperation with, a private entrepreneur. For example, in the Age of Exploration, leaders like Henry the Navigator founded schools of navigation, from which stemmed voyages of exploration.
+
+
+== Examples of enterprising scientific organizations ==
+Each organization listed below has the ability to conduct scientific research on an extended basis, involving multiple researchers over an extended time. Generally, the research is funded not only for the science itself, but for some application which shows promise for the enterprise. But the researchers, if left to their own choices, will tend to follow their research interest, which is essential for the long-term health of their chosen field. Note that a successful scientific enterprise is not equivalent to a successful high-tech enterprise or to a successful business enterprise, but that they form an ecology, a food chain.
+
+The Max Planck Institute, which supports fundamental research in the natural, life and social sciences, the arts and humanities
+The RAND Corporation, founded as a research corporation, which did groundbreaking work in the field of artificial intelligence
+The Jet Propulsion Laboratory, founded by Theodore von Kármán
+Bell Laboratories, renowned for the quality of its scientific work and for inventing the operating system Unix, and the programming languages C and C++
+Xerox PARC, a research organization which has spawned innovations such as the graphical user interface, laser printing, and Ethernet
+SRI, explicitly founded with a research basis, where the computer mouse was invented
+IBM Research, notable for inventing the relational database, hard disk drive, and floppy disk
+Hughes Research Laboratories, location of the invention of the first working laser
+The Howard Hughes Medical Institute, a non-profit medical research organization
+
+
+== See also ==
+Science of science policy (SoSP)
+
+
+== References ==
+Gerald Holton, Einstein, History, and Other Passions
+John Ziman, Reliable Knowledge
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_evidence-0.md b/data/en.wikipedia.org/wiki/Scientific_evidence-0.md
new file mode 100644
index 000000000..eba609198
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_evidence-0.md
@@ -0,0 +1,21 @@
+---
+title: "Scientific evidence"
+chunk: 1/2
+source: "https://en.wikipedia.org/wiki/Scientific_evidence"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:26.492214+00:00"
+instance: "kb-cron"
+---
+
+Scientific evidence is evidence that serves to either support or counter a scientific theory or hypothesis, although scientists also use evidence in other ways, such as when applying theories to practical problems. Such evidence is expected to be empirical evidence and interpretable in accordance with the scientific method. Standards for scientific evidence vary according to the field of inquiry, but the strength of scientific evidence is generally based on the results of statistical analysis and the strength of scientific controls.
+
+== Principles of inference ==
+A person's assumptions or beliefs about the relationship between observations and a hypothesis will affect whether that person takes the observations as evidence. These assumptions or beliefs will also affect how a person utilizes the observations as evidence. For example, the Earth's apparent lack of motion may be taken as evidence for a geocentric cosmology. However, after sufficient evidence is presented for heliocentric cosmology and the apparent lack of motion is explained, the initial observation is strongly discounted as evidence.
+When rational observers have different background beliefs, they may draw different conclusions from the same scientific evidence. For example, Priestley, working with phlogiston theory, explained his observations about the decomposition of mercuric oxide using phlogiston. In contrast, Lavoisier, developing the theory of elements, explained the same observations with reference to oxygen. A causal relationship between the observations and hypothesis does not exist to cause the observation to be taken as evidence, but rather the causal relationship is provided by the person seeking to establish observations as evidence.
+A more formal method to characterize the effect of background beliefs is Bayesian inference. In Bayesian inference, beliefs are expressed as probabilities indicating one's confidence in them. One starts from an initial probability (a prior), and then updates that probability using Bayes' theorem after observing evidence. As a result, two independent observers of the same event will rationally arrive at different conclusions if their priors (previous observations that are also relevant to the conclusion) differ.
+The importance of background beliefs in the determination of what observations are evidence can be illustrated using deductive reasoning, such as syllogisms. If either of the premises is not accepted as true, the conclusion will not be accepted either.
+
+== Utility of scientific evidence ==
+
+Philosophers, such as Karl R. Popper, have provided influential theories of the scientific method within which scientific evidence plays a central role. In summary, Popper holds that a scientist creatively develops a theory that may be falsified by testing the theory against evidence or known facts. Popper's theory presents an asymmetry in that evidence can prove a theory wrong, by establishing facts that are inconsistent with the theory. In contrast, evidence cannot prove a theory correct because other evidence, yet to be discovered, may exist that is inconsistent with the theory.
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_evidence-1.md b/data/en.wikipedia.org/wiki/Scientific_evidence-1.md
new file mode 100644
index 000000000..d7490fa80
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_evidence-1.md
@@ -0,0 +1,35 @@
+---
+title: "Scientific evidence"
+chunk: 2/2
+source: "https://en.wikipedia.org/wiki/Scientific_evidence"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:26.492214+00:00"
+instance: "kb-cron"
+---
+
+== Philosophical versus scientific views ==
+In the 20th century, many philosophers investigated the logical relationship between evidence statements and hypotheses, whereas scientists tended to focus on how the data used for statistical inference are generated. But according to philosopher Deborah Mayo, by the end of the 20th century philosophers had come to understand that "there are key features of scientific practice that are overlooked or misdescribed by all such logical accounts of evidence, whether hypothetico-deductive, Bayesian, or instantiationist".
+There were a variety of 20th-century philosophical approaches to decide whether an observation may be considered evidence; many of these focused on the relationship between the evidence and the hypothesis. In the 1950s, Rudolf Carnap recommended distinguishing such approaches into three categories: classificatory (whether the evidence confirms the hypothesis), comparative (whether the evidence supports a first hypothesis more than an alternative hypothesis) or quantitative (the degree to which the evidence supports a hypothesis). A 1983 anthology edited by Peter Achinstein provided a concise presentation by prominent philosophers on scientific evidence, including Carl Hempel (on the logic of confirmation), R. B. Braithwaite (on the structure of a scientific system), Norwood Russell Hanson (on the logic of discovery), Nelson Goodman (of grue fame, on a theory of projection), Rudolf Carnap (on the concept of confirming evidence), Wesley C. Salmon (on confirmation and relevance), and Clark Glymour (on relevant evidence). In 1990, William Bechtel provided four factors (clarity of the data, replication by others, consistency with results arrived at by alternative methods, and consistency with plausible theories of mechanisms) that biologists used to settle controversies about procedures and reliability of evidence.
+In 2001, Achinstein published his own book on the subject titled The Book of Evidence, in which, among other topics, he distinguished between four concepts of evidence: epistemic-situation evidence (evidence relative to a given epistemic situation), subjective evidence (considered to be evidence by a particular person at a particular time), veridical evidence (a good reason to believe that a hypothesis is true), and potential evidence (a good reason to believe that a hypothesis is highly probable). Achinstein defined all his concepts of evidence in terms of potential evidence, since any other kind of evidence must at least be potential evidence. He argued that scientists mainly seek veridical evidence but also use the other concepts of evidence, which rely on a distinctive concept of probability, and he contrasted this concept of probability with previous probabilistic theories of evidence such as Bayesian, Carnapian, and frequentist.
+Simplicity is one common philosophical criterion for scientific theories. Based on the philosophical assumption of the strong Church-Turing thesis, a mathematical criterion for evaluation of evidence has been conjectured, with the criterion having a resemblance to the idea of Occam's razor that the simplest comprehensive description of the evidence is most likely correct. It states formally, "The ideal principle states that the prior probability associated with the hypothesis should be given by the algorithmic universal probability, and the sum of the log universal probability of the model plus the log of the probability of the data given the model should be minimized." However, some philosophers (including Richard Boyd, Mario Bunge, John D. Norton, and Elliott Sober) have adopted a skeptical or deflationary view of the role of simplicity in science, arguing in various ways that its importance has been overemphasized.
+Emphasis on hypothesis testing as the essence of science is prevalent among both scientists and philosophers. However, philosophers have noted that testing hypotheses by confronting them with new evidence does not account for all the ways that scientists use evidence. For example, when Geiger and Marsden scattered alpha particles through thin gold foil, the resulting data enabled their experimental adviser, Ernest Rutherford, to very accurately calculate the mass and size of an atomic nucleus for the first time. Rutherford used the data to develop a new atomic model, not only to test an existing hypothesis; such use of evidence to produce new hypotheses is sometimes called abduction (following C. S. Peirce). Social-science methodologist Donald T. Campbell, who emphasized hypothesis testing throughout his career, later increasingly emphasized that the essence of science is "not experimentation per se" but instead the iterative competition of "plausible rival hypotheses", a process that at any given phase may start from evidence or may start from hypothesis. Other scientists and philosophers have emphasized the central role of questions and problems in the use of data and hypotheses.
+
+== Concept of scientific proof ==
+While the phrase "scientific proof" is often used in the popular media, many scientists and philosophers have argued that there is really no such thing as infallible proof. For example, Karl Popper once wrote that "In the empirical sciences, which alone can furnish us with information about the world we live in, proofs do not occur, if we mean by 'proof' an argument which establishes once and for ever the truth of a theory." Albert Einstein said:
+
+The scientific theorist is not to be envied. For Nature, or more precisely experiment, is an inexorable and not very friendly judge of his work. It never says "Yes" to a theory. In the most favorable cases it says "Maybe", and in the great majority of cases simply "No". If an experiment agrees with a theory it means for the latter "Maybe", and if it does not agree it means "No". Probably every theory will someday experience its "No"—most theories, soon after conception.
+However, in contrast to the ideal of infallible proof, in practice theories may be said to be proved according to some standard of proof used in a given inquiry. In this limited sense, proof is the high degree of acceptance of a theory following a process of inquiry and critical evaluation according to the standards of a scientific community.
+
+== See also ==
+
+Anecdotal evidence – Evidence relying on personal testimony
+Evidence-based medicine – Illness diagnosis, treatment and prevention based on data collection and analysis
+Scientific evidence (law) – Person whose opinion is accepted by the judge as an expert
+Science – Systematic endeavour to gain knowledge
+Probabilistic causation
+Probabilistic argumentation
+Probabilistic logic – Applications of logic under uncertainty
+Opinion – Judgement, viewpoint, or statement that is not conclusive
+
+== References ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_formalism-0.md b/data/en.wikipedia.org/wiki/Scientific_formalism-0.md
new file mode 100644
index 000000000..c93b6b379
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_formalism-0.md
@@ -0,0 +1,44 @@
+---
+title: "Scientific formalism"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Scientific_formalism"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:27.673283+00:00"
+instance: "kb-cron"
+---
+
+Scientific formalism is a family of approaches to the presentation of science. It is viewed as an important part of the scientific method, especially in the physical sciences.
+
+
+== Levels of formalism ==
+There are multiple levels of scientific formalism possible. At the lowest level, scientific formalism deals with the symbolic manner in which the information is presented. To achieve formalism in a scientific theory at this level, one starts with a well-defined set of axioms, and from this follows a formal system.
+However, at a higher level, scientific formalism also involves consideration of the axioms themselves. These can be viewed as questions of ontology. For example, one can, at the lower level of formalism, define a property called 'existence'. However, at the higher level, the question of whether an electron exists in the same sense that a bacterium exists still needs to be resolved.
+Some actual formal theories on facts have been proposed.
+
+
+== In modern physics ==
+The scientific climate of the twentieth century revived these questions. From about the time of Isaac Newton to that of James Clerk Maxwell they had been dormant, in the sense that the physical sciences could rely on the status of the real numbers as a description of the continuum, and an agnostic view of atoms and their structure. Quantum mechanics, the dominant physical theory after about 1925, was formulated in a way which raised questions of both types.
+In the Newtonian framework there was indeed a degree of comfort in the answers one could give. Consider for example the question of whether the Earth really goes round the Sun. In a frame of reference adapted to calculating the Earth's orbit, this is a mathematical but also tautological statement. Newtonian mechanics can answer the question of whether it is not equally the case that the Sun goes round the Earth, as it indeed appears to Earth-based astronomers. In Newton's theory there is a basic, fixed frame of reference that is inertial. The 'correct answer' is that the point of view of an observer in an inertial frame of reference is privileged: other observers see artifacts of their acceleration relative to an inertial frame (the inertial forces). Before Newton, Galileo drew out the consequences of the Copernican heliocentric model. He was, however, constrained to call his work (in effect) scientific formalism, under the old description of 'saving the phenomena'. To avoid going against authority, the orbits of the heliocentric model could be labelled as a more convenient device for calculations, rather than an actual description of reality.
+In general relativity, Newton's inertial frames are no longer privileged. In quantum mechanics, Paul Dirac argued that physical models were not there to provide semantic constructs allowing us to understand microscopic physics in language comparable to that we use on the familiar scale of everyday objects. His attitude, adopted by many theoretical physicists, is that a good model is judged by our capacity to use it to calculate physical quantities that can be tested experimentally. Dirac's view is close to what Bas van Fraassen calls constructive empiricism.
+
+
+== Duhem ==
+A physicist who took the issues involved seriously was Pierre Duhem, writing at the beginning of the twentieth century. He wrote an extended analysis of the approach he saw as characteristically British, in requiring field theories of theoretical physics to have a mechanical-physical interpretation. That was an accurate characterisation of what Dirac (himself British) would later argue against. The national characteristics specified by Duhem do not need to be taken too seriously, since he also claimed that the use of abstract algebra, namely quaternions, was also characteristically British (as opposed to French or German); as if the use of classical analysis methods alone was important one way or the other.
+Duhem also wrote on saving the phenomena. In addition to the Copernican Revolution debate over "saving the phenomena" (Greek: σῴζειν τὰ φαινόμενα, sozein ta phainomena) versus offering explanations, Duhem was inspired by Thomas Aquinas, who wrote, regarding eccentrics and epicycles, that
+
+Reason may be employed in two ways to establish a point: firstly, for the purpose of furnishing sufficient proof of some principle [...]. Reason is employed in another way, not as furnishing a sufficient proof of a principle, but as confirming an already established principle, by showing the congruity of its results, as in astronomy the theory of eccentrics and epicycles is considered as established, because thereby the sensible appearances of the heavenly movements can be explained (possunt salvari apparentia sensibilia); not, however, as if this proof were sufficient, forasmuch as some other theory might explain them. [...]
+The idea that a physical interpretation—in common language or classical ideas and physical entities, thought of or examined in an ontological or quasi-ontological sense—of a phenomenon in physics is not an ultimate or necessary condition for its understanding or validity, also appears in modern structural realist views on science.
+
+
+== Bellarmine ==
+Robert Bellarmine wrote to heliocentrist Paolo Antonio Foscarini: Nor is it the same to demonstrate that by assuming the sun to be at the center and the earth in heaven one can save the appearances, and to demonstrate that in truth the sun is at the center and the earth in heaven; for I believe the first demonstration may be available, but I have very great doubts about the second…
+Modern physicist Pierre Duhem "suggests that in one respect, at least, Bellarmine had shown himself a better scientist than Galileo by disallowing the possibility of a 'strict proof of the earth's motion,' on the grounds that an astronomical theory merely 'saves the appearances' without necessarily revealing what 'really happens.'"
+
+
+== See also ==
+Andreas Osiander
+Scientific community metaphor
+
+
+== Notes ==
\ No newline at end of file
diff --git a/data/en.wikipedia.org/wiki/Scientific_imperialism-0.md b/data/en.wikipedia.org/wiki/Scientific_imperialism-0.md
new file mode 100644
index 000000000..dffbc2a72
--- /dev/null
+++ b/data/en.wikipedia.org/wiki/Scientific_imperialism-0.md
@@ -0,0 +1,39 @@
+---
+title: "Scientific imperialism"
+chunk: 1/1
+source: "https://en.wikipedia.org/wiki/Scientific_imperialism"
+category: "reference"
+tags: "science, encyclopedia"
+date_saved: "2026-05-05T03:17:28.907176+00:00"
+instance: "kb-cron"
+---
+
+Scientific imperialism is a term that appears to have been coined by Ellis T. Powell when addressing the Commonwealth Club of Canada on 8 September 1920. Although he defined imperialism as "the sense of arbitrary and capricious domination over the bodies and souls of men," he used the term "scientific imperialism" to mean "the subjection of all the developed and undeveloped powers of the earth to the mind of man."
+In modern usage, however, scientific imperialism refers to situations in which critics perceive science to act imperiously. Philosopher of science John Dupré described it (in his 2001 book Human Nature and the Limits of Science, p. 74) as "the tendency to push a good scientific idea far beyond the domain in which it was originally introduced, and often far beyond the domain in which it can provide much illumination." He wrote that "devotees of these approaches are inclined to claim that they are in possession not just of one useful perspective on human behavior, but of the key that will open doors to the understanding of ever wider areas of human behavior."
+Scientific imperialism has also been charged against "those who believe that the study of politics can and should be modelled on the natural sciences, a position defended most forcibly in the United States, and those who have dissented, viewing this ambition as methodologically unjustified and ethically undesirable."
+
+
+== Critique of power ==
+Writing about scientific exploration by James Cook in the 18th century, the textbook Worlds Together, Worlds Apart by Jeremy Adelman, Elizabeth Pollard, Clifford Rosenberg and Robert Tignor defined scientific imperialism as the "pursuit of power through the pursuit of knowledge." Arthur Peacocke wrote that its later pejorative use may reflect the frustration felt by some with "the limitations of reductive scientism (scientific imperialism)." He also questioned the notion that "successful scientific theories are true or approximately true models of the world," and expressed a desire to "dethrone science from an imperialistic stance over philosophy and theology." Theologian and Christian apologist J. P. Moreland argues that "the myth that science is the model of truth and rationality still grips the mind of much of our popular and scientific culture", stating that "though philosophers of science over the past few decades have gutted many of the claims of this scientific imperialism, many thinkers, knee-jerk agnostics, and even judges persist in the grip of this notion."
+
+
+== "Religion of the intellectuals" ==
+Behavioral psychologist J. E. R. Staddon defined scientific imperialism as "the idea that all decisions, in principle, can be made scientifically" and stated that it had become a "religion of the intellectuals". John Dupré also criticised "a natural tendency, when one has a successful scientific model, to attempt to apply it to as many problems as possible", and described these extended applications as being "dangerous". Such notions have been compared to cultural imperialism, and to a rigid and intolerant form of intellectual monotheism.
+
+
+== Medical research ==
+Medical doctor Peter Wilmshurst has used the term to describe "poor people in developing countries...being exploited in research for the benefit of patients in the developed world", and advised that "the scientific community has a responsibility to ensure that all scientific research is conducted ethically". Others consider that there is a misappropriation of indigenous drugs in poor countries by drug companies in the developed world. Pharmacologist Elaine Elisabetsky wrote that "ethnopharmacology involves a series of sociopolitical, economic and ethical dilemmas, at various levels...frequently host country scientists, visiting scientists, and informants disagree...research efforts are (often) perceived as scientific imperialism; scientists are accused of stealing plant materials and appropriating traditional plant knowledge for financial profit and/or professional advancement. Many governments, as well as indigenous societies are increasingly reluctant to permit such research...historically neither native populations nor host countries have shared to a significant extent the financial benefits from any drug that reaches the market...unless these issues are amply discussed and fairly resolved, medicinal plant research runs the risk of serving ethically questionable purposes."
+
+
+== See also ==
+Antireductionism
+Experimental political science
+Imperialism, the Highest Stage of Capitalism
+Pseudoscience
+Scientific racism
+
+
+== References ==
+
+
+== Further reading ==
\ No newline at end of file