kb/data/en.wikipedia.org/wiki/Sampling_(statistics)-6.md

title: Sampling (statistics)
chunk: 7/8
source: https://en.wikipedia.org/wiki/Sampling_(statistics)
category: reference
tags: science, encyclopedia
date_saved: 2026-05-05T03:45:25.903358+00:00
instance: kb-cron

Sampling schemes may be without replacement ('WOR': no element can be selected more than once in the same sample) or with replacement ('WR': an element may appear multiple times in the same sample). For example, if we catch fish, measure them, and immediately return them to the water before continuing with the sample, this is a WR design, because we might end up catching and measuring the same fish more than once. However, if we do not return the fish to the water, or if we tag each fish before releasing it so that it is not measured again, this becomes a WOR design.
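The difference between the two designs can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `fish` list, the seed, and the helper names are hypothetical, and only the standard-library `random` module is used.

```python
import random

def sample_with_replacement(population, k, rng):
    """WR: every draw is made from the full population, so repeats are possible."""
    return [rng.choice(population) for _ in range(k)]

def sample_without_replacement(population, k, rng):
    """WOR: no element can be selected more than once (requires k <= len(population))."""
    return rng.sample(population, k)

rng = random.Random(42)                  # arbitrary seed, only for repeatability
fish = [f"fish_{i}" for i in range(10)]  # hypothetical pond of 10 fish

wr = sample_with_replacement(fish, 15, rng)    # k may exceed the population size
wor = sample_without_replacement(fish, 5, rng)

print("WR :", wr)
print("WOR:", wor)
```

Because the WR sample draws 15 times from only 10 fish, at least one fish must appear more than once, while the WOR sample can never contain a repeat.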

== Sample size determination ==

Formulas, tables, and power function charts are well-known approaches to determining sample size. Steps for using sample size tables:

1. Postulate the effect size of interest, α, and β.
2. Check the sample size table:
   1. Select the table corresponding to the selected α.
   2. Locate the row corresponding to the desired power (1 − β).
   3. Locate the column corresponding to the estimated effect size.
   4. The intersection of the column and row is the minimum sample size required.
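As an example of the formula-based approach, the sketch below computes a per-group sample size from the same three inputs (effect size, α, and power). It assumes a specific setting that the text does not state: a two-sided comparison of two group means, a standardized effect size, and the normal approximation n = 2((z₁₋α/₂ + z₁₋β)/d)². The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation (an assumption)."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_beta = z.inv_cdf(power)           # quantile for the desired power (1 - beta)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))  # medium standardized effect, alpha = 0.05, power = 0.80 -> 63
```

Published tables based on the t distribution may give slightly larger values than this approximation.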

== Sampling and data collection ==

Good data collection involves:

* Following the defined sampling process
* Keeping the data in time order
* Noting comments and other contextual events
* Recording non-responses

== Applications of sampling ==

Sampling enables the selection of the right data points from within a larger data set to estimate the characteristics of the whole population. For example, about 600 million tweets are produced every day. It is not necessary to look at all of them to determine the topics that are discussed during the day, nor is it necessary to look at all the tweets to determine the sentiment on each of the topics. A theoretical formulation for sampling Twitter data has been developed.

In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage, and controller data are available at short time intervals. To predict downtime, it may not be necessary to look at all the data; a sample may be sufficient.
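One way to draw a fixed-size uniform sample from a stream too large to store, such as a day of tweets or a sensor feed, is reservoir sampling (Algorithm R). This technique is not named in the text above and is shown only as a sketch; the stream of integers stands in for real records.

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # inclusive bounds: 0 <= j <= i
            if j < k:
                reservoir[j] = item  # replace a slot with probability k / (i + 1)
    return reservoir

rng = random.Random(7)       # arbitrary seed, only for repeatability
stream = range(1_000_000)    # hypothetical stand-in for a large data stream
sample = reservoir_sample(stream, 100, rng)
print(len(sample))
```

The stream is read once and only `k` items are held in memory at any time.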

== Errors in sample surveys ==

Survey results are typically subject to some error. Total errors can be classified into sampling errors and non-sampling errors. The term "error" here includes systematic biases as well as random errors.

=== Sampling errors and biases ===

Sampling errors and biases are induced by the sample design. They include:

* Selection bias: When the true selection probabilities differ from those assumed in calculating the results.
* Random sampling error: Random variation in the results due to the elements in the sample being selected at random.
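Random sampling error can be made visible with a small simulation: repeated samples from the same population give different estimates, and the spread of those estimates shrinks as the sample size grows. The population below is synthetic, and the sizes, seed, and function name are arbitrary choices made for illustration.

```python
import random
import statistics

def sample_means(population, n, repeats, rng):
    """Mean of each of `repeats` simple random samples (WOR) of size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

rng = random.Random(0)                                   # arbitrary seed
population = [rng.gauss(50, 10) for _ in range(10_000)]  # synthetic population

# Spread of the estimate across repeated samples, for two sample sizes.
spread_small = statistics.stdev(sample_means(population, 25, 500, rng))
spread_large = statistics.stdev(sample_means(population, 400, 500, rng))

print(f"n=25:  spread of sample means = {spread_small:.2f}")
print(f"n=400: spread of sample means = {spread_large:.2f}")
```

Selection bias, by contrast, does not shrink with sample size; it has to be addressed through the design or through weighting.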

=== Non-sampling error ===

Non-sampling errors are other errors which can impact final survey estimates, caused by problems in data collection, processing, or sample design. Such errors may include:

* Over-coverage: inclusion of data from outside of the population
* Under-coverage: sampling frame does not include elements in the population
* Measurement error: e.g., when respondents misunderstand a question, or find it difficult to answer
* Processing error: mistakes in data coding
* Non-response or participation bias: failure to obtain complete data from all selected individuals

After sampling, a review is held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis. A particular problem involves non-response. Two major types of non-response exist:

* Unit non-response (lack of completion of any part of the survey)
* Item non-response (submission or participation in the survey but failing to complete one or more components or questions of the survey)

In survey sampling, many of the individuals identified as part of the sample may be unwilling to participate, may not have the time to participate (opportunity cost), or may be impossible for survey administrators to contact. In this case, there is a risk of differences between respondents and non-respondents, leading to biased estimates of population parameters. This is often addressed by improving survey design, offering incentives, and conducting follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame. The effects can also be mitigated by weighting the data (when population benchmarks are available) or by imputing data based on answers to other questions.

Non-response is particularly a problem in internet sampling. Reasons for this problem may include improperly designed surveys, over-surveying (or survey fatigue), and the fact that potential participants may have multiple e-mail addresses, some of which they no longer use or do not check regularly.
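Weighting against population benchmarks can be sketched as follows. The example assumes that the population share of each group is known and that every group has at least one respondent; the groups, values, shares, and the function name `benchmark_weighted_mean` are all hypothetical.

```python
def benchmark_weighted_mean(respondents, population_shares):
    """Weight each group's respondent mean by its known population share.

    respondents: list of (group, value) pairs from those who answered.
    population_shares: {group: share of the population}, summing to 1.
    Raises KeyError if a group in the benchmarks has no respondents.
    """
    by_group = {}
    for group, value in respondents:
        by_group.setdefault(group, []).append(value)
    return sum(
        share * (sum(by_group[group]) / len(by_group[group]))
        for group, share in population_shares.items()
    )

# Hypothetical survey in which younger people responded less often.
respondents = [("young", 2.0), ("young", 4.0),
               ("old", 8.0), ("old", 10.0), ("old", 6.0), ("old", 8.0)]
shares = {"young": 0.5, "old": 0.5}  # assumed population benchmarks

unweighted = sum(v for _, v in respondents) / len(respondents)
weighted = benchmark_weighted_mean(respondents, shares)
print(round(unweighted, 2), weighted)  # 6.33 5.5
```

The adjustment only removes bias to the extent that respondents and non-respondents within each group are similar.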

== Survey weights ==

In many situations, the sampling fraction may be varied by stratum, and the data will have to be weighted to correctly represent the population. For example, a simple random sample of individuals in the United Kingdom might not include some in remote Scottish islands who would be inordinately expensive to sample. A cheaper method would be to use a stratified sample with urban and rural strata. The rural stratum could be under-represented in the sample, but weighted up appropriately in the analysis to compensate.

More generally, data should usually be weighted if the sample design does not give each individual an equal chance of being selected. For instance, when households have equal selection probabilities but one person is interviewed from within each household, this gives people from large households a smaller chance of being interviewed. This can be accounted for using survey weights. Similarly, households with more than one telephone line have a greater chance of being selected in a random digit dialing sample, and weights can adjust for this.

Weights can also serve other purposes, such as helping to correct for non-response.
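The household example can be worked through numerically. If households are selected with equal probability and one person is interviewed per household, a person's chance of selection is proportional to 1 / household size, so a weight proportional to household size compensates. The interview data and the helper name below are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean, with weights proportional to 1 / selection probability."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical interviews: (household size, answer of the one person interviewed).
interviews = [(1, 10.0), (2, 20.0), (4, 30.0), (1, 10.0)]

values = [answer for _, answer in interviews]
weights = [size for size, _ in interviews]  # person-level weight = household size

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)  # 17.5 22.5
```

The unweighted mean under-represents people in large households; the weighted mean restores their share.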

== Methods of producing random samples ==

* Random number table
* Mathematical algorithms for pseudo-random number generators
* Physical randomization devices such as coins, playing cards or sophisticated devices such as ERNIE
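As an illustration of the pseudo-random approach, the sketch below draws a simple random sample from a sampling frame using Python's standard `random` module, which is based on the Mersenne Twister generator. The frame of unit IDs, the seed, and the function name are hypothetical; fixing the seed makes the draw reproducible for later review.

```python
import random

def draw_sample(frame, n, seed):
    """Draw n distinct units from the frame using a seeded pseudo-random generator."""
    rng = random.Random(seed)  # same seed -> same sequence of pseudo-random numbers
    return rng.sample(frame, n)

frame = list(range(1, 1001))  # hypothetical sampling frame of 1,000 unit IDs
first = draw_sample(frame, 10, seed=2024)
second = draw_sample(frame, 10, seed=2024)

print(first)
print(first == second)  # True: the draw can be reproduced from the seed
```

A pseudo-random generator of this kind is suited to sample selection, not to security-sensitive uses.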

== Notes ==

The textbook by Groves et alia provides an overview of survey methodology, including recent literature on questionnaire development (informed by cognitive psychology):