---
title: "Design effect"
chunk: 10/12
source: "https://en.wikipedia.org/wiki/Design_effect"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T09:49:56.844427+00:00"
instance: "kb-cron"
---

$\hat{\sigma}_{y}$ estimates the population standard deviation $\sigma_{y}$, and $L$ is the relvariance of the weights, as defined in Kish's formula:

$$L = cv_{w}^{2} = \text{relvar}(w) = \frac{V(w)}{\bar{w}^{2}}.$$
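
As an illustration, $L$ can be computed directly from a vector of weights. The following is a minimal Python sketch, assuming that $V(w)$ is the population (divide-by-$n$) variance; the function names and the sample weights are hypothetical and not part of the original text.

```python
from statistics import fmean, pvariance


def relvar(values):
    """Relvariance: variance divided by the squared mean (the squared CV)."""
    mean = fmean(values)
    # Assumption: V(.) is the population variance (divides by n, not n - 1).
    return pvariance(values, mu=mean) / mean ** 2


def kish_deff(weights):
    """Kish's design effect for unequal weights: 1 + L, with L = relvar(w)."""
    return 1.0 + relvar(weights)


weights = [1.0, 1.0, 2.0, 4.0]  # hypothetical survey weights
print(relvar(weights))     # L
print(kish_deff(weights))  # 1 + L
```

With equal weights the relvariance is zero, so `kish_deff` returns 1.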

This assumes that the regression model fits well, so that the probability of selection and the residuals are independent. Independence in turn implies that the residuals, and the squared residuals, are uncorrelated with the weights, i.e., that $\rho_{\epsilon,W}=0$ and also $\rho_{\epsilon^{2},W}=0$.

When the population size ($N$) is very large, the formula can be written as:

$$\text{Deff}_{Spencer} = (1-\hat{\rho}_{y,P}^{2})(1+cv_{w}^{2}) + \frac{1}{cv_{Y}^{2}}\, cv_{w}^{2}$$

(since $\alpha = \bar{Y} - \beta \times \bar{P} = \bar{Y} - \beta \times \frac{1}{N} \approx \bar{Y}$, where $cv_{Y}^{2} = \frac{\sigma_{Y}^{2}}{\bar{Y}^{2}}$, so that $\left(\frac{\alpha}{\sigma_{Y}}\right)^{2} \approx \frac{\bar{Y}^{2}}{\sigma_{Y}^{2}} = \frac{1}{cv_{Y}^{2}}$).
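
The large-$N$ approximation can be sketched as a small function. This is a minimal Python sketch rather than a reference implementation: the function and argument names are hypothetical, the input values are made up, and the last term is taken to be $(\bar{Y}/\sigma_{Y})^{2}\,cv_{w}^{2} = cv_{w}^{2}/cv_{Y}^{2}$, which follows from $\alpha \approx \bar{Y}$.

```python
def spencer_deff_large_n(rho_yp, cv2_w, cv2_y):
    """Large-N approximation of Spencer's design effect for an estimated total.

    rho_yp: estimated correlation between y and the selection probabilities P
    cv2_w:  relvariance of the weights (L)
    cv2_y:  relvariance of y, i.e. sigma_Y^2 / Ybar^2
    """
    if cv2_y <= 0:
        raise ValueError("cv2_y must be positive")
    # Assumption: the last term is (Ybar / sigma_Y)^2 * cv_w^2 = cv_w^2 / cv_Y^2.
    return (1.0 - rho_yp ** 2) * (1.0 + cv2_w) + cv2_w / cv2_y


# Hypothetical inputs, for illustration only.
print(spencer_deff_large_n(rho_yp=0.5, cv2_w=0.375, cv2_y=0.75))
```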

This approximation assumes that the linear relationship between $P$ and $y$ holds, and also that the correlations of the weights with the errors, and with the squared errors, are both zero, i.e., $\rho_{w,e}=0$ and $\rho_{w,e^{2}}=0$.

We notice that if $\hat{\rho}_{y,P} \approx 0$, then $\hat{\alpha} \approx \bar{y}$ (i.e., the average of $y$). In such a case, the formula reduces to

$$\text{Deff}_{Spencer} = (1+L) + \frac{1}{\text{relvar}(y)}\, L$$

Only if the variance of $y$ is much larger than its squared mean is the right-most term close to 0 (i.e., $\frac{1}{\text{relvar}(y)} = \frac{\bar{Y}^{2}}{\sigma_{y}^{2}} \approx 0$), which reduces Spencer's design effect (for the estimated total) to Kish's design effect (for the ratio mean):

$$\text{Deff}_{Spencer} \approx (1+L) = \text{Deff}_{\text{Kish}}.$$

Otherwise, the two formulas will yield different results, which demonstrates the difference between the design effect of the total and the design effect of the mean.

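
As a small numerical illustration with hypothetical values, take $\hat{\rho}_{y,P} = 0$ and $L = 0.375$, and take the right-most term to be $L/\text{relvar}(y)$, i.e., $(\bar{Y}/\sigma_{y})^{2} L$. Then:

$$
\begin{aligned}
\text{relvar}(y) = 0.25:&\quad \text{Deff}_{Spencer} = 1.375 + \frac{0.375}{0.25} = 2.875,\\
\text{relvar}(y) = 100:&\quad \text{Deff}_{Spencer} = 1.375 + \frac{0.375}{100} = 1.37875 \approx 1.375 = \text{Deff}_{\text{Kish}}.
\end{aligned}
$$

In the first case the two design effects differ substantially; in the second, the variance of $y$ dominates its squared mean and the two nearly coincide.
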
===== Park and Lee's Deff for estimated ratio-mean =====

In 2001, Park and Lee extended Spencer's formula to the case of the ratio-mean (i.e., estimating the mean by dividing the estimator of the total by the estimator of the population size). It is:

$$\text{Deff}_{Park\&Lee} = (1-\hat{\rho}_{y,P}^{2})(1+cv_{w}^{2}) + \frac{\hat{\rho}_{y,P}^{2}}{cv_{P}^{2}}\, cv_{w}^{2}$$

where $cv_{P}^{2}$ is the (estimated) squared coefficient of variation of the probabilities of selection.

Park and Lee's formula is exactly equal to Kish's formula when $\hat{\rho}_{y,P}^{2}=0$. Both formulas relate to the design effect of the mean of $y$, while Spencer's $\text{Deff}$ relates to the estimation of the population total.
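
Park and Lee's formula, and its reduction to Kish's $1 + cv_{w}^{2}$ when the correlation is zero, can be checked numerically. The sketch below is illustrative only: the function and argument names are hypothetical and the inputs are made-up values.

```python
def park_lee_deff(rho_yp, cv2_w, cv2_p):
    """Park and Lee's design effect for the ratio-mean.

    rho_yp: estimated correlation between y and the selection probabilities P
    cv2_w:  squared coefficient of variation of the weights
    cv2_p:  squared coefficient of variation of the selection probabilities
    """
    if cv2_p <= 0:
        raise ValueError("cv2_p must be positive")
    rho2 = rho_yp ** 2
    return (1.0 - rho2) * (1.0 + cv2_w) + (rho2 / cv2_p) * cv2_w


cv2_w = 0.375  # hypothetical value
print(park_lee_deff(0.0, cv2_w, cv2_p=0.5))  # rho = 0: equals 1 + cv2_w (Kish)
print(park_lee_deff(0.5, cv2_w, cv2_p=0.5))  # rho != 0: differs from Kish
```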

In general, the $\text{Deff}$ for the total ($\hat{Y}$) tends to be larger, i.e., to indicate a less efficient estimator, than the $\text{Deff}$ for the ratio mean ($\hat{\bar{Y}}$) when $\rho_{y,P}$ is small. And in general, $\rho_{y,P}$ impacts the efficiency of both design effects.

=== Cluster sampling ===

For data collected using cluster sampling we assume the following structure: $n_{k}$ observations in cluster $k$, for each of $K$ clusters, with a total of $n=\sum n_{k}$ observations.

The observations have a block diagonal correlation matrix in which every pair of observations from the same cluster is correlated with an intra-class correlation of $\rho$, while every pair from different clusters is uncorrelated. I.e., for every pair of observations $i$ and $j$, if they belong to the same cluster $k$, we get $cov(y_{i},y_{j})=\rho \sigma^{2}$. Two items from two different clusters are not correlated, i.e., $cov(y_{i},y_{j})=0$.

An element from any cluster is assumed to have the same variance: $\text{var}(y_{i})=\sigma_{h}^{2}=\sigma^{2}$.
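
The assumed covariance structure can be written out explicitly. The following is a minimal Python sketch, assuming observations are ordered by cluster; the function name, argument names, and example values are hypothetical.

```python
def cluster_covariance(cluster_sizes, rho, sigma2=1.0):
    """Covariance matrix (list of lists) for observations ordered by cluster."""
    # Label each observation with the index of the cluster it belongs to.
    labels = [k for k, size in enumerate(cluster_sizes) for _ in range(size)]
    n = len(labels)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cov[i][j] = sigma2        # common variance on the diagonal
            elif labels[i] == labels[j]:
                cov[i][j] = rho * sigma2  # same cluster: rho * sigma^2
            # different clusters: the entry stays 0.0
    return cov


# Two clusters of sizes 2 and 3 (hypothetical values).
for row in cluster_covariance([2, 3], rho=0.2, sigma2=4.0):
    print(row)
```

The printed matrix has one 2 x 2 block and one 3 x 3 block on the diagonal, with zeros elsewhere.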

When clusters are all of the same size $n^{*}$, the design effect $\text{Deff}$, proposed by Kish in 1965 (and later revisited by others), is given by:

$$\text{Deff}=1+(n^{*}-1)\rho.$$

It is sometimes also denoted as $\text{Deff}_{C}$.
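
For equal-sized clusters, the formula can be checked against the assumed correlation structure by summing all pairwise correlations of the unweighted sample mean. This is a minimal sketch, assuming the comparison variance is $\sigma^{2}/n$ (simple random sampling with no finite population correction); the function names and the example values are hypothetical.

```python
def kish_cluster_deff(n_star, rho):
    """Kish's design effect for equal-sized clusters: 1 + (n* - 1) * rho."""
    return 1.0 + (n_star - 1) * rho


def deff_by_pair_sum(n_clusters, n_star, rho):
    """Var(unweighted mean) under the block structure, relative to sigma^2 / n.

    Assumption: the reference variance is sigma^2 / n, so sigma^2 cancels out.
    """
    n = n_clusters * n_star
    total = 0.0  # sum of all entries of the n x n correlation matrix
    for i in range(n):
        for j in range(n):
            if i == j:
                total += 1.0
            elif i // n_star == j // n_star:  # i and j share a cluster
                total += rho
    # Var(mean) = sigma^2 * total / n^2; dividing by sigma^2 / n gives total / n.
    return total / n


print(kish_cluster_deff(10, 0.05))
print(deff_by_pair_sum(n_clusters=4, n_star=10, rho=0.05))
```

Both calls give the same value up to floating-point rounding, and the result does not depend on the number of clusters.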

In various papers, when cluster sizes are not equal, the above formula is also used with $n^{*}$ as the average cluster size (which is also sometimes denoted as $\bar{b}$). In such cases, Kish's formula (using the average cluster size) serves as a conservative estimate (an upper bound) of the exact design effect.

Alternative formulas exist for unequal cluster sizes. Follow-up work has discussed the sensitivity of using the average cluster size under various assumptions.