Estimation of a Population Proportion: Large Samples
The population proportion, denoted by \(p\), is the fraction of the population that has a given characteristic. From a random
sample of size \(n\), we compute the sample proportion \(\hat{p}\), which serves as a point estimate of \(p\).
A proportion is converted to a percentage by multiplying it by 100.
Sample proportion and its complement
\[
\begin{aligned}
\hat{p} &= \frac{x}{n}, \\
\hat{q} &= 1 - \hat{p},
\end{aligned}
\]
where \(x\) is the number of “successes” in the sample and \(\hat{q}\) is the sample proportion of “failures”.
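As a minimal sketch of these two formulas, the Python snippet below computes \(\hat{p}\) and \(\hat{q}\) from counts. The function name `sample_proportions` and the sample values (450 successes in 1000 observations) are illustrative assumptions, not data from the text.

```python
def sample_proportions(x: int, n: int) -> tuple[float, float]:
    """Return (p_hat, q_hat) for x successes in a sample of size n."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    p_hat = x / n          # sample proportion of successes
    q_hat = 1 - p_hat      # sample proportion of failures
    return p_hat, q_hat

# Hypothetical sample: 450 successes among 1000 observations.
p_hat, q_hat = sample_proportions(450, 1000)
print(f"p_hat = {p_hat:.3f}, q_hat = {q_hat:.3f}, percentage = {100 * p_hat:.1f}%")
```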
Sampling distribution of \(\hat{p}\) for large samples
When the sample is large, the sampling distribution of \(\hat{p}\) is approximately normal, with mean and standard deviation:
\[
\begin{aligned}
\mu_{\hat{p}} &= p, \\
\sigma_{\hat{p}} &= \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{pq}{n}},
\end{aligned}
\]
where \(q = 1 - p\). Since \(p\) and \(q\) are unknown, we estimate the standard deviation using \(\hat{p}\) and \(\hat{q}\).
Large-sample condition (rule of thumb)
\[
\begin{aligned}
n\hat{p} &> 5 \quad \text{and} \quad n\hat{q} > 5
\end{aligned}
\]
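When both inequalities hold, the sample contains more than five successes and more than five failures, and the normal approximation to the distribution of \(\hat{p}\) is considered adequate.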
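Since \(n\hat{p} = x\) and \(n\hat{q} = n - x\), the condition can be checked directly on the counts, which also avoids floating-point rounding. The sketch below assumes a hypothetical helper named `large_sample_ok`; the threshold of 5 is the rule of thumb stated above.

```python
def large_sample_ok(x: int, n: int, threshold: float = 5) -> bool:
    """Check n*p_hat > threshold and n*q_hat > threshold using the raw counts."""
    successes = x        # equals n * p_hat
    failures = n - x     # equals n * q_hat
    return successes > threshold and failures > threshold

print(large_sample_ok(450, 1000))  # many successes and many failures
print(large_sample_ok(3, 1000))    # only 3 successes, so the condition fails
```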
Estimator of the standard deviation of \(\hat{p}\)
Replacing \(p\) and \(q\) in \(\sigma_{\hat{p}}\) with \(\hat{p}\) and \(\hat{q}\) gives the estimated standard deviation of \(\hat{p}\) (often called its standard error), denoted by \(s_{\hat{p}}\):
\[
\begin{aligned}
s_{\hat{p}} &= \sqrt{\frac{\hat{p}\hat{q}}{n}}.
\end{aligned}
\]
Confidence interval for the population proportion \(p\)
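For a confidence level of \((1-\alpha)100\%\), the two-sided confidence interval for \(p\) is given below, where \(z^{*}\) is the standard normal critical value that leaves an area of \(\alpha/2\) in each tail: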
\[
\begin{aligned}
\text{CI for } p &= \hat{p} \pm z^{*}\, s_{\hat{p}} \\
&= \hat{p} \pm z^{*}\sqrt{\frac{\hat{p}\hat{q}}{n}}.
\end{aligned}
\]
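The critical value \(z^{*}\) is usually read from a standard normal table. As one possible alternative, the sketch below obtains it from the inverse normal CDF in Python's standard library (`statistics.NormalDist.inv_cdf`); the function name `critical_value` is an assumption made for illustration.

```python
from statistics import NormalDist

def critical_value(confidence: float) -> float:
    """Return z* leaving area (1 - confidence) / 2 in each tail of N(0, 1)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1 - confidence
    # The area to the left of z* is 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {critical_value(level):.3f}")
```

For the common confidence levels of 90%, 95%, and 99%, this gives values close to the familiar table entries 1.645, 1.960, and 2.576.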
Margin of error
The amount added to and subtracted from \(\hat{p}\) is the margin of error, denoted by \(E\):
\[
\begin{aligned}
E &= z^{*} s_{\hat{p}} = z^{*}\sqrt{\frac{\hat{p}\hat{q}}{n}}.
\end{aligned}
\]
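As a worked illustration, suppose (hypothetically) that a sample of \(n = 1000\) contains \(x = 450\) successes and that a 95% confidence level is chosen, so that \(z^{*} \approx 1.96\). These numbers are assumed for the example only. Then:
\[
\begin{aligned}
\hat{p} &= \frac{450}{1000} = 0.45, \qquad \hat{q} = 1 - 0.45 = 0.55, \\
s_{\hat{p}} &= \sqrt{\frac{(0.45)(0.55)}{1000}} \approx 0.0157, \\
E &\approx 1.96 \times 0.0157 \approx 0.0308, \\
\hat{p} \pm E &\approx 0.45 \pm 0.0308, \ \text{that is, from about } 0.4192 \text{ to } 0.4808.
\end{aligned}
\]
Expressed as percentages, the interval runs from about 41.9% to 48.1%.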
How to construct the interval
- Identify \(x\) and \(n\), then compute \(\hat{p} = x/n\) and \(\hat{q} = 1-\hat{p}\).
- Check the large-sample condition \(n\hat{p} > 5\) and \(n\hat{q} > 5\).
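- Choose the confidence level and compute \(\alpha = 1 - \text{confidence level}\), with the confidence level written as a decimal (for example, 95% gives \(\alpha = 0.05\)).
- Find \(z^{*}\) from the standard normal distribution so that the area between \(-z^{*}\) and \(z^{*}\) is \(1-\alpha\) (the total tail area is \(\alpha\), so each tail has area \(\alpha/2\)).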
- Compute \(s_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n}\) and \(E = z^{*}s_{\hat{p}}\).
- Report the interval \(\bigl(\hat{p}-E,\ \hat{p}+E\bigr)\), and convert to a percentage if requested.
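The steps above can be sketched as a single Python function. This is a minimal illustration under the large-sample assumptions of this section, not a definitive implementation; the name `proportion_ci` and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Large-sample confidence interval for p from x successes in a sample of n."""
    # Step 1: sample proportions.
    p_hat = x / n
    q_hat = 1 - p_hat
    # Step 2: large-sample rule of thumb, checked on counts (x = n*p_hat, n - x = n*q_hat).
    if not (x > 5 and n - x > 5):
        raise ValueError("large-sample condition not met")
    # Steps 3-4: alpha and the critical value with alpha/2 in each tail.
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 5: estimated standard deviation and margin of error.
    s_p_hat = sqrt(p_hat * q_hat / n)
    margin = z_star * s_p_hat
    # Step 6: the interval (p_hat - E, p_hat + E).
    return p_hat - margin, p_hat + margin

lower, upper = proportion_ci(450, 1000, confidence=0.95)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
print(f"As percentages: {100 * lower:.1f}% to {100 * upper:.1f}%")
```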
Interpretation
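If we repeatedly drew samples of the same size in the same way and built an interval from each, then about \((1-\alpha)100\%\) of those intervals
would contain the true population proportion \(p\). The confidence level therefore describes the long-run performance of the procedure, not the probability that any single computed interval contains \(p\).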
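A small simulation can make this long-run reading concrete. The sketch below assumes a known true proportion (here \(p = 0.3\), chosen arbitrarily), repeatedly draws samples of size \(n = 200\), builds the large-sample interval each time, and reports the share of intervals that contain \(p\). All names and parameter values are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(p: float, n: int, confidence: float, reps: int, seed: int = 0) -> float:
    """Fraction of simulated large-sample intervals that contain the true p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and count the successes.
        x = sum(rng.random() < p for _ in range(n))
        p_hat = x / n
        margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / reps

rate = coverage(p=0.3, n=200, confidence=0.95, reps=2000)
print(f"Share of intervals containing p: {rate:.3f}")
```

The reported share should be near the nominal 95%, although it will not match exactly because the normal approximation is only approximate and the simulation itself is random.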
Effect of confidence level and sample size
The width of the interval is \(2E\). For a given \(\hat{p}\), increasing \(n\) decreases \(s_{\hat{p}}\) and therefore decreases \(E\), producing a narrower interval.
For a given sample, increasing the confidence level increases \(z^{*}\) and therefore increases \(E\), producing a wider interval.
\[
\begin{aligned}
s_{\hat{p}} &= \sqrt{\frac{\hat{p}\hat{q}}{n}} \ \downarrow \ \text{as } n \uparrow,
\qquad
E = z^{*} s_{\hat{p}}.
\end{aligned}
\]
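To see both effects numerically, the sketch below tabulates \(E\) for a few sample sizes and confidence levels while holding \(\hat{p}\) fixed at a hypothetical value of 0.45. The function name `margin_of_error` and the chosen values are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat: float, n: int, confidence: float) -> float:
    """E = z* * sqrt(p_hat * q_hat / n) for the large-sample interval."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical p_hat held fixed at 0.45 while n and the confidence level vary.
for n in (250, 1000, 4000):
    for level in (0.90, 0.95, 0.99):
        e = margin_of_error(0.45, n, level)
        print(f"n = {n:>4}, confidence = {level:.0%}: E = {e:.4f}")
```

Because \(E\) is proportional to \(1/\sqrt{n}\), quadrupling the sample size halves the margin of error when \(\hat{p}\) and the confidence level stay the same.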