## Research concepts: Confidence interval of a proportion

Data that fall into categories with only two possible values are known as binary data. Examples include “Yes” or “No” survey responses, dead or alive, and male or female. These data can be expressed as proportions (e.g. the proportion of male students in a class), are known as *binomial variables*, and follow a *binomial distribution*. The binomial distribution is an example of a *mathematical distribution* used to describe data.

### Simulate binary data and calculate a 95% CI

Let’s simulate some data using the binomial distribution and see how a confidence interval (CI) calculated from a sample of the population is used to make inferences about the population.

Imagine that a jar contains 25 red balls and 75 black balls (total = 100 balls) mixed together well. The jar contains the whole population of balls, so the proportion of red balls in the population is 25/100 = 0.25.

Next, imagine that we want 20 students to work out the proportion of red balls in the population without looking at the jar. To do so, each student randomly picks one ball, notes the colour, then replaces it in the jar and mixes the balls well. Each student repeats this 15 times. That is, each student *samples* 15 balls from the jar *with replacement*. At the end, each student calculates the proportion of red balls in the sample of 15 balls. Lastly, we can calculate the 95% CI of the proportion of red balls picked out by each student (using the modified Wald method; p39, Motulsky 2018). Figure 1 shows the proportion of red balls, and the 95% CI about the proportion, for each of the 20 simulated students.
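The sampling procedure above can be sketched in Python. This is a minimal sketch, not the code used to produce Figure 1: the function name `modified_wald_ci`, the random seed, and the use of the shortcut form of the modified Wald method (add 2 successes and 2 failures, then use a multiplier of 2) are assumptions made for illustration, so the simulated values will not match the figure.

```python
import math
import random


def modified_wald_ci(successes, n):
    """Approximate 95% CI of a proportion (modified Wald, shortcut form).

    Assumption: add 2 successes and 2 failures, then use a multiplier of 2
    in place of 1.96. Bounds are clipped to the range 0 to 1.
    """
    p_adj = (successes + 2) / (n + 4)
    half_width = 2 * math.sqrt(p_adj * (1 - p_adj) / (n + 4))
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


POPULATION_PROPORTION = 0.25  # 25 red balls out of 100
N_STUDENTS = 20
N_DRAWS = 15

rng = random.Random(1)  # arbitrary seed so the run is repeatable

results = []
for student in range(1, N_STUDENTS + 1):
    # Sampling with replacement: every draw has the same 0.25 chance of red.
    n_red = sum(rng.random() < POPULATION_PROPORTION for _ in range(N_DRAWS))
    lower, upper = modified_wald_ci(n_red, N_DRAWS)
    results.append((student, n_red / N_DRAWS, lower, upper))

for student, proportion, lower, upper in results:
    print(f"student {student:2d}: proportion = {proportion:.2f}, "
          f"95% CI = {lower:.2f} to {upper:.2f}")
```

Each printed row plays the role of one error bar in the figure: a sample proportion with its own 95% CI.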

Look at the values for student 1: the proportion of red balls in this student’s sample was 0.27, and the error bars show that the 95% CI of this proportion ranges from 0.10 to 0.53. What does this mean? We know for a fact that the proportion of red balls in this sample = 0.27. Why is this sample’s proportion different to the population proportion of 0.25? From the previous post, we learned that this type of study-to-study (or student-to-student) variability can be caused by sampling error or bias. Here the balls are drawn at random, so the difference reflects sampling error.
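As a check on the numbers for student 1, a sample proportion of 0.27 from 15 draws corresponds to 4 red balls (4/15 ≈ 0.27). The sketch below works through the interval step by step, assuming the shortcut form of the modified Wald method (add 2 successes and 2 failures, then use a multiplier of 2). The variable names are illustrative.

```python
import math

n_red, n_draws = 4, 15  # inferred from the text: 4/15 rounds to 0.27

sample_proportion = n_red / n_draws

# Modified Wald adjustment (assumed shortcut form):
# add 2 successes and 2 failures before computing the proportion.
p_adj = (n_red + 2) / (n_draws + 4)

# Half-width of the interval, using a multiplier of 2 in place of 1.96.
half_width = 2 * math.sqrt(p_adj * (1 - p_adj) / (n_draws + 4))

lower = p_adj - half_width
upper = p_adj + half_width

print(f"sample proportion = {sample_proportion:.2f}")
print(f"95% CI = {lower:.2f} to {upper:.2f}")
```

Under these assumptions the result rounds to 0.10 to 0.53, matching the interval reported for student 1. Note that the interval is centred on the adjusted proportion (6/19 ≈ 0.32), not on the sample proportion, which is why it is not symmetric about 0.27.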

Where does the true population proportion lie in relation to this sample’s 95% CI? We see that the 95% CI of this sample’s proportion *captures the true population proportion* since the dashed line falls within the width of the CI. Looking at the CIs of the sample proportions from all 20 students, we see that 1 out of 20 CIs in our case does not include the population proportion, which is what we would expect on average *in the long run*.

This example uses simulated data where we fixed the population proportion. In real life, there is no way to know the population proportion; the best you can do is *estimate* it. If 95% CIs from many samples were calculated, we would expect them to include the population value in ~95% of the samples, and exclude the population value in ~5% of them. This is the best way to think about the meaning of “95%”; i.e. 95% of *many confidence intervals* from many experiments will contain the population value. Loosely speaking, we can also say that “there is a 95% chance that the interval computed in a particular experiment includes the population value”.
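The long-run meaning of “95%” can be checked by repeating the experiment many more than 20 times and counting how often the interval contains the population proportion. The following is a minimal sketch under stated assumptions: the number of repeats (10,000), the seed, the function name `modified_wald_ci`, and the shortcut form of the modified Wald method are all choices made for illustration.

```python
import math
import random


def modified_wald_ci(successes, n):
    """Approximate 95% CI of a proportion (modified Wald, shortcut form).

    Assumption: add 2 successes and 2 failures, multiplier of 2.
    """
    p_adj = (successes + 2) / (n + 4)
    half_width = 2 * math.sqrt(p_adj * (1 - p_adj) / (n + 4))
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


POPULATION_PROPORTION = 0.25
N_DRAWS = 15
N_EXPERIMENTS = 10_000  # many repeats stand in for "the long run"

rng = random.Random(2)  # arbitrary seed so the run is repeatable

n_captured = 0
for _ in range(N_EXPERIMENTS):
    # One experiment: 15 draws with replacement, then a 95% CI.
    n_red = sum(rng.random() < POPULATION_PROPORTION for _ in range(N_DRAWS))
    lower, upper = modified_wald_ci(n_red, N_DRAWS)
    if lower <= POPULATION_PROPORTION <= upper:
        n_captured += 1

coverage = n_captured / N_EXPERIMENTS
print(f"{coverage:.1%} of {N_EXPERIMENTS} intervals contain the population proportion")
```

Because the modified Wald method is an approximation and each sample is small, the observed percentage should be close to 95% rather than exactly 95%.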

### Interpreting the 95% CI

Taken together, we interpret the data for student 1 as follows: the proportion of red balls in the sample was 0.27, and we are 95% confident that the interval from 0.10 to 0.53 contains the population proportion. The simulated data in Figure 1 show that in the long run, 1 out of 20 of the 95% CIs (i.e. 5% of the samples) will exclude the population value.

The width of the confidence interval shows how *precise* our estimate of the population proportion is; based on student 1’s sample, the population proportion could plausibly be as small as 0.10 or as large as 0.53.

### Important assumptions

For this example, the interpretation of the CI depends on the following assumptions:

**The sample is representative of the population.** That is, we assume that samples are drawn from a larger population of data about which we want to generalise or make inferences.

**Observations are statistically independent.** That is, one data point provides no information about another data point. In our simulation, the colour of the ball drawn in one trial does not influence the colour drawn in subsequent trials, whether by the same student or by other students.

**Data are accurate.** Data need to be tabulated accurately. Here, ball colours must be counted correctly.

### Summary

We simulated data using the binomial distribution to show proportions and their 95% CIs over repeated experiments. We interpreted the proportions and learned how 95% CIs are used to estimate the population proportion.

Interpretation of the CI assumes that samples accurately represent the population about which we want to make inferences, observations are independent, and data are accurately quantified.

### Reference

Motulsky H (2018) Intuitive Biostatistics. A Nonmathematical Guide to Statistical Thinking. 4th Ed. Oxford University Press: Oxford, UK.