Calculating sample size for a 2 independent sample t-test

Scientists often plan studies by calculating how many subjects or units need to be tested in order to detect an effect. That is, they plan a study using statistical power, following the principles of hypothesis testing. Sample size calculations are usually required in ethics applications and grant proposals to justify the study.
We previously learned how to calculate sample size for a 2 independent sample t-test in R. If you do most of your work in Python, you could instead use the statsmodels package to perform the same calculation. statsmodels is a Python module that provides functionality for conducting many statistical tests and analyses. It has been tested against R and other statistical packages, and implements R-style formulas with pandas dataframes or numpy arrays to fit models.
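The formula interface is not needed for the power calculation below, but as a quick illustration, here is a minimal sketch fitting an ordinary least squares model from a pandas dataframe; the column names dose and hours are made up for this example:

import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: 'dose' and 'hours' are hypothetical column names
df = pd.DataFrame({'dose': [1, 2, 3, 4, 5], 'hours': [2.1, 3.9, 6.2, 8.1, 9.8]})
model = smf.ols('hours ~ dose', data=df).fit()  # R-style formula with a pandas dataframe
print(model.params)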
Calculating sample size for a 2 independent sample t-test in Python requires specifying similar parameters to performing the calculation in R, but there are some differences. Here’s how to do it in statsmodels (output shown after the >>> prompt; see the statsmodels documentation for details):
from statsmodels.stats.power import tt_ind_solve_power

# Standardised effect size: mean difference divided by the SD of the difference
mean_diff, sd_diff = 0.5, 0.5
std_effect_size = mean_diff / sd_diff

# Leave nobs1 unspecified so the function solves for the sample size per group
n = tt_ind_solve_power(effect_size=std_effect_size, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')
print('Number in *each* group: {:.5f}'.format(n))
>>> Number in *each* group: 16.71472
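The function returns a fractional sample size, and you cannot recruit a fraction of a subject, so round up to the next whole number. A minimal sketch:

import math
print('Recruit {} subjects per group'.format(math.ceil(n)))
>>> Recruit 17 subjects per group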
The tt_ind_solve_power() function requires the following parameters to calculate sample size:
- effect_size: The standardised effect size, i.e. the difference between the two means divided by the standard deviation; this value has to be positive. (This is different to R's delta parameter, which requires the mean difference only.)
- alpha: Significance level, or the probability of a Type I error (a false positive), usually set at 0.05.
- power: Power of the test, or 1 minus the probability of a Type II error (a false negative), usually set at 0.8.
- ratio: Ratio of the sample size in sample 2 relative to sample 1, default set at 1. (This means the function can also be used to calculate power for unevenly sized samples.)
- alternative: Whether to power the test to detect a two-sided effect ('two-sided', the default; e.g. the effect could be an increase or a reduction in the outcome) or a one-sided effect ('larger' or 'smaller').
In the first code block, we specified the difference between the two means and the standard deviation of the difference as 0.5 each, producing a standardised effect size of 1. This means we are calculating sample size (or powering the study) to detect quite a big effect! Performing the sample size calculation in Python gives the same answer, to 4 decimal places, as the output from R.
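tt_ind_solve_power() can solve for any one of its parameters by leaving that parameter unspecified; above, omitting nobs1 solved for the sample size per group. As a sketch, fixing the sample size at 17 per group (the rounded-up value from above) and setting power=None instead solves for the achieved power:

achieved_power = tt_ind_solve_power(effect_size=std_effect_size, nobs1=17, alpha=0.05, power=None, ratio=1, alternative='two-sided')
print('Power with 17 per group: {:.3f}'.format(achieved_power))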
It is easy to see that changing the standardised effect size we want to detect will change the sample size. For example, for a fixed mean difference of 0.5, the sample size increases as the standard deviation of the difference increases:
# Sample size grows as the SD of the difference grows (the effect size shrinks)
for sd in [0.4, 0.5, 0.6]:
    n = tt_ind_solve_power(effect_size=mean_diff/sd, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')
    print('Number in *each* group when SD is {:<4.1f}: {:.2f}'.format(sd, n))
>>> Number in *each* group when SD is 0.4 : 11.09
>>> Number in *each* group when SD is 0.5 : 16.71
>>> Number in *each* group when SD is 0.6 : 23.60
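Sample size is also sensitive to how much Type II error we are willing to tolerate. As a sketch, holding the standardised effect size at 1 and raising power from 0.8 to 0.9 increases the required sample size:

for pwr in [0.8, 0.9]:
    n = tt_ind_solve_power(effect_size=1.0, alpha=0.05, power=pwr, ratio=1, alternative='two-sided')
    print('Number in *each* group at power {:.1f}: {:.2f}'.format(pwr, n))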
Summary
We used Python’s statsmodels module to calculate sample size for a 2 independent sample t-test. Sample size is sensitive to the size and variability of the difference between groups, and to our tolerance for Type I and Type II errors.
Hi, I have a question about estimating the sample size required for an experiment using tt_ind_solve_power.
As the post explains, in order to obtain the sample size nobs1, we need to specify: effect size, alpha, power, ratio and alternative.
Naturally, we need to compute effect size = mean_diff / sd_diff. It is easy to estimate the mean_diff we want to detect in an experiment. My question is: how do we compute sd_diff? I think sd_diff is proportional to the sample size, which would require me to know the sample size of each group in advance. However, the sample size of each group is exactly what I want to estimate using tt_ind_solve_power. I feel I have got myself into a loop. Does this mean I should not be using this function to estimate the sample size when designing an experiment?
Hi Xiaochen,
You raise a good question. The standard deviation of the effect (sd_diff) used to power a study should be based on variability reported in the published literature, but this is often unknown, so researchers often make an educated guess at this variability. An easy way to think about how the effect relates to its variability is to treat it as a ratio and power the study to detect a large effect (mean_diff / sd_diff = 1). So, sample size calculations often involve some guesstimating, which is okay provided that samples are “large enough”. What “large” means partly depends on the field, but more is always better.
At the end of the study, plot the individual data so readers can see the variability. Also calculate the effect and its precision (e.g. mean difference and 95% CI) so readers know how large the effect is and how precisely it was estimated.
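For that last point, statsmodels can compute the confidence interval of a mean difference directly. A minimal sketch, where group1 and group2 are hypothetical arrays of measured outcomes:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans

# Hypothetical data for illustration only
group1 = np.array([4.1, 5.3, 6.0, 5.5, 4.8])
group2 = np.array([5.2, 6.1, 6.8, 5.9, 6.5])

cm = CompareMeans(DescrStatsW(group1), DescrStatsW(group2))
low, high = cm.tconfint_diff(alpha=0.05, usevar='pooled')  # 95% CI for the mean difference
print('Mean difference: {:.2f}, 95% CI: {:.2f} to {:.2f}'.format(group1.mean() - group2.mean(), low, high))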
I am trying to do a power analysis using the Python statsmodels package. I need to input effect size, power and alpha, but I am not sure if what I have for effect size is right.
Say I have a study of medicine vs placebo, and the outcome is duration of stay in hours. Most people use a standardised effect size like 0.2, 0.5 or 0.8, but it seems that an unstandardised effect size is more suitable here. For my study, I am only interested in whether those on medicine have a shorter duration of stay than those on placebo, and to me, a reduction of 12 hours or more would be clinically meaningful.
The SD of the two groups combined is 34.9 hrs.
Would the effect size then be 12/34.9?
Hi Kazuya,
Yes, you are correct. A standardised effect size is simply a method to measure an effect in units of standard deviations, where the SD is the pooled variability across groups. Even if you powered the study using the effect in its natural units, you would still need to indicate how variable that effect is expected to be. Marty wrote a nice post on interpreting standardised effects; this might be helpful. Cheers.
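To make the calculation concrete, here is a sketch using the numbers from Kazuya’s comment. Because the interest is only in a reduction, a one-sided alternative could be used ('larger' here, since the effect size is entered as a positive value); whether to power a study one-sided is a separate judgment call:

from statsmodels.stats.power import tt_ind_solve_power

# 12-hour difference and combined SD of 34.9 hours, from the comment above
effect = 12 / 34.9
n = tt_ind_solve_power(effect_size=effect, alpha=0.05, power=0.8, ratio=1, alternative='larger')
print('Number in *each* group: {:.1f}'.format(n))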