Calculating sample size for a 2 independent sample t-test

Scientists often plan for studies by calculating how many subjects or units need to be tested in order to find an effect. That is, they plan for a study using statistical power according to principles of hypothesis testing. Sample size calculations are usually required in ethics applications and grant proposals to justify the study.

We previously learned how to calculate sample size for a 2 independent sample t-test in R. If you do most of your work in Python, you can instead use the statsmodels package to perform the same calculation. statsmodels is a Python module for conducting many statistical tests and analyses. It has been tested against R and other statistical packages, and lets you fit models either with R-style formulas and pandas DataFrames or with NumPy arrays.

Calculating sample size for a 2 independent sample t-test in Python requires similar parameters to the calculation in R, but there are some differences. Here’s how to do it in statsmodels (output is shown after the >>> prompt, and documentation available here):

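from statsmodels.stats.power import tt_ind_solve_power

# Expected difference between the two group means, and the standard deviation
mean_diff, sd_diff = 0.5, 0.5
# Standardised effect size: mean difference in units of standard deviations
std_effect_size = mean_diff / sd_diff

# nobs1 is left unset, so the function solves for the size of sample 1
n = tt_ind_solve_power(effect_size=std_effect_size, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')
print('Number in *each* group: {:.5f}'.format(n))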

>>> Number in *each* group: 16.71472
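
Since subjects come in whole numbers, the fractional result is normally rounded up. As a minimal sketch (not part of the original calculation), the same function can then be asked for the power actually achieved: pass the rounded sample size as nobs1 and set power to None, so that power becomes the quantity solved for. The variable names n_exact, n_per_group and achieved_power are illustrative only.

```python
import math
from statsmodels.stats.power import tt_ind_solve_power

# Required sample size per group for a standardised effect size of 1
n_exact = tt_ind_solve_power(effect_size=1.0, alpha=0.05, power=0.8,
                             ratio=1, alternative='two-sided')

# Round up (not to nearest) so the study is not under-powered
n_per_group = math.ceil(n_exact)

# Exactly one of effect_size, nobs1, alpha, power must be None;
# here power is None, so the function returns the achieved power
achieved_power = tt_ind_solve_power(effect_size=1.0, nobs1=n_per_group, alpha=0.05,
                                    power=None, ratio=1, alternative='two-sided')

print('Recruit {} per group; achieved power: {:.3f}'.format(n_per_group, achieved_power))
```

The achieved power should come out slightly above the 0.8 target, because rounding up adds a fraction of a subject to each group.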

The tt_ind_solve_power() function requires the following parameters to calculate sample size:

  • effect_size: The standardised effect size, i.e. the difference between the two means divided by the standard deviation; this value has to be positive. (This is different to R’s delta parameter, which requires the mean difference only.)
  • alpha: Significance level or probability of Type I error (false positives), usually set at 0.05.
  • power: Power of the test, or 1 – probability of Type II error (false negatives), usually set at 0.8.
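  • ratio: Ratio of sample size in sample 2 relative to sample 1, default set at 1. (This function can be used to calculate sample size or power for unevenly-sized samples; the value returned is the size of sample 1.)
  • alternative: Direction of the effect the test is powered to detect. The default 'two-sided' powers the test to detect effects in either direction (e.g. the effect could be an increase or a reduction in outcome, not forced to be only an increase in outcome); 'larger' and 'smaller' power a one-sided test.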
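
To illustrate the ratio and alternative parameters, here is a small sketch that reuses the effect size of 1 from the example above. The 2:1 allocation and the one-sided test are assumptions chosen purely for illustration, as are the variable names.

```python
from statsmodels.stats.power import tt_ind_solve_power

# Settings shared by all three calculations
common = dict(effect_size=1.0, alpha=0.05, power=0.8)

# Baseline: equal group sizes, two-sided test
n_equal = tt_ind_solve_power(ratio=1, alternative='two-sided', **common)

# ratio=2: sample 2 is twice the size of sample 1.
# The function returns the size of sample 1; sample 2 is ratio * sample 1.
n1_unequal = tt_ind_solve_power(ratio=2, alternative='two-sided', **common)
n2_unequal = 2 * n1_unequal

# One-sided test: only powered to detect an effect in the positive direction
n_one_sided = tt_ind_solve_power(ratio=1, alternative='larger', **common)

print('Equal groups, two-sided: {:.2f} per group'.format(n_equal))
print('2:1 allocation, two-sided: {:.2f} and {:.2f}'.format(n1_unequal, n2_unequal))
print('Equal groups, one-sided: {:.2f} per group'.format(n_one_sided))
```

Unequal allocation lets sample 1 be smaller but increases the total number of subjects, while a one-sided test needs fewer subjects at the cost of being unable to detect an effect in the other direction.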

In the code above, we specified the difference between the two means and the standard deviation (sd_diff in the code; the pooled variability across groups that is used to standardise the difference) as 0.5 each, producing a standardised effect size of 1. This means we are calculating sample size (or powering the study) to detect quite a big effect! Performing the sample size calculation in Python obtains the same answer, to 4 decimal places, as the output from R.

Changing the standardised mean difference we want to detect changes the required sample size. For example, for a given mean difference of 0.5, sample size increases as the standard deviation increases, because the standardised effect size gets smaller:

for sd in [0.4, 0.5, 0.6]:
    n = tt_ind_solve_power(effect_size=mean_diff/sd, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')
    print('Number in *each* group when SD is {:<4.1f}: {:.2f}'.format(sd, n))

>>> Number in *each* group when SD is 0.4 : 11.09
>>> Number in *each* group when SD is 0.5 : 16.71
>>> Number in *each* group when SD is 0.6 : 23.60
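
The pattern in these numbers can be understood from the textbook normal-approximation formula for two equal groups and a two-sided test, n ≈ 2 × ((z for 1 − alpha/2) + (z for power))² / (effect size)². The sketch below uses only the Python standard library; the function name approx_n_per_group is hypothetical, and this is not how statsmodels computes its result (statsmodels works with the t distribution, so its values are about one subject per group larger).

```python
from statistics import NormalDist


def approx_n_per_group(mean_diff, sd, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group (two-sided test, equal groups)."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the two-sided test
    z_power = z.inv_cdf(power)            # quantile corresponding to the target power
    effect_size = mean_diff / sd          # standardised effect size
    return 2 * ((z_alpha + z_power) / effect_size) ** 2


for sd in [0.4, 0.5, 0.6]:
    n_approx = approx_n_per_group(mean_diff=0.5, sd=sd)
    print('Approximate number in *each* group when SD is {:.1f}: {:.2f}'.format(sd, n_approx))
```

Because the effect size is squared in the denominator, doubling the standard deviation (or halving the mean difference) roughly quadruples the required sample size.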

Summary

We used Python’s statsmodels module to calculate sample size for a 2 independent sample t-test. Sample size depends on the size of the difference between groups, the variability of the outcome, and the tolerance for Type I and II errors.

 

2 comments

  • I am trying to do a power analysis using the Python statsmodels package. I need to input effect size, power and alpha, but I am not sure if what I have for effect size is right.

    Say I have a study on medicine vs placebo, and the outcome is duration of stay in hours. Most people use standardized effect sizes like 0.2, 0.5 or 0.8, but it seems that an unstandardized effect size is more suitable here. For my study, I am only interested in whether those on medicine have a shorter duration of stay than those on placebo, and to me, the effect is clinically meaningful if it is 12 hours or less.

    SD of two groups combined is 34.9 hrs.

    Would the effect size then = 12/34.9 ?

    • Hi Kazuya,
      Yes, you are correct. A standardised effect size is simply a method to measure an effect in units of standard deviations, where the SD is the pooled variability across groups. Even if you powered the study using the effect in its natural units, you would still need to indicate how variable that effect is expected to be. Marty wrote a nice post on interpreting standardised effects; this might be helpful. Cheers.
