R: Analysing small datasets – Part 2
In the previous post we plotted repeated measures data from 10 subjects under 2 conditions. There are different ways to analyse small datasets. We could apply parametric methods, such as describing the data with means and standard deviations and calculating a paired difference. Or we could apply nonparametric methods, which analyse data values based on their rank rather than their numeric value. For example, we could calculate the middle (median) value of the data and the range between the 25th and 75th percentiles (interquartile range), and show these values on a boxplot with the raw data.
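To make the two approaches concrete, here is a minimal sketch that computes both sets of summaries. It assumes that df is R's built-in sleep dataset, which has the same column names (extra, group, ID) as the data used in this series; the variable names means, sds, d, medians and iqrs are only illustrative.

```r
# Assumption: df is R's built-in 'sleep' dataset (columns: extra, group, ID)
df <- sleep

# Parametric summaries: mean and standard deviation for each condition
means <- tapply(df$extra, df$group, mean)
sds   <- tapply(df$extra, df$group, sd)

# Paired difference (drug 2 minus drug 1) for each subject.
# This relies on rows being ordered by ID within each group, as they are in 'sleep'.
d <- df$extra[df$group == "2"] - df$extra[df$group == "1"]
mean_diff <- mean(d)

# Nonparametric summaries: median and interquartile range for each condition
medians <- tapply(df$extra, df$group, median)
iqrs    <- tapply(df$extra, df$group, IQR)

print(rbind(mean = means, sd = sds, median = medians, IQR = iqrs))
print(mean_diff)
```

The rank-based summaries (median, IQR) are less sensitive to a few extreme values than the mean and standard deviation, which matters with only 10 subjects.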
# plot scatterplot and boxplots of hours of sleep
set.seed(10) # keeps noise in jitter constant
fig <- ggplot(df, aes(x=group, y=extra)) +
  geom_boxplot(width=0.4) +
  geom_point(aes(color=ID), size=5, position=position_jitter(width=0.05)) +
  # Remove comment from line below to plot lines linking points from the same subject for drugs 1 and 2
  # geom_line(aes(x=group, y=extra, group=ID), size=1, alpha=0.5) +
  xlab('Drug') +
  ylab('Extra sleep (hour)') +
  theme_bw() +
  theme(axis.line = element_line(colour="black"),
        panel.border = element_rect(colour="black", size=1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        text = element_text(size=18))
print(fig)
# save figure
png(filename="boxplot.png", width=11, height=7, units='in', res=300)
plot(fig)
dev.off()

# reshape dataframe from long to wide format
df_wide <- reshape(df, idvar='ID', timevar='group', direction='wide')
print(df_wide)

# calculate median and IQR of hours of sleep
lapply(df_wide[, 2:3], quantile)

Lines 1-21. The code for the boxplot and scatterplot is nearly identical to the graph code in the previous post. At this point, the graph is plotted with the data in long format.
Lines 23-28. Reshape the data into wide format, i.e. repeated observations for each subject are written on the same row, so that each subject's data occupy a single row. Then calculate the median and IQR values of the outcome (hours of sleep) for each condition (drugs 1 and 2) using the list apply function (lapply). The quantile function returns the 0%, 25%, 50%, 75% and 100% values: the median is the 50% value, and the IQR spans the 25% to 75% values.
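The reshape step can be easier to follow on a tiny example. The sketch below uses a hypothetical toy data frame (3 subjects, 2 conditions, made-up values) rather than the real data, and the names toy, toy_wide and summaries are illustrative only. It shows the long-to-wide conversion and one way to pull the median and IQR out of the quantile output.

```r
# Hypothetical toy data in long format: one row per subject per condition
toy <- data.frame(
  extra = c(1.0, 2.0, 3.0, 2.5, 3.5, 6.0),
  group = factor(rep(1:2, each = 3)),
  ID    = factor(rep(1:3, times = 2))
)

# Wide format: one row per subject, one 'extra' column per condition
toy_wide <- reshape(toy, idvar = "ID", timevar = "group", direction = "wide")
print(toy_wide)

# For each condition column, extract the quartiles and derive the IQR
summaries <- sapply(toy_wide[, c("extra.1", "extra.2")], function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75))
  c(median = unname(q[2]),
    q25    = unname(q[1]),
    q75    = unname(q[3]),
    IQR    = unname(q[3] - q[1]))
})
print(summaries)
```

Selecting the columns by name instead of by position (2:3) is a small design choice that keeps the code correct if the column order ever changes.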
The code generates Figure 1:
Figure 1: boxplots of extra sleep (hours) for drugs 1 and 2, overlaid with the raw data points.
It also prints the following median and IQR values of hours of sleep:
> lapply(df_wide[, 2:3], quantile)
$extra.1
    0%    25%    50%    75%   100%
-1.600 -0.175  0.350  1.700  3.700

$extra.2
    0%    25%    50%    75%   100%
-0.100  0.875  1.750  4.150  5.500

From the figure, it might be reasonable to think there is a between-condition difference in the medians (0.35 vs 1.75 hours), but there is also a lot of overlap in the data: this is obvious from the raw data and the interquartile ranges. How can we test whether this difference in the medians is real, and get a measure of the precision of this effect? In the next post we will calculate the between-condition difference of the medians and use a resampling technique called bootstrapping to calculate the precision of our estimate of that difference.
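As a rough preview of the idea, here is one possible sketch of a bootstrap for the difference of the medians. It is not necessarily the approach the next post takes. It assumes the data are R's built-in sleep dataset, and the seed, the number of resamples (n_boot) and the choice of a percentile interval are all arbitrary assumptions made for illustration.

```r
# Assumption: the data are R's built-in 'sleep' dataset, reshaped to wide format
df_wide <- reshape(sleep, idvar = "ID", timevar = "group", direction = "wide")

# Observed between-condition difference of the medians (drug 2 minus drug 1)
obs_diff <- median(df_wide$extra.2) - median(df_wide$extra.1)

set.seed(1)     # arbitrary seed, for reproducibility only
n_boot <- 2000  # arbitrary number of bootstrap resamples

# Resample subjects (rows) with replacement, so each subject's pair of
# observations stays together, and recompute the difference each time
boot_diffs <- replicate(n_boot, {
  idx <- sample(nrow(df_wide), replace = TRUE)
  median(df_wide$extra.2[idx]) - median(df_wide$extra.1[idx])
})

# Percentile 95% interval from the bootstrap distribution
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
print(obs_diff)
print(ci)
```

Resampling whole rows of the wide data frame respects the repeated measures design; resampling each condition independently would treat the two conditions as unrelated groups.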
Summary
For this small dataset, we calculated the nonparametric median and interquartile range of the data in each condition, and plotted these values as boxplots overlaid with the raw data points. Next, we will calculate the between-condition difference of the medians and the precision of our estimate of that difference.