Blind data analysis can contribute to reproducible research
Blind analysis does not refer to closing your eyes, crossing your fingers and hitting GO! on your statistical analysis!
In a previous post, I highlighted an article published in Nature on cognitive biases and their impact on reproducible science. Various debiasing techniques have been proposed to tackle this issue, including blind data analysis. This technique was new to me, so I read with interest the article in the same issue of Nature entitled Hide results to seek the truth.
Blinding is not new to science. Some scientific journals request that authors submit de-identified versions of their manuscript for review. Similarly, double-blind studies have become the gold standard for determining the effectiveness of therapeutic interventions. However, these approaches do not control for the potential impact of cognitive biases on data analysis.
In an ideal world, data management and analysis protocols are prepared before data are collected. In the real world, these protocols are often never prepared, or they are finalized post-hoc after several looks at the data. These looks are usually justified: data must be inspected for outliers, unusual trends or skewed distributions. The problem is that researchers are not blind to the study’s hypothesis when they clean the data and finalize statistical analyses. Thus, they may be unfairly critical of extreme values (i.e., outliers) that do not support their hypothesis, but allow data to go unquestioned if they support their hypothesis. Many other decisions can be influenced by cognitive biases in this way if researchers are not blinded. The solution: blind researchers to their data while they clean it and finalize the analysis protocol.
With blind analysis, all important decisions about data analysis are made before the true data and results are revealed. In theory, the approach is simple. A trusted colleague or computer programmer knowingly perturbs the data and reorders or removes data labels so that researchers making decisions about data cleaning and statistical analysis cannot identify from which group or experimental condition the data originate. Examples of data blinding techniques include adding random noise to each data point or adding different constants to different experimental groups. The order the data are presented in can be randomized and data labels can be removed or replaced by generic ones. Importantly, the perturbed data must retain key characteristics of the real data (e.g., variance, outliers, correlations) so that decisions about data cleaning and analysis remain valid and relevant when applied to the real data. Before being unblinded, the research team should conduct as much of the analysis as possible on this modified dataset, and agree to publish the final results without further tweaking of the data or analysis.
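To make the blinding step more concrete, here is a minimal Python sketch of how a trusted colleague might prepare a blinded dataset. It is only an illustration under my own assumptions, not a method taken from the Nature article: the function name `blind_dataset`, the parameters `offset_scale` and `noise_sd`, and the example numbers are all hypothetical. The sketch adds a different secret constant to each group, optionally adds random noise to each data point, replaces the real labels with generic ones and shuffles the row order.

```python
import random


def blind_dataset(records, offset_scale=10.0, noise_sd=0.0, seed=None):
    """Return a blinded copy of `records` plus the key needed to interpret it.

    `records` is a list of (group_label, value) pairs. All names and default
    values here are illustrative assumptions, not an established protocol.
    """
    rng = random.Random(seed)
    groups = sorted({group for group, _ in records})

    # A different secret constant per group hides the between-group
    # differences while leaving within-group spread and outliers intact.
    offsets = {g: rng.uniform(-offset_scale, offset_scale) for g in groups}

    # Replace the real labels with generic ones, assigned in random order.
    generic = [f"group_{i + 1}" for i in range(len(groups))]
    rng.shuffle(generic)
    label_map = dict(zip(groups, generic))

    # Optional per-point noise; note that noise_sd > 0 slightly inflates
    # the variance, so it should be small relative to the real spread.
    blinded = [
        (label_map[g], value + offsets[g] + rng.gauss(0.0, noise_sd))
        for g, value in records
    ]

    # Randomize the order in which the data are presented.
    rng.shuffle(blinded)

    # The key stays with the trusted colleague until unblinding.
    key = {"label_map": label_map, "offsets": offsets}
    return blinded, key


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    example = [
        ("treatment", 5.1), ("treatment", 6.3), ("treatment", 12.0),
        ("control", 4.8), ("control", 5.0), ("control", 5.4),
    ]
    blinded_rows, secret_key = blind_dataset(example, seed=1)
    for row in blinded_rows:
        print(row)
```

The researchers would receive only the blinded rows, while the key stays with the colleague. Because each group is simply shifted by a constant (when `noise_sd` is 0), an extreme value is just as extreme in the blinded data, so decisions about outliers made on the blinded data carry over to the real data.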
Like it or not, humans are prone to cognitive biases and scientists must protect their studies from their own rose-colored glasses.