Errors in science: I make them, do you? Part 3

In my last two posts, I talked about my experience with errors in science. Errors are part of life, and they are part of science. Without errors we can’t learn. However, if we hide away our data, our analysis and, if relevant, our code, errors will not be noticed. They and the data and results they generate will become part of the published literature, assumed by many to reflect the truth.

While errors can occur during every step of the scientific process, my feeling is that the data analysis and statistical analysis steps are particularly vulnerable to errors. And this is where I see a clear demarcation: those who code and those who use spreadsheets.

Spreadsheets

I have an allergy to spreadsheets. I break out into hives whenever I see one, especially monster spreadsheets with ungodly colors everywhere. Simple spreadsheets can be useful, but the moment I have to look at formulas embedded in spreadsheet cells my airway starts to narrow.

It is easy to make errors in a spreadsheet. There are countless stories of companies making big mistakes, some causing billions of dollars to be lost. Such errors also occur in science. The bigger and more complex the spreadsheet, the more likely it will contain mistakes. Unfortunately, spreadsheet errors can be difficult to identify, and people tend to be overconfident in the accuracy of their spreadsheets.

Computer code

To me, it seems logical to use computer code to analyse data. Computers are good at doing mundane, repetitive tasks with amazing consistency. However, this does not mean the output will be correct; that depends on what we asked the computer to do. As the old saying goes: garbage in, garbage out.
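
Here is a small sketch of what garbage in, garbage out can look like in practice. The numbers are made up, and the use of -999 as a code for a missing value is an assumption for illustration only. The computer averages the list faithfully either way; only one of the two answers is the one we actually wanted.

```python
import statistics

# Hypothetical reaction times in milliseconds.
# -999 is an assumed placeholder for a missing measurement.
MISSING = -999
reaction_times = [512, 498, MISSING, 530, 505]

# What we asked for: the mean of every number in the list,
# including the placeholder. The computer does exactly that.
naive_mean = statistics.mean(reaction_times)

# What we meant: the mean of the real measurements only.
valid_times = [t for t in reaction_times if t != MISSING]
clean_mean = statistics.mean(valid_times)

print(f"mean including placeholder: {naive_mean}")
print(f"mean of valid values only:  {clean_mean}")
```

Both calculations run without complaint, which is the point: the code cannot tell us that the first question was the wrong one to ask.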

It is true that computer code can be rather intimidating to look at and read, especially when people do not make an effort to write clear, readable and properly documented code. But when code is well written and documented, it can be a pleasure to read.

The more I learn about computer science and how software is developed and maintained, the more I admire the systematic approach used to improve and fix software. For example, software developers use version control to track the history of all changes that have been made to the code. That way, you can go back to an earlier version if needed. Version control also allows you to work on new bits of code without breaking the current version that works. These branches, as they are often referred to, are used to work out solutions to problems (i.e. bugs) in the code.

Problems in the code?! Bugs?!

Yes, computer scientists are very open about the fact that the code they write may contain errors. These errors are so common that entire processes and workflows have been created to track and manage them. For example, on GitHub, a popular website that many individuals and companies use to host their software projects, users can lodge issues and suggest fixes to errors they have identified.

I want to make sure this is clear. As the software is being written, the team can see, write and fix the code. Then, once a version of the software is officially released, the team continues to fix bugs (or more serious problems) as they are found. These changes are not new features added to the software; they are corrections.

How does this differ from how code is written in science? While an increasing number of scientists are trying to adhere to best practices (or good-enough practices), many more scientists are self-taught and write computer code in isolation, unaware of the lessons learned by computer scientists and software developers.

Publishing scientific code

Up until recently, the code that generated results presented in scientific papers was not made public. Readers had to trust that the authors did not make mistakes. While this is still how it works for most published papers, there is a move towards publishing the code and de-identified data that were used to generate results. This represents an important step forward; however, it will take some time before it becomes common practice.

Why is that? I think part of the problem is that many scientists write code for themselves. They want results, not clear and well-documented code. Cleaning up and documenting code takes time. Thus, a scientist may be a little less productive (in the publish or perish sense) if she has to spend time making her code presentable. So what do people do? Some folks simply dump their incomprehensible code into a folder and say they have published their code. While this is certainly a small step forward, it somewhat misses the point.

The other important issue is that by publishing your code you are opening yourself up to the possibility that someone will find an error. In software development this is expected and welcomed. In science this can lead to a paper having to be retracted: a major embarrassment to the scientist. Thus, one approach would be to have colleagues and peers carefully review the code and results before the paper is published, and where applicable implement some form of software testing and validation.

This might involve formal code reviews, or creating a series of software tests that make sure the code is doing what it is supposed to do. It might also involve creating fake data (or using reference data) and running it through the analysis pipeline to make sure the results are sensible. While these practices are commonplace in software development, and in some research laboratories, there is a long way to go before published results are truly reproducible.
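
To make the fake-data idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `mean_difference` stands in for whatever analysis step you want to check, and `simulate_groups` creates data in which the true effect is known because we put it there. If the analysis cannot recover an effect we planted ourselves, something is wrong with the code.

```python
import random
import statistics


def mean_difference(treated, control):
    """Hypothetical analysis step: difference between two group means."""
    return statistics.mean(treated) - statistics.mean(control)


def simulate_groups(true_effect, n, seed):
    """Create fake data in which the true group difference is known."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    control = [rng.gauss(10.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(10.0 + true_effect, 1.0) for _ in range(n)]
    return treated, control


def test_recovers_known_effect():
    treated, control = simulate_groups(true_effect=2.0, n=5000, seed=1)
    estimate = mean_difference(treated, control)
    # With n = 5000 per group and SD = 1, the standard error of the
    # difference is about 0.02, so 0.1 is a generous tolerance.
    assert abs(estimate - 2.0) < 0.1


def test_identical_groups_give_zero():
    data = [1.0, 2.0, 3.0]
    assert mean_difference(data, data) == 0.0


# Run the checks directly; a test runner such as pytest would also find them.
test_recovers_known_effect()
test_identical_groups_give_zero()
```

The tolerance is a judgement call: too tight and the test fails by chance, too loose and it would miss a real bug. Fixing the random seed keeps the result the same from one run to the next.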

Summary

I don’t want to seem pessimistic. Errors are simply a part of life, and therefore they are part of science. As scientists, we have to resist the urge to rush ahead and publish our results without formally testing and validating our code and results. We have to fight against the pressure of administrators and directors who create incentives based on quantity, not quality.

It will, at first, take a little longer to write clear code that you are willing to share with the world. It will take longer to have colleagues review your work to check it for errors. It will take longer to create tests and simulated data to put your code through its paces. However, you will have greater certainty in your results. And importantly, these efforts will lead to great time savings in the future, a future where such practices will become commonplace and possibly mandated.
