Saturday, September 26, 2015

Programming Bugs and P-Values

Bugs Happen

Programs have bugs. Anyone who has written code for long enough learns that while most bugs are obvious (typos, indention problems, etc.) others are subtle, and difficult to detect. Professional developers spend a lot of their time writing tests in order to try to catch these bugs.

However, I have learned that, for the most part, scholars don't write tests. I think that there are a number of good reasons for this: the programs they write are much simpler and can break in fewer ways, the code is only run a few times and rarely require input from external users, etc. Without good tests, though, bugs will happen, and can have disastrous results. For example, a paper by Harvard economists Reinhart and Rogoff contained a bug which changed their results and may have influenced worldwide financial policy.

Stopping Bugs

How should scholars react to this? When learning how rare tests were, my knee-jerk reaction was that we should require scholars to write tests in the same way that professional developers write tests.

I do think that scholars should do much more to avoid bugs - they should review their code with others, write basic tests, and look at raw data (and distributions of raw data) to make sure that it looks reasonable.

However, I don't think that we should hold academic code to the same standard as professional code, for a few reasons. First, because in a user-facing software environment, the software will be pushed to its limits, generally with far more variation in inputs, expected use cases, etc. This means that there are a lot of bugs which would matter in this environment which wouldn't affect academic code (e.g., not sanitizing inputs). No matter how thorough testing is, bugs cannot be completely eliminated from complex code, but the optimal balance for professional code includes more testing than in an academic environment.

Obligatory XKCD comic
Second, because the output of scholarly code isn't a program, but an analysis. Rather, it is an argument intended to persuade the audience about the state of the world. Instead of eliminating bugs, readers can accept the reality of bugs when assessing arguments based on academic code.

Trusting Results Less

But how should readers adjust their beliefs regarding research results that are produced in an environment that includes software bugs? In other words, are bugs more likely to produce stronger or weaker results (i.e., higher or lower p-values)?

Intuition says that any sort of error (such as a bug) should make results less reliable. However, ceteris paribus, we should actually expect results that are weaker than reality. Researchers include measures that they believe are correlated with outcomes. We should expect accidental/noisy deviations from our intended measures (such as those from a subtle bug) to have a weaker correlation with outcomes.

However, research doesn't happen in a vacuum, and there is a human in the middle. When researchers see results that are unexpected (either in the wrong direction or non-significant), they are likely to examine the data and code to try to understand why (and are therefore more likely to find bugs). On the other hand, when results line up with their expectations, researchers are unlikely to probe further, and bugs which push results in the expected direction are therefore more likely to remain unexposed.

The overall effect on the p-value is dependent on the likelihood of having a bug, the likelihood that the bug increases the p-value estimate, the likelihood of searching for the bug, and the likelihood of finding the bug. If we call the true p-value p and the estimate of the p-value we could model the likelihood of an underestimate of the p-value (i.e., thinking an effect is stronger than it actually is) as:

P( < p) = P(bug exists) * P( < p | bug exists) * (1 - P(search for bug | < p) * P(find bug | search for bug))
*  (1 - P(search for bug | < p) * P(find bug | search for bug)) is 1 - the probability that a bug is found; in other words, the probability that a bug is not found and remains in the published results.

The likelihood of an overestimate (i.e., thinking an effect is weaker than it is) is:

P( > p) = P(bug exists) * P( > p | bug exists) * P(search for bug | > p) * P(find bug | search for bug)

The argument is that P( > p | bug exists) > P( < p | bug exists), but P(search for bug | > p) > P(search for bug | < p). That is, it's more likely that bugs increase p-value estimates, but also more likely that they are searched for.

Which outcome is more likely depends on each of the probabilities, but a rough example is illustrative. If we assume that bugs exist in 20% of projects, that 70% of the time they increase p-value estimates (and 30% of the time decrease them), that 80% of the time when they increase p-value estimates they are searched for (and 0% of the time when they decrease p-values), and that they are found 80% of the time they are searched for, then:

P( < p) = .2 * .3 * (1 - 0 * .8) = .06
P( < p) = .2 * .7 * (1 - .8 * .8) = .0504

In other words, 6% of studies would have underestimates of p-values, and ~5% would have overestimates.

Conclusion

So, what should we take from this? First, that researchers should do a better job with testing. Simple tests plus eyeballing of results can catch a lot of bugs, including the most egregious bugs which are most likely to distort results.

Second, that even if researchers do more tests, bugs will still exist and that's OK. As readers, we should just take the results of computational research projects with a grain of salt. Assuming my estimates of probabilities are reasonably close to reality (a very courageous assumption) suggests that the likelihood of both Type I and Type II errors is actually quite a bit higher than statistical parameters would imply. That is, we should trust both significant and null results a bit less.