An Ego-Centric Researcher: 2015

Bugs Happen

Programs have bugs. Anyone who has written code for long enough learns that while most bugs are obvious (typos, indentation problems, etc.), others are subtle and difficult to detect. Professional developers spend a lot of their time writing tests to try to catch these bugs.

However, I have learned that, for the most part, scholars don't write tests. I think that there are a number of good reasons for this: the programs they write are much simpler and can break in fewer ways, the code is only run a few times and rarely requires input from external users, etc. Without good tests, though, bugs will happen, and they can have disastrous results. For example, a paper by Harvard economists Reinhart and Rogoff contained a bug which changed their results and may have influenced worldwide financial policy.

Stopping Bugs

How should scholars react to this? When I learned how rare tests were, my knee-jerk reaction was that we should require scholars to write tests in the same way that professional developers do.

I do think that scholars should do much more to avoid bugs - they should review their code with others, write basic tests, and look at raw data (and distributions of raw data) to make sure that it looks reasonable.
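As a rough illustration of what "basic tests" and "looking at raw data" might mean in practice, here is a minimal sketch of a sanity check on a single numeric column. The function name `sanity_check`, the `ages` data, and the plausible range are all hypothetical, made up for this example; real checks would depend on the dataset.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sanity_check(values, low, high):
    """Return a list of human-readable problems found in a numeric column."""
    problems = []
    present = [v for v in values if not is_missing(v)]
    n_missing = len(values) - len(present)
    if n_missing:
        problems.append(f"{n_missing} missing value(s)")
    # Values outside the range the researcher considers plausible
    out_of_range = [v for v in present if not low <= v <= high]
    if out_of_range:
        problems.append(f"{len(out_of_range)} value(s) outside [{low}, {high}]")
    # A column with no variation usually signals a merge or parsing bug
    if len(present) > 1 and statistics.pstdev(present) == 0:
        problems.append("no variation in the data")
    return problems


# Hypothetical raw data: ages with a sentinel value, a gap, and a typo
ages = [34, 29, 41, -1, 38, None, 420]
print(sanity_check(ages, low=0, high=120))
```

Checks like this don't prove the analysis is right, but they are cheap and tend to catch the most egregious problems before they reach a regression.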

However, I don't think that we should hold academic code to the same standard as professional code, for a few reasons. First, in a user-facing software environment, the software will be pushed to its limits, generally with far more variation in inputs, expected use cases, etc. This means that many bugs which would matter in that environment (e.g., not sanitizing inputs) wouldn't affect academic code. No matter how thorough testing is, bugs cannot be completely eliminated from complex code, but the optimal balance for professional code includes more testing than the optimal balance for academic code.

Obligatory XKCD comic

Second, the output of scholarly code isn't a program but an analysis: an argument intended to persuade the audience about the state of the world. Instead of expecting every bug to be eliminated, readers can accept the reality of bugs when assessing arguments based on academic code.

Trusting Results Less

But how should readers adjust their beliefs regarding research results that are produced in an environment that includes software bugs? In other words, are bugs more likely to produce stronger or weaker results (i.e., lower or higher p-values)?

Intuition says that any sort of error (such as a bug) should make results less reliable. However, ceteris paribus, we should actually expect results that are weaker than reality. Researchers include measures that they believe are correlated with outcomes. We should expect accidental/noisy deviations from our intended measures (such as those from a subtle bug) to have a weaker correlation with outcomes.
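A small simulation can illustrate the attenuation argument. This is only a sketch under stated assumptions: the "bug" here is hypothetical and purely noisy (a fraction of rows get an unrelated random value in place of the intended measure), and the names `pearson` and `simulate` are made up for this example. Systematic bugs could behave differently.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, bug_rate=0.2, seed=0):
    """Return (correlation with intended measure, correlation with buggy measure)."""
    rng = random.Random(seed)
    measure = [rng.gauss(0, 1) for _ in range(n)]
    # The outcome truly depends on the intended measure, plus noise
    outcome = [m + rng.gauss(0, 1) for m in measure]
    # Hypothetical bug: some rows are replaced with an unrelated value
    buggy = [rng.gauss(0, 1) if rng.random() < bug_rate else m for m in measure]
    return pearson(measure, outcome), pearson(buggy, outcome)


r_clean, r_buggy = simulate()
print(f"intended measure: r = {r_clean:.3f}")
print(f"buggy measure:    r = {r_buggy:.3f}")
```

Under these assumptions the buggy measure correlates less strongly with the outcome, so the estimated effect is weaker and the p-value estimate higher than it would be without the bug.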

However, research doesn't happen in a vacuum, and there is a human in the middle. When researchers see results that are unexpected (either in the wrong direction or non-significant), they are likely to examine the data and code to try to understand why (and are therefore more likely to find bugs). On the other hand, when results line up with their expectations, researchers are unlikely to probe further, and bugs which push results in the expected direction are therefore more likely to remain unexposed.

The overall effect on the p-value is dependent on the likelihood of having a bug, the likelihood that the bug increases the p-value estimate, the likelihood of searching for the bug, and the likelihood of finding the bug. If we call the true p-value p and the estimate of the p-value p̂ we could model the likelihood of an underestimate of the p-value (i.e., thinking an effect is stronger than it actually is) as:

P(p̂ < p) = P(bug exists) * P(p̂ < p | bug exists) * (1 - P(search for bug | p̂ < p) * P(find bug | search for bug))

Here, (1 - P(search for bug | p̂ < p) * P(find bug | search for bug)) is one minus the probability that the bug is found; in other words, the probability that the bug is not found and remains in the published results.

The likelihood of an overestimate (i.e., thinking an effect is weaker than it is) is:

P(p̂ > p) = P(bug exists) * P(p̂ > p | bug exists) * (1 - P(search for bug | p̂ > p) * P(find bug | search for bug))

The argument is that P(p̂ > p | bug exists) > P(p̂ < p | bug exists), but P(search for bug | p̂ > p) > P(search for bug | p̂ < p). That is, bugs are more likely to increase p-value estimates than to decrease them, but bugs that increase p-value estimates are also more likely to be searched for.

Which outcome is more likely depends on each of the probabilities, but a rough example is illustrative. If we assume that bugs exist in 20% of projects, that 70% of the time they increase p-value estimates (and 30% of the time decrease them), that 80% of the time when they increase p-value estimates they are searched for (and 0% of the time when they decrease p-values), and that they are found 80% of the time they are searched for, then:

P(p̂ < p) = .2 * .3 * (1 - 0 * .8) = .06
P(p̂ > p) = .2 * .7 * (1 - .8 * .8) = .0504

In other words, 6% of studies would have underestimates of p-values, and ~5% would have overestimates.
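The arithmetic is easy to reproduce and to rerun with different guesses. Below is a minimal sketch, assuming that in either direction a bug stays in the published results unless it is both searched for and found; the function name `p_bug_survives` and its parameter names are made up for this example, and the inputs are the rough guesses from the example above, not measured values.

```python
def p_bug_survives(p_bug, p_direction, p_search, p_find):
    """Probability that a project has a bug pushing the p-value estimate in a
    given direction AND that the bug is never caught.

    p_bug:       probability that a bug exists
    p_direction: probability that the bug moves the estimate in this direction
    p_search:    probability of searching for a bug, given this direction
    p_find:      probability of finding the bug, given a search
    """
    return p_bug * p_direction * (1 - p_search * p_find)


# Bug lowers the p-value estimate (effect looks stronger): never searched for
underestimate = p_bug_survives(p_bug=0.2, p_direction=0.3, p_search=0.0, p_find=0.8)
# Bug raises the p-value estimate (effect looks weaker): usually searched for
overestimate = p_bug_survives(p_bug=0.2, p_direction=0.7, p_search=0.8, p_find=0.8)

print(f"P(underestimate) = {underestimate:.4f}")  # 0.0600
print(f"P(overestimate)  = {overestimate:.4f}")   # 0.0504
```

Changing the inputs shows how sensitive the balance is: the two outcomes are close under these guesses, so modest changes in the search or detection probabilities can flip which one is more likely.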

Conclusion

So, what should we take from this? First, that researchers should do a better job with testing. Simple tests plus eyeballing of results can catch a lot of bugs, including the most egregious bugs which are most likely to distort results.

Second, that even if researchers do more tests, bugs will still exist and that's OK. As readers, we should just take the results of computational research projects with a grain of salt. If my estimates of the probabilities are reasonably close to reality (a very courageous assumption), then the likelihood of both Type I and Type II errors is actually quite a bit higher than statistical parameters would imply. That is, we should trust both significant and null results a bit less.

Photo via smoothgroover22 on flickr

As a kid, I was fascinated by the idea of self-driving cars. This may be the very definition of a nerdy child, but I remember lying awake one night, thinking about how self-driving cars might work. All of my schemes involved a centralized computer that would be aware of where every car was, would calculate ahead of time when collisions might occur, and would make minor adjustments to vehicle speeds to avoid them.

However, the actual self-driving cars which are coming are very different from what I envisioned. Instead of operating via a centralized "hive mind", each car operates autonomously, without any direct knowledge about other cars. Each car gathers its own information about its local environment and makes decisions in a decentralized way, with knowledge of only its own portion of the system.

There are several reasons for this, some of which point toward general principles for when systems work better decentralized and when they work better centralized:

There are important aspects of the environment which are unknown. A centralized system doesn't know about cyclists or children running into the road, or debris, etc.
Centralized systems create a single point of failure. If there is a bug in the system, all cars are affected. A bug in a single car (or even a single model) is much less catastrophic.
Centralized systems are less adaptable. If you decide that you don't want to go to that restaurant after all, and you'd rather just go to Wendy's, with a local system, your car simply adjusts its route accordingly. With a centralized system, an adjustment to one car propagates through the system, and hundreds or thousands of route adjustments need to be calculated and applied (and similarly every time someone starts a new route, there is an accident, etc.). These are complex optimization problems which can be approximated by local systems without all of the overhead.

I think that the eventual system of transportation will be a mix of local and centralized. As I understand it, current cars have access to centralized maps, and I can imagine that the richness of the sort of data that can be centralized will grow: information about when lights will change, information about traffic jams, even information about the destination of other cars in the area. However, for the reasons outlined above, I think that these will be information systems, not decision systems. They will always be accompanied by a rich understanding of the local situation, with the actual decision making done at that level.

Friday, October 2, 2015

Moving

Saturday, September 26, 2015

Programming Bugs and P-Values

Thursday, August 6, 2015

Local vs. Global Control