Fooled By Testing
This post comes as the byproduct of taking a break from testing Rails apps to read Nassim Nicholas Taleb's The Black Swan (he is also the author of the less preachy Fooled By Randomness, hence the title) while lying on the beach on my honeymoon. As such, it should be noted that any lack of coherence can be directly attributed to the balmy Caribbean backdrop under which this was conceived.
In the software development community, and particularly in the Rails world, unit testing is the dominant paradigm of professionalism. I don't think it's much of a stretch to say that most Rails coders would agree that if you're not testing, you're doing it wrong; perhaps even that it's possible to classify coders as either 'software engineers' or 'programmers', and that this split would basically fall between 'unit testers' and 'non-testing, script-kiddie hacks'.
These posts are my attempt to think a bit about whether testing should really be sufficient to bestow the coveted status of 'engineering' on our profession, and about the dangers of relying on tests.
Don't be a turkey
The best graph in The Black Swan is (similar to) the following:
What can we tell about this graph, absent any other dimensional information? It might seem like we can tell quite a lot: we have a generally increasing metric, and it would seem we could be fairly confident that an extrapolation from this data would be a valid conclusion. Of course, the actual graph looks like this:
and it turns out that it's a graph of Food Eaten over Time for turkeys. I'll leave it as an exercise for the reader to determine where Thanksgiving lies on the timeline.
In development terms, this reminds me very much of the first application I wrote using FreeMarker as a templating language. The following worked brilliantly... until it didn't.
<#list products as product>
  <a href="products/${product.id}">${product.name}</a>
</#list>
Can you spot the bug? Does it help if I mention that it worked for the first 999 products? Sadly, product 1000 rendered as '1,000', which is not an integer, breaking the link. (The solution: Travels-travails-freemarker.)
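For the record, the usual fix (assuming FreeMarker's default, locale-aware number_format is the culprit, as the '1,000' rendering suggests) is the ?c built-in, which renders a number in plain 'computer' format without grouping separators:

<#list products as product>
  <a href="products/${product.id?c}">${product.name}</a>
</#list>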
Past performance not indicative of future results.
To quote NNT:

"our emotional apparatus is designed for linear causality."

He describes this sort of problem as a 'black swan': an unpredictable event that has a dramatic, game-changing effect. The central thesis of The Black Swan is that these events are, in some respects, the only things that matter. Worse, they happen much more often than your brain was built to expect. I think this jibes with any seasoned developer's sense of the amazing ability of code to break under duress (numbers greater than 1000, demoing for the boss, butterfly wings, etc.).
So what does this have to do with testing?
Testing is billed as a way to get early, continual positive feedback. NNT would say that we're attempting to 'platonicize' software: to wrest order and predictability from chaos. I think it does do a great job of that, or at least a great job of making us feel like we've created order from chaos. TATFT == 'dopamine release'. It makes us feel good. Better yet, since our ancestral environment was made up of small, localized, linear events, and it is thus hedonistically better to spread positive effects over time, we've developed continuous integration, which compounds the hedonistic effect.
# the ancestral feedback loop: small, local, immediately rewarding
def gather(berry)
  return :good_taste
end
Wash, rinse, repeat. After a long day of successful unit testing I feel like I'm the king of the world.
But let's face it: software couldn't be less linear if we tried (and we're trying: e.g., concurrency). The big problem category (cascading failures, server flapping, the Slashdot effect) is made up of highly non-linear problems, and the small problem category (unexpected input, unexpected side effects) is populated principally by unknown unknowns.
To write good software we need to stay vigilant.
Gaining confidence from testing is akin to throwing on a shawl in a hurricane. I went to a talk by ThoughtBot a couple of weeks ago about TDD, and they showed an (admittedly awesome) slide of a cat walking calmly past a line of massive German Shepherds.
Look back at the turkey graph. That turkey probably felt pretty confident about getting fed right before he got led off behind the woodshed. That cat is not the kind of code I want to release. That cat trusts that her tests > uncertainty. I want my cat to be a paranoid, ex-CIA wacko wearing a bulletproof vest and a tinfoil hat and packing an Uzi. Confidence is the enemy.
So, "just write awesome tests" you say? Let me pick a couple bones:
Testing is complex:
Test This:
def suggest_event?
  user.can_report_events? &&
    (effects > 0 || side_effects.any? { |s| s.severity >= 2 })
end
We can generalize this to the form A && (B || C). This humble one-liner seems like it should be a great candidate for a test: easy setup.
In reality it has 8 separate execution paths, essentially forcing us to test at least a 2^3 truth table, and even that won't really capture the numerical comparators, e.g. checking the >=.
(If you're thinking that you don't need to individually test ¬A, B, C and ¬A, ¬B, C, because ¬A should short-circuit the AND, then I would say you're not really thinking of this as a black box)
Now imagine how you might set this up with shoulda or context. I shudder to think of it. Some custom truth-table asserter would probably work, and maybe you do decide to spend a half hour writing your own test framework. Good job, but at the end of the day all you're really doing is exercising the Ruby interpreter's logical expression evaluator.
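For concreteness, here's a minimal sketch of what such a truth-table asserter might look like. The Suggestion class is a hypothetical stand-in that distills suggest_event? down to its A && (B || C) shape; none of this is real application code:

require 'test/unit'

# Hypothetical stand-in for the real model, reduced to A && (B || C).
class Suggestion
  def initialize(a, b, c)
    @a, @b, @c = a, b, c
  end

  def suggest_event?
    @a && (@b || @c)
  end
end

class SuggestEventTruthTableTest < Test::Unit::TestCase
  # One row per black-box input combination: [A, B, C, expected].
  TRUTH_TABLE = [
    [true,  true,  true,  true ],
    [true,  true,  false, true ],
    [true,  false, true,  true ],
    [true,  false, false, false],
    [false, true,  true,  false],
    [false, true,  false, false],
    [false, false, true,  false],
    [false, false, false, false]
  ]

  def test_all_eight_combinations
    TRUTH_TABLE.each do |a, b, c, expected|
      assert_equal expected, Suggestion.new(a, b, c).suggest_event?,
                   "wrong answer for A=#{a}, B=#{b}, C=#{c}"
    end
  end
end

All eight rows pass, and all we've really verified is that Ruby's && and || behave as documented.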
So testing all possible code paths starts to become impossible for complex applications. So what do we do in response? I believe that, all too often, the solution is to test trivial applications! Yes, we mock out the expected interactions, stub the thing to death, and are rewarded by the sweet success of our favorite color, 'Test Pass Green'. The tests run fast and true, but there's just one little problem: the site doesn't work. I call as expert witness Yehuda Katz, who just wrote a nice post about this sort of problem in the Rails codebase (http://yehudakatz.com/2009/06/20/on-rails-testing/). Yehuda?
My general rule is “Don’t mock anything you own” and more strictly “Don’t mock anything happening inside your own process”.
Thanks, Yehuda. Just what I was trying to say, and (unlike me) people presume you know what you're talking about.
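To make the failure mode concrete, here's a hypothetical example of the kind of over-stubbed test I mean, in mocha style. Checkout, the order, and the gateway are all invented for illustration:

require 'test/unit'
require 'mocha'

# Hypothetical class under test.
class Checkout
  def initialize(gateway)
    @gateway = gateway
  end

  def process(order)
    @gateway.charge(order.total)
  end
end

class CheckoutTest < Test::Unit::TestCase
  def test_checkout_charges_the_card
    order   = stub(:total => 100)
    gateway = mock('gateway')
    gateway.expects(:charge).with(100).returns(:ok)

    # Every in-process collaborator is faked, so this stays green no
    # matter what the real order model or payment gateway actually do.
    assert_equal :ok, Checkout.new(gateway).process(order)
  end
end

Fast, green, and utterly silent about whether checkout works in production.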
So what is the point here? The point of this exercise is primarily to remind me that testing is not a panacea and that overconfidence is to be avoided. As I read and write more 'well-tested' code, I'm becoming suspicious that it is increasingly common for pernicious edge cases to slip into production because, as developers, we're thinking too much inside the testing box. That is my thesis.
So what's next? Part 2 "Security Theatre", where we look at a few case studies including a list of the past 5 bugs I've let slip into production.
FWIW, I'm aware that Unit Testing and TDD were never truly billed as an answer to all these woes. I'm also aware that I'm not the first to think about this stuff: