
Wednesday, June 24, 2009

Security Theater

In Fooled By Testing I looked at some general concerns I've had about testing and what we can expect from it. In this part I'd like to expose myself to the Internet at large and let you in on:

With no further ado, here are the past ~5 bugs I've let slip into production:

Production Bug #1
One of the joys of 'crowd-sourcing' some aspects of our application is that we regularly need to merge the symptoms that our users report, due to misspellings. When we do this, we automatically email the affected users so they know what's up. Normally this is merging 'tramors' into 'tremors' with 2 people reporting 'tramors' and 1500 reporting 'tremors'. It recently came up that we had 'headache' and 'headaches' as separate symptoms and each was reported by thousands of users. The task that fell to engineering was to do this merge without sending thousands of emails to users. I spent a good while writing pretty good tests that email didn't get sent and upgrading the merge tests in general, and I felt pretty confident in my code. We got to QA and realized that doing a merge like this was going to take 10+ minutes, so we decided to move on and check it later.

Bottom line: Not once did we check to see if the site worked after we merged headache into headaches, the very functionality we were attempting to achieve.

Good News: The code I wrote worked. The merge happened successfully and no email was sent.
Bad News: The second the merge completed, all hell broke loose. You see, headaches is a primary symptom in our mood community and, one long story and a bit of meta-programming magic later, it turns out that primary symptoms get methods like 'has_headache?' created for them. The merge blew this symptom away, thence there was no method, and by this declension we proceed into the madness whereupon Hoptoad lit up like a Christmas tree.
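For the curious, here is a minimal sketch of how this kind of failure can happen. The application code isn't shown in this post, so every name below (SymptomRegistry, build_patient_class, merge!) is a hypothetical stand-in, and the real meta-programming was surely more involved.

```ruby
# Hypothetical stand-in for the application's list of primary symptoms.
class SymptomRegistry
  attr_reader :primary

  def initialize(names)
    @primary = names.dup
  end

  # Merging folds `from` into `into` and removes `from` from the registry.
  def merge!(from, into)
    @primary.delete(from)
    @primary << into unless @primary.include?(into)
  end
end

# Hypothetical model builder: one predicate method per primary symptom,
# generated from whatever the registry holds at build time.
def build_patient_class(registry)
  Class.new do
    registry.primary.each do |name|
      define_method("has_#{name}?") { true }
    end
  end
end

registry = SymptomRegistry.new(%w[headache])
puts build_patient_class(registry).new.respond_to?(:has_headache?) # true

registry.merge!("headache", "headaches")
puts build_patient_class(registry).new.respond_to?(:has_headache?) # false
```

The merge itself does exactly what it was tested to do. The breakage lives in every caller that still expects has_headache? to exist, which is precisely the code the merge tests never touched.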

Now you are perfectly justified in claiming that only an idiot would let this happen, but I declare that I am often an idiot. Furthermore, I do solemnly swear that I would have picked up on this without TDD. The great joy and efficiency of TDD, that I can exercise code through tests, is an A+ way to avoid rigorously seeing if the site actually works.

Production Bug #2
Added a css file that broke some style things in a different part of the site.
I grant you, this isn't an example of the evils of testing, but I will say that from a user perspective CSS-fail can look an awful lot like site fail. Writing a test case to prevent this would be a monster task with poor ROI. If you need a page to render in IE, you'd best load the page in IE.

Production Bug #3
Nobody merged the old production branch, so we regressed (tests existed... but they were part of what didn't get merged).
Again, not testing's fault, but what we're looking at here is 'Things that break production' and whether tests helped, and in this case they did not. Sometimes there is no substitute for manual labor.

Production Bug #4
Cannot Update Frozen Hash. This was some squirrelly ActiveRecord nuisance. The kind of thing that cropped up in Hibernate every 3 seconds but which ActiveRecord generally seems robust to. It's in the middle of a gross controller that no one is proud of. A coworker couldn't reproduce this bug using the site, but 'fixed' this bug by writing a functional test that produced the same exception. He then fixed that, but the bug was still there. Eventually we found a way to exercise it using the app and then exorcised it, but I would argue that we got a false sense of 'fixedness' from this test.

Production Bug #5
Performance Disaster when searching for multiple treatments.
I know, I know: nobody ever said unit testing was performance testing, and technically this got stopped just short of production, but it got darn close, and it was another episode of me writing my tests, thinking all was well, and not testing the app as thoroughly as I should.

Production Bug #6
Tested: User.for_disease(disease), but actual form submits looked like User.for_disease([disease]). Turns out that array was bad news and could have had a major effect in production, though thankfully no one actually used the feature involved.

Bottom line: It is astoundingly easy to write tests that seem to exercise the full functionality of the code, but for which subtle differences in initial conditions have catastrophic effects. (See Atmospheric Disturbances for more info on the profound psychological effects of perturbations in initial conditions :)
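To make Bug #6 concrete, here is a small ActiveRecord-free sketch. The post doesn't say how the array actually misbehaved, so the silent empty result below is only an assumption, and users_for_disease / users_for_diseases are hypothetical stand-ins for User.for_disease.

```ruby
User = Struct.new(:name, :disease)

USERS = [
  User.new("ann", :migraine),
  User.new("bob", :epilepsy)
].freeze

# Hypothetical finder written (and tested) with a single disease in mind.
def users_for_disease(disease)
  USERS.select { |u| u.disease == disease }
end

# A variant that normalizes its input, so scalar and array callers agree.
def users_for_diseases(disease_or_list)
  wanted = Array(disease_or_list)
  USERS.select { |u| wanted.include?(u.disease) }
end

p users_for_disease(:migraine).map(&:name)    # ["ann"], what the test exercised
p users_for_disease([:migraine]).map(&:name)  # [], what the form actually sent
p users_for_diseases([:migraine]).map(&:name) # ["ann"]
```

The test suite for the first method can be thorough and green and still never see the one-element array that the real form produces.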

So is testing dangerous?
Well no, obviously not, but also yes a little.

Let's take a little break from development work and ask a similar question:
"Were the past 20 years of Sharpe ratios and Case-Shiller financial analytics dangerous?"
Of course not. Before these tools existed, financial holdings were opaque and this was indeed dangerous. Then again, as we've all learned, the new tools turned out to be dangerous too. Banks blew up in the 60's and 70's, but knowledge of imperfect information breeds caution. Knowing 50% more but having 90% more confidence is dangerous in the extreme. A 99% chance the site works means roughly 3.65 days/year (1% of 365) of the site not working.

Obviously most of our Rails apps are not 'too big to fail' (besides Twitter), but as confidence rises and paranoia decreases, the chance of a killer bug grows sharply. If our profession is to avoid a backlash at the reliability of online applications, we cannot simply put all our eggs in the automated testing basket.

Let's say it 3 times together:

Unit-Testing doesn't find bugs
Unit-Testing doesn't find bugs
Unit-Testing doesn't find bugs

Bottom Line: Saying 'we can trust this, it's been unit tested' is a lot like saying 'what could go wrong? That debt is insured by AIG'

Tuesday, June 23, 2009

Fooled By Testing

This post comes as the byproduct of taking a break from testing Rails apps and reading Nassim Nicholas Taleb's The Black Swan (he is also the author of the less preachy Fooled By Randomness, hence the title) while lying on the beach on my honeymoon. As such it should be noted that any lack of coherence can be directly attributed to the balmy Caribbean backdrop under which this was conceived.

In the software development community, and particularly in the Rails world, unit testing is the dominant paradigm of professionalism. I don't think it's much of a stretch to say that most Rails coders would agree that if you're not testing, you're doing it wrong, and perhaps that it's possible to classify coders as either 'software engineers' or 'programmers' and that this would basically be the split between 'unit testers' and 'non-testing, script-kiddie hacks'.

These posts are my attempt to think a bit about whether testing should really be sufficient to bestow the coveted status of 'engineering' on our profession and about the dangers of relying on tests.

Don't be a turkey

The best graph in The Black Swan is (similar to) the following:

What can we tell about this graph, absent any other dimensional information? Well, it might seem like we can tell quite a lot. We have a generally increasing metric and it would seem we could be fairly confident that an extrapolation from this data would be a valid conclusion. Of course the actual graph looks like this:

and it turns out that it's a graph of Food Eaten over Time for turkeys. I'll leave it as an exercise for the reader to determine where Thanksgiving lies on the timeline.

In development terms, this reminds me very much of the first application I wrote using Freemarker as a templating language. The following worked brilliantly... until it didn't.

<#list products as product>
  <a href="products/${product.id}">${product.name}</a>
</#list>

Can you spot the bug? Does it help if I mention that it worked for the first 999 products? Sadly, product 1000 rendered as '1,000', which is not an integer, breaking the link. (The solution: Travels-travails-freemarker)
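The same trap is easy to reproduce outside Freemarker. The Ruby sketch below is only an analog: human_number is a hypothetical helper standing in for a template engine that formats numbers for human readers by default, and product_link / safe_product_link are made-up names.

```ruby
# Hypothetical helper: formats an integer with thousands separators,
# the way a human-friendly template default might.
def human_number(n)
  n.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Buggy link builder: runs the id through the human-friendly formatter.
def product_link(id)
  "products/#{human_number(id)}"
end

# Safer link builder: ids are identifiers, not quantities, so no formatting.
def safe_product_link(id)
  "products/#{id}"
end

puts product_link(999)       # products/999
puts product_link(1000)      # products/1,000
puts safe_product_link(1000) # products/1000
```

A test that only ever creates a handful of products will pass against product_link forever, which is the turkey graph in miniature.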

Past performance not indicative of future results.

To quote NNT:
"our emotional apparatus is designed for linear causality".

He describes this sort of problem as a 'black swan': an unpredictable event that has a dramatic, game-changing effect. The central thesis of The Black Swan is that these events are, in some respects, the only things that matter. Worse, they happen much more often than your brain was built to expect. I think this jibes with any seasoned developer's sense of the amazing ability of code to break under duress (numbers greater than 1000, demo-ing for the boss, butterfly wings, etc).

So what does this have to do with testing?

Testing is billed as a way to get early, continual positive feedback. NNT would say that we're attempting to 'platonicize' software: to wrest order and predictability from chaos. I think it does do a great job of that, or at least, I think it does a great job of making us feel like we've created order from chaos. TATFT == 'dopamine release'. It makes us feel good. Better yet, since our ancestral environment was made up of small, localized, linear events, and it is thus hedonistically better to spread positive effects over time, we've developed continuous integration, and this compounds the hedonistic effect.

def gather(berry)
  return :good_taste
end

Wash, rinse, repeat. After a long day of successful unit testing I feel like I'm the king of the world.

But let's face it: software couldn't be less linear if we tried (and we're trying, i.e. concurrency). The big problem category (cascading fails, server flapping, the slashdot effect) is made up of highly non-linear problems, and the small problem category (unexpected input, unexpected side-effects) is populated principally by unknown unknowns.

To write good software we need to stay vigilant.

Gaining confidence from testing is akin to throwing a shawl on in a hurricane. I went to a talk from ThoughtBot a couple of weeks ago about TDD and they showed an (admittedly awesome) slide of a cat walking calmly past a line of massive German shepherds.

Look back at the turkey graph. That turkey probably felt pretty confident about getting fed right before he got led off behind the woodshed. That cat is not the kind of code I want to release. That cat trusts that her tests > uncertainty. I want my cat to be a paranoid, ex-CIA wacko, wearing a bullet proof vest and a tin foil hat and packing an Uzi. Confidence is the enemy.

So, "just write awesome tests" you say? Let me pick a couple bones:

Testing is complex:

Test This:
def suggest_event?
  user.can_report_events? &&
    (effects > 0 || side_effects.any? { |s| s.severity >= 2 })
end

We can generalize this to the form: A && (B || C) and this humble one liner seems like it should be a great candidate for a test, easy setup.
In reality it has 8 separate execution paths, essentially forcing us to test at least a 2^3 truth table, and even that won't really capture the numerical comparators, e.g. checking the boundary of the >=.
(If you're thinking that you don't need to individually test ¬A, B, C and ¬A, ¬B, C, because ¬A should short-circuit the AND, then I would say you're not really thinking of this as a black box)

Now imagine how you might set this up with shoulda or context. I shudder to think of it. Some custom truth table asserter would probably work, and maybe you do decide to spend a half hour writing your test framework. Good job, but at the end of the day all you're really doing is exercising the Ruby interpreter's logical expression evaluator.
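For what it's worth, a bare-bones truth table check for suggest_event? might look something like the sketch below. The models around the method aren't shown in this post, so Reporter, SideEffect, TreatmentReport and the helper names are all hypothetical stand-ins; only the body of suggest_event? comes from above.

```ruby
# Hypothetical stand-ins for the real models.
SideEffect = Struct.new(:severity)

Reporter = Struct.new(:can_report) do
  def can_report_events?
    can_report
  end
end

TreatmentReport = Struct.new(:user, :effects, :side_effects) do
  # The method from the post, unchanged in logic.
  def suggest_event?
    user.can_report_events? &&
      (effects > 0 || side_effects.any? { |s| s.severity >= 2 })
  end
end

# Build one row of the A && (B || C) table. The values sit on either side
# of each comparator: effects 1 vs 0, severity 2 vs 1.
def report_for(a, b, c)
  TreatmentReport.new(Reporter.new(a), b ? 1 : 0, [SideEffect.new(c ? 2 : 1)])
end

# All 2**3 combinations of [A, B, C].
TRUTH_TABLE = [true, false].repeated_permutation(3).to_a

# Rows where the method disagrees with the boolean formula.
def truth_table_failures
  TRUTH_TABLE.reject do |a, b, c|
    report_for(a, b, c).suggest_event? == (a && (b || c))
  end
end

p truth_table_failures # []
```

Notice that the expected value is just the same boolean expression written a second time, which is rather the point: the check is green, and all it has proven is that Ruby evaluates && and || the way Ruby evaluates && and ||.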

So testing all possible code paths starts to get impossible for complex applications. So what do we do in response? I believe that all too often, the solution is to: test trivial applications! Yes, we mock out the expected interactions, stub the thing to death and are rewarded by the sweet success of our favorite color 'Test Pass Green'. The tests run fast and true, but there's just one little problem: the site doesn't work. I call as expert witness Yehuda Katz, who just had a nice post about this sort of problem on the Rails codebase. http://yehudakatz.com/2009/06/20/on-rails-testing/ Yehuda?

My general rule is “Don’t mock anything you own” and more strictly “Don’t mock anything happening inside your own process”.

Thanks, Yehuda. Just what I was trying to say and (unlike me) people presume you know what you're talking about.
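Here is a rough sketch of the kind of thing I have in mind when a stub stands in for something we own. None of this is from Yehuda's post or from our codebase: SymptomMailer, StubMailer and MergeNotifier are hypothetical, and the bug is planted on purpose.

```ruby
# Hypothetical collaborator that we own: expects an Array of user ids.
class SymptomMailer
  def deliver(user_ids)
    user_ids.map { |id| "merged-symptom notice for user #{id}" }
  end
end

# Hand-rolled test double that accepts anything, as permissive stubs tend to.
class StubMailer
  attr_reader :calls

  def initialize
    @calls = []
  end

  def deliver(arg)
    @calls << arg
    []
  end
end

# Hypothetical code under test, with a deliberate bug:
# it hands the mailer a single id instead of an Array.
class MergeNotifier
  def initialize(mailer)
    @mailer = mailer
  end

  def notify(user_id)
    @mailer.deliver(user_id)
  end
end

# Green with the stub.
MergeNotifier.new(StubMailer.new).notify(42)

# Broken with the real collaborator.
begin
  MergeNotifier.new(SymptomMailer.new).notify(42)
rescue NoMethodError => e
  puts "real mailer blew up: #{e.class}"
end
```

The stubbed test verifies that notify calls deliver, and nothing about whether deliver can survive the call.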

So what is the point here? The point of this exercise is primarily to remind me that testing is not a panacea and that overconfidence is to be avoided. As I read & write more 'well-tested' code I'm becoming suspicious that it is more common for pernicious edge cases to slip towards production because as developers we're thinking too much inside the testing box. That is my thesis.

So what's next? Part 2, "Security Theater", where we look at a few case studies including a list of the past 5 bugs I've let slip into production.

FWIW, I'm aware that Unit Testing and TDD were never truly billed as an answer to all these woes. I'm also aware that I'm not the first to think about this stuff: