gist JS

Saturday, August 31, 2013

The Statistics of Monopoly with Respect to Cornish Game Hen Provisioning: Part 2 "Probability is a bitch"

In part one we figured out the average likelihood of a guest ending up on any particular square. So what's the problem with that?

The problem can be summed up in one, easy to remember, phrase: "YOLO".


So we did 100,000 simulations. That seems like it should be enough, right? Maybe we should do one million to be more accurate? Nope, that's not the problem. The problem is that we're not throwing one million dinner parties. Or even 100,000. We're only throwing one dinner party. And frankly, anything could happen.

Just because the expected value over the long run says that we'll need 5.44x the cornish game hens, this doesn't mean that the actual dinner party won't have 30 guests just haphazardly roll 12's on their first roll, throwing our expectations into turmoil.

So what is a Culinary Experience creator to do?

It turns out that Monte Carlo works really well here too. Since we recorded all 100,000 simulations, we can ask the question "How many game hens do I need to buy in order to have enough in 95% of simulations?" Obviously we can change the percentage we use here too. Buying the average is roughly just saying, "How many game hens do I need to buy in order to have enough in about 50% of simulations?" Which is pretty much like saying "How can I run out of game hens HALF OF THE TIME!"
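Here is a minimal Python sketch of that question, assuming we kept one guest count per simulation for a single square. The function name `hens_needed` and the toy counts are made up for illustration; this is not the original gist.

```python
import math

def hens_needed(counts, coverage):
    """Smallest observed count that is enough in at least `coverage` of the simulations."""
    ordered = sorted(counts)
    # Nearest-rank percentile: position of the first value that covers the requested share.
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]

# Hypothetical guest counts on one square across 10 simulated parties.
counts = [3, 5, 4, 6, 5, 9, 4, 5, 7, 15]
print(hens_needed(counts, 0.50))  # enough for half of the parties
print(hens_needed(counts, 0.95))  # enough for 95% of the parties
```

With only 10 toy simulations the 95% answer is simply the worst party observed. With 100,000 recorded simulations the same function would pick a value well below the maximum.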

Gimme Code:

Get an array of the pretty names for the squares.

SQUARES is something like:
What we really want is to put all the Baltic Avenues together. Put all the B&O Railroads together. You know, kinda 'zip' each of these arrays together.
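A rough Python sketch of that 'zip' step, assuming the results are stored as one list per simulation with one count per square. The square names and numbers here are toy values, not the real output.

```python
# Hypothetical results: one row per simulation, one column per square.
squares = ["Go", "Mediterranean Avenue", "Community Chest", "Baltic Avenue"]
simulations = [
    [0, 2, 1, 3],
    [1, 0, 2, 3],
    [0, 1, 1, 4],
]

# zip(*rows) transposes the table: one tuple per square,
# holding that square's count from every simulation.
zipped = list(zip(*simulations))
by_square = dict(zip(squares, zipped))

print(by_square["Baltic Avenue"])
```

After the transpose, every per-square statistic (average, percentile, max) is just a function of one tuple.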

Now zipped is:

Then we process the results:


These results have the average in column one, the 95th percentile in column two and the max observed in column three. So what's the result? Well, say we run 4 moves. The average on Chance was 5.44x. But if we want to provision enough food with 95% certainty that there will be enough, we're going to need 9x. And out of 100,000 simulations, one simulation had 15x the number of cornish game hens on Chance. That sure doesn't make it easy to plan the menu.


But what if we want to play by the real Monopoly rules? Well, then we just change our move function and run things again. This time you can see the super high prevalence of Jail and a bit of a secondary bump ~7 squares after Jail.

Friday, August 30, 2013

The Statistics of Monopoly with Respect to Cornish Game Hen Provisioning

Let's pretend that you need to throw a once in a lifetime culinary spectacle in Panama. If you're @ashinyknife, this will be no problem.

Let's pretend you decide upon a Monopoly theme. Generally, N guests start out on Go, roll dice and end up on a Monopoly square.

Let's pretend that each square has a wholly different gastronomic creation on it.

Given the above, how many cornish game hens should we expect to buy for St. Charles Place? How much caviar will we need to supply the B&O Railroad?

These are the important questions that we will set out to answer today.


Our first approach might look something like this: http://statistics.about.com/od/ProbHelpandTutorials/a/Probability-And-Monopoly.htm

Basic probability, round 1 is reasonable. Round 2 makes sense.. oh gawd round 3 starts to get hard to keep track of.

Monte Carlo

So what should we do? It seems to me that the appropriate technique to use here is Monte Carlo simulation. What is Monte Carlo? Honestly Monte Carlo should be pretty attractive to those of us for whom probability 101 was a long time ago. Basically "Monte Carlo simulation" means "let's just see what really happens". Say I ask you to figure out the probability that when flipping a coin 100 times I get at least one run of 10 heads. You've got two choices:

1) Figure out the appropriate math.
2) Flip a coin 100 times. Figure out if you get 10 heads in a row. Do this 1 million times and calculate the percentage of times when it was true.

Option 2 is Monte Carlo.
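Option 2 for the coin question can be sketched in a few lines of Python. The function names are mine, and the example uses 20,000 trials instead of 1 million so it runs quickly.

```python
import random

def has_run_of_heads(flips, run_length):
    """True if `flips` (booleans, True = heads) contains `run_length` heads in a row."""
    streak = 0
    for is_heads in flips:
        streak = streak + 1 if is_heads else 0
        if streak >= run_length:
            return True
    return False

def estimate_run_probability(trials, n_flips=100, run_length=10, seed=None):
    """Flip `n_flips` coins `trials` times and report the share of trials with a long run."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        if has_run_of_heads(flips, run_length):
            hits += 1
    return hits / trials

print(estimate_run_probability(20_000, seed=1))
```

The estimate should land somewhere around 4 to 5 percent; more trials just tighten it.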

Time for computers

This is really pretty easy to code up. Create a two-dimensional array. Dimension one will keep track of each simulation. Dimension two will track each of the 40 Monopoly squares.

For each simulation, for each user in the simulation, for each of the moves, move them around the board.

To move them around the board we just roll two dice and move them along, wrapping back around past Go.

Finally it's just a matter of averaging up the values for each square across our simulations and voila.
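A minimal Python sketch of the loop described above. It assumes plain dice movement only (no Jail, no cards) and counts every square a guest lands on, one landing per move. The function names and the party size in the example are hypothetical, not taken from the original gist.

```python
import random

N_SQUARES = 40

def simulate(n_sims, n_guests, n_moves, seed=None):
    """Return landings[sim][square]: how many landings each square got in each simulation."""
    rng = random.Random(seed)
    landings = [[0] * N_SQUARES for _ in range(n_sims)]
    for sim in range(n_sims):
        for _guest in range(n_guests):
            position = 0  # everyone starts on Go
            for _move in range(n_moves):
                roll = rng.randint(1, 6) + rng.randint(1, 6)  # two dice
                position = (position + roll) % N_SQUARES
                landings[sim][position] += 1
    return landings

def average_per_square(landings):
    """Average landings per square across all simulations."""
    n_sims = len(landings)
    return [sum(column) / n_sims for column in zip(*landings)]

landings = simulate(n_sims=1000, n_guests=30, n_moves=4, seed=42)
averages = average_per_square(landings)
print(max(range(N_SQUARES), key=lambda square: averages[square]))  # busiest square index
```

Keeping the full `landings` table around, rather than only the averages, is what makes it possible to ask other questions of the same simulations later.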


So now the big question: Did we answer our original question? Do we know how much food to buy?
Say we're planning on serving 4 courses. Do we feel good about figuring out how many hens we would need for an even distribution, then buying 5.44x the cornish game hens for 'Chance' and 3.8x the caviar for the B&O Railroad?

What do you think?

See my answer in The Statistics of Monopoly with Respect to Cornish Game Hen Provisioning: Part 2 "Probability is a bitch"

Friday, August 16, 2013

hbase scan: batch vs cache

Here's today's contribution to the Internet: tl;dr When it comes to HBase scanner settings, you want caching, not batchsize. Maybe this is totally clear to everyone else. But as one of those who are 'newer to hbase', I can never quite remember what I'm doing.
Say you've got this code:
Scan s = new Scan(startKey);
s.setCaching(foo);
s.setBatch(bar);
// ResultScanner is an interface, so ask your table for one (assumes an HTable named `table`)
ResultScanner scanner = table.getScanner(s);
for (final Result r : scanner) {
  // stuff
}
But you're clever and you don't want to do RPC calls to HBase for every row. You might even say you'd like to 'batch' the results from your scanner. 

So you read http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html:

public void setBatch(int batch)
Set the maximum number of values to return for each call to next()

public void setCaching(int caching)
Set the number of rows for caching that will be passed to scanners. If not set, the default setting from HTable.getScannerCaching() will apply. Higher caching values will enable faster scanners but will use more memory.

Annnd.... not sure. I mean, I only want one Result every time I call next() in my iterator, right? What would a number >1 even mean?

And I'm sure I shouldn't set 'caching'. That sounds like it will 'cache' something. I want to read the real stuff.

But you do want caching. Caching is how many rows come back from the server in one RPC round trip, so your scanner isn't calling HBase for every single row.

Ok. Fine. Caching got named poorly.  What is batch?

Batch is in case you have super wide rows. Say you have 250 columns. Batch of 100 would give your iterator:
  • Iteration 1: Result id 0. Columns 0-99
  • Iteration 2: Result id 0. Columns 100-199
  • Iteration 3: Result id 0. Columns 200-249
  • Iteration 4: Result id 1. Columns 0-99
  • Iteration 5: Result id 1. Columns 100-199
Or at least that's what http://twitter.com/monkeyatlarge told me.
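To make the iteration list above concrete, here is a toy Python model of the batch behavior. It is not the HBase client API, just a sketch of the slicing, and `iterate_with_batch` is a made-up name. Caching is not modeled because it only changes how many rows travel per round trip, not what the iterator hands you.

```python
def iterate_with_batch(rows, batch):
    """Yield (row_id, columns) chunks holding at most `batch` columns each."""
    for row_id, columns in rows:
        for start in range(0, len(columns), batch):
            yield row_id, columns[start:start + batch]

# Two hypothetical wide rows with 250 columns each.
rows = [(row_id, list(range(250))) for row_id in range(2)]

for i, (row_id, cols) in enumerate(iterate_with_batch(rows, 100), start=1):
    print(f"Iteration {i}: Result id {row_id}. Columns {cols[0]}-{cols[-1]}")
```

The takeaway for the model is the same as for the real scanner: with batch set, one row can arrive as several partial Results, so code that assumes one Result per row will miscount.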

Wednesday, June 12, 2013

Github gists on blogger

I've been using https://github.com/moski/gist-Blogger/ to display gists in blogger like:

The only problem with this was that I was including a link to the raw GitHub file, which was served with a MIME type of text/plain, which caused some browsers to not load the JS. The solution is to use GitHub Pages apparently, but that's a small pita to set up, so I hereby share the results of my toils.

Step 1: Create your gist

Step 2: Add a div to your blog post

<div class="gistLoad" data-id="5561359" id="gist-5561359"> </div>

Step 3: Add this script to your blog post
<script src="http://jdwyah.github.io/gist-Blogger/javascript/gistLoader.js" type="text/javascript"></script>

Step 4: Profit

Saturday, May 11, 2013

Postgres Common Table Expression Super Example (with a hot little window function action too)

Common table expressions are a PostgreSQL user's best friend. Let me show you them.

The Example Problem:

Let's take a look at how to build SymptomsLikeMe. This is something that would look at a bunch of health reports and figure out who has symptoms that are most like mine.

The data model is pretty simple. A user has many symptom surveys. Each symptom survey has a symptom_id (ie pain, nausea, fatigue) and a severity (1,2,3,4).

So, here's the challenge in more detail. Given all of a user's latest symptom surveys (1 for fatigue, 4 for pain), compare that to all other latest symptom surveys and produce a similarity score (least squares). Return the top 10 closest symptom reports.

So what needs to happen?

  1. Get all of the latest symptom surveys
  2. Get just this user's latest symptom surveys
  3. Compare that to all other latest symptom surveys and produce a similarity score (using least squares).
  4. Return the top 10 closest symptom reports.

How to do it?

Well, that's a lot to do, right? Let's look at a couple approaches.
  • In Ruby? Wow. There are ~= 1,788,141 symptom reports eh? So we serialize them all into ruby and...  no.
  • We'll denormalize it!   Urm... I'm not really sure what that would mean. We explode the combinatorial space of all symptoms to symptoms....  ouch.
  • ??????


Let's forget trying to do the whole problem in one massive statement and just do the four steps we listed out above. 

Step 1 "Get all of the latest symptom surveys"

Huh? Why not use a NOT EXISTS? Well, because NOT EXISTS has a bad habit of returning more than one row when things start colliding, and then we have to tie-break collisions or suffer the consequences of duplicate counts. rank = 1 guarantees us that we'll only get 1 result, no matter how many dupes there are (as long as the ranking itself breaks ties, e.g. row_number(), or rank() with a unique column at the end of the ORDER BY).

Step 2 "Given this user's latest symptom survey"

easy peasy.

Step 3 "compare that to all other latest symptom surveys and produce a similarity score"

So to compute the similarity, we need to compare our user vs. each other user. If only there were a way to JOIN our user's info onto each row of another user's data... oo a join!

Ok, so that's the join and then we just want the difference of the two columns 'minus' and we square that (for least squares) and um, sum(). 

Easy peasy!!

Step 4 "Return the top 10 closest symptom reports."

All together now

But then we need all 4 pieces of this to work together and... omg we are done.
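Here is one way the four steps might fit together, run against an in-memory SQLite database from Python so it is self-contained. The original targets PostgreSQL, and the table name, column names and sample rows below are assumptions for illustration, not the original gists. It needs SQLite 3.25+ for window functions.

```python
import sqlite3

QUERY = """
WITH ranked AS (              -- number each user's surveys per symptom, newest first
    SELECT user_id, symptom_id, severity,
           row_number() OVER (PARTITION BY user_id, symptom_id
                              ORDER BY created_at DESC, id DESC) AS recency_rank
    FROM symptom_surveys
),
latest AS (                   -- step 1: all of the latest symptom surveys
    SELECT user_id, symptom_id, severity FROM ranked WHERE recency_rank = 1
),
mine AS (                     -- step 2: just this user's latest surveys
    SELECT symptom_id, severity FROM latest WHERE user_id = :user_id
),
scores AS (                   -- step 3: sum of squared differences per other user
    SELECT latest.user_id AS user_id,
           SUM((latest.severity - mine.severity) * (latest.severity - mine.severity)) AS score
    FROM latest
    JOIN mine ON mine.symptom_id = latest.symptom_id
    WHERE latest.user_id <> :user_id
    GROUP BY latest.user_id
)
SELECT user_id, score         -- step 4: the 10 closest
FROM scores
ORDER BY score ASC, user_id ASC
LIMIT 10
"""

def closest_users(conn, user_id):
    return conn.execute(QUERY, {"user_id": user_id}).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE symptom_surveys (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "symptom_id INTEGER, severity INTEGER, created_at TEXT)"
)
# Hypothetical data: symptom 1 = pain, symptom 2 = fatigue.
conn.executemany(
    "INSERT INTO symptom_surveys (user_id, symptom_id, severity, created_at) VALUES (?, ?, ?, ?)",
    [
        (1, 1, 2, "2013-04-01"), (1, 1, 4, "2013-05-01"), (1, 2, 1, "2013-05-01"),
        (2, 1, 4, "2013-05-02"), (2, 2, 2, "2013-05-02"),
        (3, 1, 1, "2013-05-03"), (3, 2, 1, "2013-05-03"),
        (4, 1, 3, "2013-04-15"), (4, 1, 4, "2013-05-04"), (4, 2, 1, "2013-05-04"),
    ],
)
print(closest_users(conn, 1))  # → [(4, 0), (2, 1), (3, 9)]
```

Note that the inner join only compares symptoms both users have reported, so a user who shares just one symptom with you is not penalized for the ones they are missing. Whether that is the right call depends on the data.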


  • Common table expressions work to decompose SQL problems into manageable hunks.
  • Read Steps 1-4 backwards and it's just a bunch of unfunded mandates that you need to fill in.
  • I really didn't tell you how window functions work. Sorry. They're great. But they need another blog post to really explain. See http://www.postgresql.org/docs/9.1/static/tutorial-window.html which is really pretty good.

Decoding the "Two Weeks" estimate

The Psychology of Overconfidence

Dan Milstein has a nice write up of his thoughts on estimation: Coding, Fast and Slow: Developers and the Psychology of Overconfidence

To try to summarize,

  1. "Writing Software = Learning Something You Don’t Know When You Start"
  2. We are systematically, provably, overconfident.
  3. That said, we can get decent at estimating things that will take ~0-4 hours.
  4. But there's no way to get good at quick-estimation of things > 8 hours, because you need really quick feedback loops to hone this skill.
  5. Sadly adding up 100 4 hour tasks does not equal an accurate estimate of a large project.

So I totally agree that there's value to using 'System I' to make quick gut check estimates on small things. And I agree that spending very much time deeply estimating a project with 'System II' is pretty useless.

What I want to try to defend is the "2 week", "1 month" size gut check estimates. I'm definitely not going to argue that they're something to bet money on, but I think that used properly they can be useful.

Decoding the proverbial "Two Weeks"

So here's my secret decoder ring for programmer estimates. "Two weeks" really means "In two weeks, I'll be able to tell you what is really happening". Maybe 50% of the time that's because you will have actual working software. But the rest of the time it means you'll have an engineer who now better understands why this will take another month. Or six. Or simply a day of cleanup.

"Two weeks" really means: "In two weeks, I'll be able to tell you what is really happening".

So how can this help you?

I may doubt an engineer's confidence in their estimation, but I feel a lot better about relying on their hatred of inefficiency. And I think that's part of what you're getting with a "2 week estimate." You're getting "Anything less than 2 weeks is going to be an inefficient use of my time."

So how do we translate a developer saying "2 months"? This is big. So big that it's going to take me two months to figure out the scope of what this thing is. That's right: if you want an accurate estimate of Project X, I've got news for you. If somebody says "2 months" they just told you it will take "2 months" to have an accurate estimate.

Is there any good news?

Yes! If we change what we hear, I think estimates can actually speed up the process and reduce churn from shifting priorities. Take the developer's "2 weeks", then don't bug them for 2 weeks. If you are going to ask them a question about schedule, reduce it to this simple one: "What percent sure are you that this will be done?" 80% means things are good. A seasoned developer will say 80% even if the code is written and tested and sales has signed off, and the CEO loves it, but marketing just wants to take "a quick look at the copy". Because a seasoned developer has seen 20% of their "done" projects still slip right here.

If the developer says they're only 50% sure, you can start planning the reprioritization meeting. But leave the engineers doing what they're doing. Make sure they're focussed on the totality of the problem. And at the end of their proverbial "two weeks" you will have a concrete and legitimate estimation about your project.

How to Frame "2 Week"+ Projects
So if we decide to be honest with ourselves and redefine the nature of "two weeks" projects, is that the best we can do? Actually I think we can reap even greater rewards by clearly framing projects for developers in this light. If we say "You said '2 weeks'? Ok, do project X, you have 2 weeks" we are likely to get: a whole bunch of code, slapped together in the last half of the second week, and a feature that appears to work but has an unknown number of bugs, has not been user tested, and may or may not really be on the right track.

However, if we say "Work on project X, you have 2 weeks to report back to me what it's going to take to ship this with 95% confidence that it's a big win for customers", well, I think you get a really different result. I think you're going to have an engineer inclined to think critically about the problem, not about how to deliver "something" in "2 weeks". And what does that mean? Well, if they're any good it means you're going to get a combination of design work, code spikes, feasibility checks and a list of distinct, manageable 4 hour tasks that aren't finished yet.

In my experience, estimates up to 2 developers & 2 weeks can be relatively accurate about getting 'something' shipped. But they absolutely require an "after-party" story to clean up. Baking this into the expectation from the outset can help engineers focus on the most critical bits first. Whether that's ensuring the API behaves properly, spiking out the critical path, or badgering the customer to figure out whether the feature is of any use at all.

In closing, I say to you "Go forth and shout your estimate from your hip!". (Just tell your PM what you really mean)

P.S. I should perhaps point out that this is all based on previous jobs. I haven't actually 'estimated' anything in the past 9 months at my new job. We just 'do' stuff. Crazy I know.

Friday, February 08, 2013

Mock HBase for Unit Testing

HBase Checklist:

  • Reliably store terabytes of data across umpteen cloudy shards?   Check. 
  • Backend for all your map reduce needs?   Check. 
  • Failover across region servers?   Check. 
  • Still work locally after your laptop goes to sleep?   Not so much. 

One repercussion of HBase's recalcitrance is that I find local development can be a bit of a pita when tests are written against an HBase instance that needs restarting every time I get coffee.

Beyond that it would be nice to be able to write some unit tests against HBase and not worry about configuring it. What I really want is a Mock implementation of HBase that just runs in memory.

Mocking HBase for fun & Profit

Happily there's a great gist out there which does just this. Thanks, Internet, you're the best. And here's the gist https://gist.github.com/agaoglu/613217. I'm not sure if it's perfect, but it's worked for whatever I've thrown at it so far.
