
Saturday, August 31, 2013

The Statistics of Monopoly with Respect to Cornish Game Hen Provisioning: Part 2 "Probability is a bitch"

In part one we figured out the average likelihood of a guest ending up on any particular square. So what's the problem with that?

The problem can be summed up in one easy-to-remember phrase: "YOLO".


So we ran 100,000 simulations; that seems like it should be enough, right? Maybe we should do one million to be more accurate? Nope, that's not the problem. The problem is that we're not throwing one million dinner parties. Or even 100,000. We're only throwing one dinner party. And frankly, anything could happen.

Just because the expected value over the long run says that we'll need 5.44x the cornish game hens, this doesn't mean that the actual dinner party won't have 30 guests just haphazardly roll 12s on their first roll, throwing our expectations into turmoil.

So what is a Culinary Experience creator to do?

It turns out that Monte Carlo works really well here too. Since we recorded all 100,000 simulations, we can ask the question "How many game hens do I need to buy in order to have enough in 95% of simulations?" Obviously we can change the percentage we use here too. The average is actually just saying "How many game hens do I need to buy in order to have enough in 50% of simulations?" Which is pretty much like saying "How can I run out of game hens HALF OF THE TIME!"

Gimme Code:

Get an array of the pretty names for the squares.

SQUARES is something like:
What we really want is to put all the Baltic Avenues together. Put all the B&O Railroads together. You know, kinda 'zip' each of these arrays together.

Now zipped is:

Then we process the results:
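The original gists aren't shown here, so here's a minimal sketch of what that zip-and-summarize step might look like in Python. The `results` data is a fabricated stand-in (one list of 40 landing counts per simulation), and all the names are hypothetical:

```python
import random

random.seed(42)

# Stand-in for the recorded simulations: one list per simulation,
# each holding landing counts for the 40 Monopoly squares.
NUM_SIMS, NUM_SQUARES = 10_000, 40
results = [[random.randint(0, 10) for _ in range(NUM_SQUARES)]
           for _ in range(NUM_SIMS)]

# "Zip" the simulations together: each entry of per_square is one
# square's counts across every simulation.
per_square = list(zip(*results))

def summarize(counts):
    """Return (average, 95th percentile, max observed) for one square."""
    ordered = sorted(counts)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return (sum(counts) / len(counts), p95, max(counts))

summary = [summarize(square) for square in per_square]
```

The 95th-percentile lookup is just "sort the counts and index 95% of the way in," which is all you need once every simulation has been recorded.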


These results have the average in column one, the 95th percentile in column two, and the max observed in column three. So what's the result? Well, say we run 4 moves. The average on Chance was 5.44x. But if we want to provision enough food with 95% certainty that there will be enough, we're going to need 9x. And out of 100,000 simulations, one simulation had 15x the number of cornish game hens on Chance. That sure doesn't make it easy to plan the menu.


But what if we want to play by the Monopoly rules? Well, then we just change our move function and run things again. This time you can see the super high prevalence of Jail and a bit of a secondary bump ~7 squares after Jail.
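The updated gist isn't shown either, but a minimal sketch of that change might look like this. The names are hypothetical, and this only handles the Go To Jail square; the full rules would also cover doubles and the "go to" Chance/Community Chest cards:

```python
import random

GO_TO_JAIL, JAIL, NUM_SQUARES = 30, 10, 40  # standard board positions

def move(position):
    """One plain move: roll two dice and advance."""
    return (position + random.randint(1, 6) + random.randint(1, 6)) % NUM_SQUARES

def move_with_rules(position):
    """Same move, but landing on Go To Jail sends the guest to Jail.
    (Doubles and the 'go to' cards are left out of this sketch.)"""
    position = move(position)
    return JAIL if position == GO_TO_JAIL else position
```

Since no guest can ever end a move on Go To Jail, all of its probability mass gets dumped on Jail, which is exactly the big Jail spike in the rerun.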

Friday, August 30, 2013

The Statistics of Monopoly with Respect to Cornish Game Hen Provisioning

Let's pretend that you need to throw a once in a lifetime culinary spectacle in Panama. If you're @ashinyknife, this will be no problem.

Let's pretend you decide upon a Monopoly theme. Generally, N guests start out on Go, roll dice, and end up on a Monopoly square.

Let's pretend that each square has a wholly different gastronomic creation on it.

Given the above, how many cornish game hens should we expect to buy for St. Charles Place? How much caviar will we need to supply the B&O Railroad?

These are the important questions that we will set out to answer today.


Our first approach might look something like this: http://statistics.about.com/od/ProbHelpandTutorials/a/Probability-And-Monopoly.htm

Basic probability: round 1 is reasonable. Round 2 makes sense... oh gawd, round 3 starts to get hard to keep track of.

Monte Carlo

So what should we do? It seems to me that the appropriate technique to use here is Monte Carlo simulation. What is Monte Carlo? Honestly, Monte Carlo should be pretty attractive to those of us for whom Probability 101 was a long time ago. Basically, "Monte Carlo simulation" means "let's just see what really happens". Say I ask you to figure out the probability that when flipping a coin 100 times I get at least one run of 10 heads. You've got two choices:

1) Figure out the appropriate math.
2) Flip a coin 100 times. Figure out if you get 10 heads in a row. Do this 1 million times and calculate the percentage of times when it was true.

Option 2 is Monte Carlo.
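Option 2 is also about ten lines of code. Here's a minimal sketch in Python (the names are mine, not from any gist):

```python
import random

def has_run_of_heads(flips, run_length=10):
    """True if the flip sequence contains run_length heads in a row."""
    streak = 0
    for flip in flips:
        streak = streak + 1 if flip == 'H' else 0
        if streak >= run_length:
            return True
    return False

def estimate(trials=100_000):
    """Flip 100 coins, check for a 10-heads run, repeat, take the rate."""
    hits = 0
    for _ in range(trials):
        flips = random.choices('HT', k=100)
        if has_run_of_heads(flips):
            hits += 1
    return hits / trials
```

Run it and the estimate lands somewhere around 4% — no combinatorics required, just patience from your CPU.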

Time for computers

This is really pretty easy to code up. Create a two-dimensional array. Dimension one will keep track of each simulation. Dimension two will track each of the 40 Monopoly squares.

For each simulation, for each user in the simulation, for each of the moves, move them around the board.

To move a guest around the board, we just roll two dice and advance them that many squares.

Finally, it's just a matter of averaging up the values for each square across our simulations and voilà.
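The gist isn't shown here, so here's a minimal sketch of that loop in Python. The guest and move counts are assumptions (30 guests, 4 moves — one per course, matching the numbers used later), and every name is hypothetical:

```python
import random

NUM_SIMS, NUM_GUESTS, NUM_MOVES, NUM_SQUARES = 1_000, 30, 4, 40

def roll():
    """Roll two six-sided dice."""
    return random.randint(1, 6) + random.randint(1, 6)

random.seed(0)
# Dimension one: each simulation. Dimension two: the 40 squares.
landings = [[0] * NUM_SQUARES for _ in range(NUM_SIMS)]

for sim in range(NUM_SIMS):
    for guest in range(NUM_GUESTS):
        position = 0  # everyone starts on Go
        for _ in range(NUM_MOVES):
            position = (position + roll()) % NUM_SQUARES
            landings[sim][position] += 1

# Average landings per square across all simulations.
averages = [sum(sim[square] for sim in landings) / NUM_SIMS
            for square in range(NUM_SQUARES)]
```

Every simulation distributes exactly NUM_GUESTS × NUM_MOVES landings across the board, so the averages always sum to the total number of plates you're serving.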


So now the big question: Did we answer our original question? Do we know how much food to buy?
Say we're planning on serving 4 courses. Do we feel good about figuring out how many hens we would need for an even distribution, then buying 5.44x the cornish game hens for 'Chance' and 3.8x the caviar for the B&O Railroad?

What do you think?

See my answer in The Statistics of Monopoly with Respect to Cornish Game Hen Provisioning: Part 2 "Probability is a bitch"

Friday, August 16, 2013

hbase scan: batch vs cache

Here's today's contribution to the Internet: tl;dr when it comes to HBase scanner settings, you want caching, not batch size. Maybe this is totally clear to everyone else. But for those of us who are 'newer to HBase', I can never quite remember what I'm doing.
Say you've got this code:
Scan s = new Scan(startKey);
s.setCaching(foo);
s.setBatch(bar);
ResultScanner scanner = table.getScanner(s);
for (final Result r : scanner) {
  // stuff
}
But you're clever and you don't want to do RPC calls to HBase for every row. You might even say you'd like to 'batch' the results from your scanner. 

So you read http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html:

public void setBatch(int batch)
Set the maximum number of values to return for each call to next()

public void setCaching(int caching)
Set the number of rows for caching that will be passed to scanners. If not set, the default setting from HTable.getScannerCaching() will apply. Higher caching values will enable faster scanners but will use more memory.
Annnd.... not sure. I mean, I only want one Result every time I call next() in my iterator, right? What would a number >1 even mean?

And I'm sure I shouldn't set 'caching'; that sounds like it will 'cache' something, and I want to read the real stuff.

But you do want caching. Caching is how many rows come back in one RPC from your scanner.

Ok. Fine. Caching got named poorly.  What is batch?

Batch is in case you have super wide rows. Say a row has 250 columns. A batch of 100 would give your iterator:
  • Iteration 1: Result id 0. Columns 0-99
  • Iteration 2: Result id 0. Columns 100-199
  • Iteration 3: Result id 0. Columns 200-249
  • Iteration 4: Result id 1. Columns 0-99
  • Iteration 5: Result id 1. Columns 100-199
Or at least that's what http://twitter.com/monkeyatlarge told me.