jquery

Saturday, May 11, 2013

Postgres Common Table Expression Super Example (with a hot little window function action too)



Common table expressions are a PostgreSQL user's best friend. Let me show you them.

The Example Problem:

Let's take a look at how to build SymptomsLikeMe. This is something that would look at a bunch of health reports and figure out who has symptoms that are most like mine.

The data model is pretty simple. A user has many symptom surveys. Each symptom survey has a symptom_id (ie pain, nausea, fatigue) and a severity (1,2,3,4).

So, here's the challenge in more details. Given all of a user's latest symptom surveys (1 for fatigue, 4 for pain), compare that to all other latest symptom surveys and produce a similarity score (least square). Return the top 10 closest symptom reports.

So what needs to happen?

  1. Get all of the latests symptom surveys
  2. Get just this users's latest symptom survey
  3. Compare that to all other latest symptom surveys and produce a similarity score (using least squares).
  4. Return the top 10 closest symptom reports.

How to do it?

Well, that's a lot to do, right? Let's look at a couple approaches.
  • In Ruby? Wow. There are ~= 1,788,141 symptom reports eh? So we serialize them all into ruby and...  no.
  • We'll denormalize it!   Urm... I'm not really sure what that would mean. We explode the combinatorial space of all symptoms to symptoms....  ouch.
  • ??????

PostgreSQL!!!!

Let's forget trying to do the whole problem in one massive statement and just do the four steps we listed out above. 

Step 1 "Get all of the latest symptom surveys"


huh? Why not use a not exist well, because not exists have a bad habit of returning more than one row when things starts colliding and we have to tie break collisions or suffer the consequences of duplicate counts. rank = 1 guarantees us that we'll only get 1 result, no matter how many dupes there are.


Step 2 "Given this user's latest symptom survey"


easy peasy.

Step 3 "compare that to all other latest symptom surveys and produce a similarity score"

So to compute the similarity, we need to compare our user, vs each user. If only there were a way to JOIN our user's info onto each row of a another user's data... oo a join!

Ok, so that's the join and then we just want the difference of the two columns 'minus' and we square that (for least squares) and um, sum(). 

Easy peasy!!

Step 4 "Return the top 10 closest symptom reports."


All together now

But then we need all 4 pieces of this to work together and... omg we are done.

Conclusion


  • CommonTableExpressions work to decompose SQL problems into manageable hunks.
  • Read Steps 1-4 backwards and it's just a bunch of unfunded mandates that you need to fill in.
  • I really didn't tell you how window functions work. Sorry. They're great. But they need another blog post to really explain. See http://www.postgresql.org/docs/9.1/static/tutorial-window.html which is really pretty good.


Decoding the "Two Weeks" estimate


The Psychology of Overconfidence

Dan Milstein has a nice write up of his thoughts on estimation: Coding, Fast and Slow: Developers and the Psychology of Overconfidence


To try to summarize,

  1. "Writing Software = Learning Something You Don’t Know When You Start"
  2. We are systematically, provably, overconfident.
  3. That said, we can get decent at estimating things that will take ~0-4 hours.
  4. But there's no way to get good at quick-estimation of things > 8 hours, because you need really quick feedback loops to hone this skill.
  5. Sadly adding up 100 4 hour tasks does not equal an accurate estimate of a large project.


So I totally agree that there's value to using 'System I' to make quick gut check estimates on small things. And I agree that spending very much time deeply estimating a project with 'System II' is pretty useless.

What I want to try to defend is the "2 week", "1 month" size gut check estimates. I'm definitely not going to argue that they're something to bet money on, but I think that used properly they can be useful.


Decoding the proverbial "Two Weeks"

So here's my secret decoder ring for programmer estimates. "Two weeks" really means "In two weeks, I'll be able to tell you what is really happening". Maybe 50% of the time that's because it you will have actual working software. But the rest of the time it means you'll have an engineer who now better understands why this will take another month. Or six. Or simply a day of cleanup.

"Two weeks" really means: "In two weeks, I'll be able to tell you what is really happening".

So how can this help you?

I may doubt an engineer's confidence in their estimation, but I feel a lot better about relying on their hatred of inefficiency. And I think that's part of what you're getting with a "2 week estimate." You're getting "Anything less than 2 weeks is going to be an inefficient use of my time."

So how do we translate a developer saying "2 months"? This is big. So big that it's going to take me two months to figure out the scope of what this thing is. That's right, you want an accurate estimate of Project X I've got news for you. If somebody says "2 months" they just told you it will take "2 months" to have an accurate estimate.

Is there any good news?

Yes! If we change what we hear I think estimates can actually speed up process and reduce churn from shifting priorities. Take the developer's "2 weeks" then don't bug them for 2 weeks. If you are going to ask them a question about schedule reduce it to this simple one: "what percent sure are you that this will be done". 80% means things are good. A seasoned developer will say 80% even if the code is written and tested and sales has signed off, and the CEO loves it, but marketing just wants to take "a quick look at the copy". Because a sesoned developer has seen 20% of their "done" projects still slip right here.

If the developer says they're only 50% sure you can start planning the reprioritizatuon meeting. But leave the engineers doing what they're doing. Make sure they're focussed on the totality of the problem. And at the end of their proverbial "two weeks" you will have a concrete and legitimate estimation about your project.


How to Frame "2 Week"+ Projects
So if we decide to be honest with ourselves and re-defined the nature of "two weeks" projects, is that the best we can do? Actually I think we can reap even greater rewards by clearly framing projects for developers in this light. If we say "You said '2 weeks'? Ok, do project X, you have 2 weeks" we are likely to get: a whole bunch of code, slapped together in the last half of the second week, a feature that appears to work, but has an unknown number of bugs and has not been user tested and may or may not really be on the right track.

However if we say "Work on project X, you have 2 weeks to report back to me what it's going to take to ship this with 95% confidence that it's a big win for customers". Well I think you get a really different result. I think you're going to have an engineer inclined to think critically about the problem, not about how to deliver "something" in "2 weeks". And what does that mean? Well if they're any good it means you're going to get a combination of design work, code spikes, feasability and a list of distinct manageable 4 hour tasks that aren't finished yet.

In my experience, estimates up to 2 developers & 2 weeks can be relatively accurate about getting 'something' shipped. But they absolutely require an "after-party" story to clean up. Baking this into the expectation from the outset can help engineers focus on the most critical bits first. Whether that's ensuring the API behaves properly, spiking out the critical path, or badgering the customer to figure out whether the feature is of any use at all.


In closing, I say to you "Go forth and shout your estimate from your hip!". (Just tell your PM what you really mean)





P.S. I should perhaps point out that this is all based on previous jobs. I haven't actually 'estimated' anything in the past 9 months at my new job. We just 'do' stuff. Crazy I know.


Friday, February 08, 2013

Mock HBase for Unit Testing

HBase Checklist:

  • Reliably store terabytes of data across umpteen cloudy shards?   Check. 
  • Backend for all your map reduce needs?   Check. 
  • Failover across region servers?   Check. 
  • Still work locally after your laptop goes to sleep?   Not so much. 

One repercussion of HBase's recalcitrance is that I find local development can be a bit of a pita when tests written against an HBase instance that needs a restarting everytime I get coffee.

Beyond that it would be nice to be able to write some unit tests against HBase and not worry about configuring it. What I really want is a Mock implementation of HBase that just runs in memory.

Mocking HBase for fun & Profit


Happily there's a great gist out there which does just this. Thanks, Internet, you're the best. And here's the gist https://gist.github.com/agaoglu/613217. I'm not sure if it's perfect, but it's worked for whatever I've thrown at it so far.

Loading ....

Saturday, August 25, 2012

Patella Video from RailsConf live

Patella: Memoization into memcached via resque.

Woot! The video of my RailsConf talk is up on Confreaks. Happily PLM is actually using the patella gem in production now too, so you can go check that out too. In unrelated news I'm going to be checking into Demosthenes School for Mumblers.

Monday, June 25, 2012

Patella

I can't believe I forgot to self-promote after my RailsConf talk. The talk isn't up yet on Confreaks, but you can get a teaser video In the Spotlight with Jeff Dwyer on Vimeo. Additionally I put the PPT up on SlideShare here: Memoization into Memcached done in Resque. Enjoy :)

Wednesday, April 06, 2011

WillPaginate in Half the Time

Finally got around to writing up a cool PostgreSQL trick on the PatientsLikeMe tech blog. WillPaginate in Half the Time.

Monday, November 22, 2010

Reader 9.4.1! a Adobe Reader Version History

I'm very happy to welcome Adobe Acrobat Reader version 9.4.1 to the world. I'm so excited about this new version. In fact, I'm so excited that I thought I'd take this as a chance to look back at some of the previous versions of Reader that we all know and love.

Version : Features
Reader 1.0 : Opens PDFs!! Also Crashes
Reader 1.1 : Still opens PDFs. More Crashes
Reader 1.2 : Still opens PDFs. Fewer Crashes
Reader 1.3 : Still opens PDFs. More Crashes
Reader 2.0 : Still opens PDFs. Fewer Crashes
Reader 2.1 : Still opens PDFs. More Crashes
Reader 2.2 : Still opens PDFs. More Crashes
Reader 3.0 : Still opens PDFs. Fewer Crashes
Reader 3.1 : Still opens PDFs. Fewer Crashes
Reader 3.2 : Still opens PDFs. Fewer Crashes
Reader 3.3 : Still opens PDFs. Fewer Crashes
Reader 4.0 : Still opens PDFs. More Crashes. New Logo!
Reader 4.1 : Still opens PDFs. More Crashes
Reader 4.2 : Still opens PDFs. Fewer Crashes
Reader 4.3 : Still opens PDFs. More Crashes
Reader 5.0 : Still opens PDFs. More Crashes
Reader 5.1 : Still opens PDFs. Fewer Crashes
Reader 5.2 : Still opens PDFs. More Crashes
Reader 6.0 : Still opens PDFs. Fewer Crashes
Reader 6.1 : Still opens PDFs. More Crashes
Reader 6.2 : Still opens PDFs. More Crashes
Reader 6.3 : Still opens PDFs. Fewer Crashes
Reader 7.0 : Still opens PDFs. Fewer Crashes. New logo!
Reader 7.1 : Still opens PDFs. More Crashes
Reader 7.2 : Still opens PDFs. Fewer Crashes
Reader 8.0 : Still opens PDFs. More Crashes
Reader 8.1 : Still opens PDFs. Fewer Crashes
Reader 8.2 : Still opens PDFs. Fewer Crashes
Reader 8.3 : Still opens PDFs. Fewer Crashes
Reader 9.0 : Still opens PDFs. Fewer Crashes
Reader 9.1 : Still opens PDFs. Fewer Crashes
Reader 9.2 : Still opens PDFs. Fewer Crashes
Reader 9.3 : Still opens PDFs. More Crashes
Reader 9.4 : ???? A brave new world!

Wednesday, October 27, 2010

Dealing with Git Merge Revisions

Forgot to astroturf my post on the PLM tech blog. Everything you wanted to know about Dealing with Git Merge Revisions but were afraid to ask.

Tuesday, December 08, 2009

Decrypting Ruby AES Encryption

Want to encrypt something in Ruby and decrypt it in another language like AS3 using As3Crypto? Sure you can encrypt 'test string' and 'foo, without any problems. But say you want to encode something long.. like 16 character long. After much suffering, the following worked for me.



The main gotcha is that Base64.encode64 has a nasty habit of sprinkling newline characters to your strings. I can't imagine why it does this, particularly when there's a method b64encode(bin, len = 60) that is meant to do this, but I can tell you that AES sure doesn't like you adding random characters to ciphertexts.

Of course rails already thought of this and a method:
encode64s exists. So you could use encode64s instead of encode64, although I would argue that they should have called it something a bit more explicit eg Base64.encode64InAWayThatIsActuallyDecodesToTheSameStringYouStartedWith(string). Just a thought.

Anyway this As3Crypto Demo is a sweet way to test out the decrypt-ability of your encryption complete with all manners of paddings, encryption modes and schemes. Highly recommended.

Tuesday, November 17, 2009

Smoke

I SIT in a chair and wait for my IDE to respond.

Millions of lines of code are run unnecessarily, indices are created and forgotten, gigabytes of ram are expended, the developers of RubyMine, Netbeans, Aptana chase the windmills of autocompletion, smoke pours across the fields of circuits and the spectre of kill -9 looms.

I sit in a chair and wait for my IDE to respond.




(with infinite apologies to Carl Sandburg who wrote the thundering poem mimicked here)