The Example Problem:
Let's take a look at how to build SymptomsLikeMe. This is something that would look at a bunch of health reports and figure out who has symptoms that are most like mine.
The data model is pretty simple. A user has many symptom surveys. Each symptom survey has a symptom_id (ie pain, nausea, fatigue) and a severity (1,2,3,4).
So, here's the challenge in more details. Given all of a user's latest symptom surveys (1 for fatigue, 4 for pain), compare that to all other latest symptom surveys and produce a similarity score (least square). Return the top 10 closest symptom reports.
So what needs to happen?
- Get all of the latests symptom surveys
- Get just this users's latest symptom survey
- Compare that to all other latest symptom surveys and produce a similarity score (using least squares).
- Return the top 10 closest symptom reports.
How to do it?
Well, that's a lot to do, right? Let's look at a couple approaches.
- In Ruby? Wow. There are ~= 1,788,141 symptom reports eh? So we serialize them all into ruby and... no.
- We'll denormalize it! Urm... I'm not really sure what that would mean. We explode the combinatorial space of all symptoms to symptoms.... ouch.
Let's forget trying to do the whole problem in one massive statement and just do the four steps we listed out above.
Step 1 "Get all of the latest symptom surveys"
huh? Why not use a not exist well, because not exists have a bad habit of returning more than one row when things starts colliding and we have to tie break collisions or suffer the consequences of duplicate counts. rank = 1 guarantees us that we'll only get 1 result, no matter how many dupes there are.
Step 2 "Given this user's latest symptom survey"
Step 3 "compare that to all other latest symptom surveys and produce a similarity score"
So to compute the similarity, we need to compare our user, vs each user. If only there were a way to JOIN our user's info onto each row of a another user's data... oo a join!
Ok, so that's the join and then we just want the difference of the two columns 'minus' and we square that (for least squares) and um, sum().
Step 4 "Return the top 10 closest symptom reports."
All together now
But then we need all 4 pieces of this to work together and... omg we are done.
- CommonTableExpressions work to decompose SQL problems into manageable hunks.
- Read Steps 1-4 backwards and it's just a bunch of unfunded mandates that you need to fill in.
- I really didn't tell you how window functions work. Sorry. They're great. But they need another blog post to really explain. See http://www.postgresql.org/docs/9.1/static/tutorial-window.html which is really pretty good.