Postgres Common Table Expression Super Example (with a hot little window function action too)
Common table expressions are a PostgreSQL user's best friend. Let me show you them.
The Example Problem:
Let's take a look at how to build SymptomsLikeMe. This is something that would look at a bunch of health reports and figure out who has symptoms that are most like mine.
The data model is pretty simple. A user has many symptom surveys. Each symptom survey has a symptom_id (ie pain, nausea, fatigue) and a severity (1,2,3,4).
So, here's the challenge in more details. Given all of a user's latest symptom surveys (1 for fatigue, 4 for pain), compare that to all other latest symptom surveys and produce a similarity score (least square). Return the top 10 closest symptom reports.
So what needs to happen?
- Get all of the latests symptom surveys
- Get just this users's latest symptom survey
- Compare that to all other latest symptom surveys and produce a similarity score (using least squares).
- Return the top 10 closest symptom reports.
How to do it?
Well, that's a lot to do, right? Let's look at a couple approaches.
- In Ruby? Wow. There are ~= 1,788,141 symptom reports eh? So we serialize them all into ruby and... no.
- We'll denormalize it! Urm... I'm not really sure what that would mean. We explode the combinatorial space of all symptoms to symptoms.... ouch.
- ??????
PostgreSQL!!!!
Let's forget trying to do the whole problem in one massive statement and just do the four steps we listed out above.
Step 1 "Get all of the latest symptom surveys"
huh? Why not use a not exist well, because not exists have a bad habit of returning more than one row when things starts colliding and we have to tie break collisions or suffer the consequences of duplicate counts. rank = 1 guarantees us that we'll only get 1 result, no matter how many dupes there are.
Step 2 "Given this user's latest symptom survey"
easy peasy.
Step 3 "compare that to all other latest symptom surveys and produce a similarity score"
So to compute the similarity, we need to compare our user, vs each user. If only there were a way to JOIN our user's info onto each row of a another user's data... oo a join!
Ok, so that's the join and then we just want the difference of the two columns 'minus' and we square that (for least squares) and um, sum().
Easy peasy!!
Step 4 "Return the top 10 closest symptom reports."
All together now
But then we need all 4 pieces of this to work together and... omg we are done.
Conclusion
- CommonTableExpressions work to decompose SQL problems into manageable hunks.
- Read Steps 1-4 backwards and it's just a bunch of unfunded mandates that you need to fill in.
- I really didn't tell you how window functions work. Sorry. They're great. But they need another blog post to really explain. See http://www.postgresql.org/docs/9.1/static/tutorial-window.html which is really pretty good.