Here's today's contribution to the Internet.

tl;dr: When it comes to HBase scanner settings, you want caching, not batch size.

Maybe this is totally clear to everyone else, but for those of us who are 'newer to HBase', I can never quite remember what I'm doing.
Say you've got this code:
  Scan s = new Scan(startKey);
  s.setCaching(foo);
  s.setBatch(bar);
  ResultScanner scanner = table.getScanner(s);  // ResultScanner comes from your HTable, not a constructor
  for (final Result r : scanner) {
    // stuff
  }
But you're clever, and you don't want to make an RPC call to HBase for every single row. You might even say you'd like to 'batch' the results from your scanner.

So you read http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html:

  public void setBatch(int batch)
  Set the maximum number of values to return for each call to next()

  public void setCaching(int caching)
  Set the number of rows for caching that will be passed to scanners. If not set, the default setting from HTable.getScannerCaching() will apply. Higher caching values will enable faster scanners but will use more memory.
Annnd.... not sure. I mean, I only want one Result every time I call next() in my iterator, right? What would a number >1 even mean?

And I'm sure I shouldn't set 'caching'; that sounds like it will 'cache' something, and I want to read the real stuff.

But you do want caching. Caching is how many rows come back in one batch (that is, per RPC) from your scanner.
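For example, here's a minimal sketch of that ("my_table", the start key, and the caching value of 500 are all made-up; I'm using the plain old HTable API). With caching set, each trip to a region server hauls back a pile of rows, and most calls to next() are served out of client-side memory:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CachingSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "my_table");       // made-up table name

      Scan s = new Scan(Bytes.toBytes("startKey"));
      s.setCaching(500);   // fetch up to 500 rows per RPC to the region server
      // no setBatch() here: every Result is still one complete row

      ResultScanner scanner = table.getScanner(s);
      int rows = 0;
      for (final Result r : scanner) {
        rows++;            // roughly one RPC per 500 of these, not one per row
      }
      scanner.close();
      table.close();
      System.out.println("Scanned " + rows + " rows");
    }
  }

The trade-off is memory on the client and the region server, which is why the javadoc warns that higher caching values use more memory.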

Ok. Fine. Caching got named poorly. What is batch?

Batch is for when you have super-wide rows. Say a row has 250 columns. A batch of 100 would give your iterator:
  • Iteration 1: Result id 0. Columns 0-99
  • Iteration 2: Result id 0. Columns 100-199
  • Iteration 3: Result id 0. Columns 200-249
  • Iteration 4: Result id 1. Columns 0-99
  • Iteration 5: Result id 1. Columns 100-199
Or at least that's what http://twitter.com/monkeyatlarge told me.
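If you want to watch that happen, here's a rough sketch (again with made-up names: a table "wide_table" whose rows have 250 columns in family "cf"). Each 250-column row should show up as three Results with the same row key: 100 cells, 100 cells, then 50:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BatchSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "wide_table");     // made-up table name

      Scan s = new Scan();
      s.addFamily(Bytes.toBytes("cf"));                  // made-up family name
      s.setBatch(100);    // at most 100 cells (columns) per Result
      s.setCaching(50);   // still fetch a decent chunk per RPC

      ResultScanner scanner = table.getScanner(s);
      for (final Result r : scanner) {
        // a 250-column row prints three times: 100, 100, then 50 cells
        System.out.println(Bytes.toString(r.getRow()) + " -> " + r.size() + " cells");
      }
      scanner.close();
      table.close();
    }
  }

And unless your rows really are that wide, you can skip setBatch() entirely and just set caching.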