-
Notifications
You must be signed in to change notification settings - Fork 14
Home
In this document we describe the problem that Indexed HBase (IHBase) is attempting to solve, how it works and how best to take advantage of it. It assumes that the reader is already familiar with HBase concepts. We strongly recommend that readers first familiarize themselves with HBase with Lars George’s excellent HBase Architecture 101 article before proceeding.
The Problem
HBase is extremely good at retrieving data when a specific row key or range of row keys is known. In the case where you want to find rows based on the contents of a column, HBase is slow and resource intensive. In order to find the proverbial needle in the haystack HBase must resort to iterating over all rows to find those that contain the desired column value. This contribution attempts to improve the speed of scans while still maintaining an immediately consistent view of the table and honoring the existing scan contract.
The Solution
The solution that IHBase has taken is to reduce the number of rows that are scanned by using indices and index hints. Indices are placed on columns and maintained at each individual region. A client can then provide an index expression hint along with a scan to aid the regions in reducing the scope of the scan.
As described in the HBase Architecture 101 article, any data written to HBase is held in an in-memory store called the MemStore. When the MemStore reaches its configured maximum size a background thread flushes the in-memory data to a Store file. It’s at this point that the indexing process begins. Once the store file has been written but before the store file is made available to the region, an index rebuild of the rows held in the region begins. The indexing process, still in the background thread, scans the new and existing store files and creates indices on columns specified in the table definition.
As mentioned earlier, one of the goals of IHBase is to maintain an immediately consistent view of the data in a table. That means that any scans should be able to see rows that were written to the table just milliseconds before. In order to provide that consistent view of the data the rows held in the MemStore are not indexed.
TODO: EXPLAIN THE INDEX DATA STRUCTURE
From a client’s perspective a scan operates in much the same way they do with standard HBase scans with the added option of an index expression hint. Any results that match the scan are returned in key order and the filter provided on the scan still has the ability to influence the scan as before. The major difference is that the filter will potentially have to process a lot less rows.
The major caveat is that the index hint does not guarantee that only rows matching the index expression are included in the scan results. It ensures that the scan will get all rows that match the hint but it will probably receive rows that don’t. For this reason it is suggested that a filter that at least mirrors the criteria in the index expression hint is provided with the scan.
As the scan reaches a region the region queries the indices for a set of row keys that match. The scan then proceeds down to the core HBase scan logic, but after each row is processed by the scan a hook in IHBase is called to fast-forward the scan between potential matches.