Friday, May 29, 2015

Chaos Wrangler project, Phase 1: Coming up with a simplified picture of the HBase "Data Model"

Public domain image - Wikipedia, Jason Hise
It's only been a few days since I've launched my "Chaos Wrangler" project -- to see what I can do to help bring HBase metadata into the realm of controllability (or even just plain knowability) for enterprises who are beginning to use HBase as the back-end repository for their "Big Data" projects!

While my mind is already racing far down the road, envisioning what would be required to create both passive and active metadata discovery mechanisms and repositories to plug into HBase, it is vital that I take some time in these early days to be sure that I thoroughly understand the structures of HBase (both physical and conceptual). While I did find a useful resource or two, I found nothing that boiled things down into terms that I felt I could fully internalize, so I decided to trudge through the main reference guide on the Apache HBase site and derive for myself a simplified picture of what the HBase data model consists of and how it works.

As part of this exploratory process, I've begun writing a Java-based "analogy" of the HBase data model. I'm taking the concepts that are presented and discussed in the reference guide (e.g., "table", "column prefix", "cell", etc.) and building them into a small set of interrelated Java classes. This is proving useful for my understanding, and when I publish the code soon to gist.github.com/dvimont, my hope is that it might help to explain HBase to anybody who has prior understanding of basic Java and RDBMS concepts.

EDIT 2015-06-04: The code snippet mentioned in the preceding sentence is now available on Github

In the comments above my Java code, I began to write a few explanatory paragraphs, which soon included a character-based diagram of my conception of the HBase data model. Here are those comments, including the character-based diagram. The comments stand alone, apart from the Java code, as my first offering to the HBase community from the Chaos Wrangler project. My explanation is provided as a potential complement to more rigorous and comprehensive expositions of the HBase data model.

===============
The following is a Java-oriented analogy to the structures that comprise the HBase data model. It is presented under the assumption that the reader has a basic mastery of both Java and RDBMS concepts.

The task of coming to an understanding of HBase data structures is unfortunately made much more difficult than it otherwise might be, due to the fact that HBase uses the terms TABLE, COLUMN, and ROW, but the structures that are referred to by these names bear little resemblance to their RDBMS namesakes.

In fact, an HBase ROW is completely misnamed to the extent that the common term "row" is usually associated with a one-dimensional container of singular instances of data, whether it be a row in a spreadsheet or a row in an RDBMS table. In stark contrast, an HBase ROW could be conceived of as a container of an array of arrays of arrays!! (No wonder people run into difficulty understanding HBase data structures!)

Here is a conceptual representation of the hierarchy of structures in an HBase table. (Note that all relationships shown below are one-to-many -- e.g., one TABLE contains multiple COLUMN_FAMILY_PREFIX instances, one CELL contains multiple CELL_ENTRY instances, etc.):

A TABLE and its component COLUMN_FAMILY_PREFIXes are immutably defined when the TABLE is created. (While a TABLE could theoretically have an unlimited number of COLUMN_FAMILY_PREFIXes, the official reference says that the physical realities of the HBase architecture enforce a practical maximum of no more than two or three per TABLE!)  All lower-level constructs (from ROW on down) are created and maintained by an application via the HBase "put" method. Thus, crucially, all of these lower-level constructs (including so-called COLUMN_QUALIFIERs) are treated as application-managed DATA, and not as database-managed METADATA!!

There are a number of immediately apparent ramifications of this "shifting" of column-name-maintenance (not to mention column-datatype-maintenance) from the metadata-realm to the data-realm: not only does this potentially upend certain basic assumptions about application development and metadata management, but it could also call for a redefining/realignment of classical roles within IT organizations, which may otherwise be accustomed to an explicit separation of duties between DBAs (who are the traditional custodians of metadata) and developers (who are traditionally responsible for building applications that manipulate only data - not metadata!).


No comments:

Post a Comment