Showing posts with label hbase. Show all posts

Sunday, February 21, 2016

My first HBase project milestone reached: a new HBase-oriented Maven archetype infrastructure, authored by me, is now part of the core product

Image courtesy of NASA
My first adventure in contributing to the big-data-oriented open-source software world has reached its first major milestone: a new infrastructure for development and management of HBase-oriented Maven archetypes (authored by me) has now been pushed through to the main branch of the Apache HBase project (which is to say, it is now a formal part of HBase). While my offering is not yet publicly available (that still requires a backport to the next 1.x release of the product), it is formally "in the can", and I'm very pleased!

Maven archetypes, to me, represent some of the most useful tools in state-of-the-art Java development. Generally speaking, they allow quick and robust automatic generation of a complete development environment (in Maven project format) relating to a specific technology (e.g., Hadoop, MySQL, etc.), often including fully-functional example code, so that a developer can get right to work understanding and using the technology, while spending close to zero time on the often-frustrating task of getting a workspace properly configured within their IDE. All of that is automatically and instantaneously done for them by the archetype.
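To give a feel for what that looks like in practice, here is the general shape of generating a project from an archetype. (The coordinates below are illustrative placeholders only -- the actual HBase archetypes had not yet been deployed to the Central Repository when this was written, so don't take these artifact IDs as gospel.)

```shell
# Generate a complete, ready-to-build project from an archetype.
# Coordinates shown are hypothetical examples, not published artifacts.
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.hbase \
  -DarchetypeArtifactId=hbase-client-project-archetype \
  -DarchetypeVersion=1.3.0 \
  -DgroupId=com.example \
  -DartifactId=my-hbase-app \
  -Dversion=1.0-SNAPSHOT \
  -DinteractiveMode=false
```

One command, and the developer has a buildable Maven project with working example code to start from.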

When I saw the request go out for development of HBase-oriented archetypes (as an entry in the HBase project's JIRA system), I claimed it almost immediately. The fact that it was posted by one of the most prominent members of the project's management committee suggested to me that it was no idle wish-list item, but a serious request for development. I knew absolutely nothing at that point about Maven archetypes, nor how to develop them; but neither, apparently, did anybody else among all current HBase contributors. It would be my chance to study up on a new-to-me technology and apply it in a completely new-to-the-world (but very-useful-to-the-world) way. In other words -- it was a chance to do yet again what I've successfully done so many times before in my software engineering career with other technologies, but this time in an open-source context!

Not long after beginning to acclimate myself to the Maven archetype world, I realized that if I simply created the archetypes being requested and "dropped them into" the HBase project, then in all future builds of HBase there would be no proper testing of significant components of the archetypes. Just as problematically, it would leave things difficult to manage in the future for those who would need to (1) maintain the existing archetypes and (2) create new archetypes for different purposes. The available tools for producing Maven archetypes did not seem to lend themselves to the full software development lifecycle, and they are not particularly well-documented in all necessary regards. So my additional goal became to provide a very simple and well-documented infrastructure for development, maintenance, and future additions to a collection of HBase-oriented Maven archetypes.

In the end, I have provided an infrastructure which (1) provides for ongoing testing of all components of each archetype as new HBase builds are done, and (2) allows future HBase contributors to add new archetypes in a straightforward way, via the brief documentation provided in the README.md found in the hbase-archetypes subproject's root directory.

Now we proceed with backporting the HBase-oriented archetypes to an upcoming 1.x minor release (likely 1.3), and then deploying the archetypes to the Central Maven Repository, allowing for public access to them.

Besides this, I have a few other projects on my plate, all in the open-source realm, consisting of:
  • a new Java-collection-oriented tool-set that I am currently finalizing, which I plan to ultimately offer up to the Apache Commons project, and
  • another HBase-oriented tool-set (unrelated to archetypes) that I'll be publishing on my own and then offering up to the HBase project as a candidate for adoption into the main product.
Other than that, as Mr. Bennet said, I am quite at my leisure, and will entertain any serious proposals (or even idle-but-intriguing proposals) for future software engineering engagements. Please feel free to contact me at:

daniel [at] commonvox [dot] org

Saturday, June 13, 2015

A "Hello World" for HBase (a brief tutorial Java program)

As I continue my research and development efforts in the general vicinity of HBase, it's become clear to me how difficult it is to find good introductory resources -- even something as simple as a "Hello World" program -- to serve as the launchpad for a newbie's explorations into the workings of the HBase client API. Not finding any such thing (at least nothing that was usable), I decided to cook up my own.

Without further ado, here it is...

(You might find it easiest to view it on its original page on Github.)
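For readers who can't get to the embedded gist, the following is a minimal sketch of the same idea -- not the original program, but a bare-bones "Hello World" against the HBase 1.x client API. It assumes hbase-client is on the classpath and an HBase instance is reachable via the default configuration; the table and column names here are my own illustrative choices.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HelloHBase {
    public static void main(String[] args) throws IOException {
        TableName tableName = TableName.valueOf("hello_table");
        byte[] family = Bytes.toBytes("cf");
        byte[] qualifier = Bytes.toBytes("greeting");

        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            // Create the table (with a single column family) if it doesn't exist yet.
            if (!admin.tableExists(tableName)) {
                admin.createTable(new HTableDescriptor(tableName)
                        .addFamily(new HColumnDescriptor(family)));
            }

            try (Table table = connection.getTable(tableName)) {
                // Write one cell: rowkey "row1", column "cf:greeting".
                table.put(new Put(Bytes.toBytes("row1"))
                        .addColumn(family, qualifier, Bytes.toBytes("Hello, HBase!")));

                // Read the cell back and print its value.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(result.getValue(family, qualifier)));
            }
        }
    }
}
```

The essential rhythm -- get a Connection, get a Table, build a Put or Get keyed on raw bytes -- is the same no matter how elaborate the program becomes.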

Friday, May 29, 2015

Chaos Wrangler project, Phase 1: Coming up with a simplified picture of the HBase "Data Model"

Public domain image - Wikipedia, Jason Hise
It's only been a few days since I launched my "Chaos Wrangler" project -- to see what I can do to help bring HBase metadata into the realm of controllability (or even just plain knowability) for enterprises that are beginning to use HBase as the back-end repository for their "Big Data" projects!

While my mind is already racing far down the road, envisioning what would be required to create both passive and active metadata discovery mechanisms and repositories to plug into HBase, it is vital that I take some time in these early days to be sure that I thoroughly understand the structures of HBase (both physical and conceptual). While I did find a useful resource or two, I found nothing that boiled things down into terms that I felt I could fully internalize, so I decided to trudge through the main reference guide on the Apache HBase site and derive for myself a simplified picture of what the HBase data model consists of and how it works.

As part of this exploratory process, I've begun writing a Java-based "analogy" of the HBase data model. I'm taking the concepts that are presented and discussed in the reference guide (e.g., "table", "column prefix", "cell", etc.) and building them into a small set of interrelated Java classes. This is proving useful for my understanding, and when I publish the code soon to gist.github.com/dvimont, my hope is that it might help to explain HBase to anybody who has prior understanding of basic Java and RDBMS concepts.

EDIT 2015-06-04: The code snippet mentioned in the preceding sentence is now available on Github

In the comments above my Java code, I began to write a few explanatory paragraphs, which soon included a character-based diagram of my conception of the HBase data model. Here are those comments, including the character-based diagram. The comments stand alone, apart from the Java code, as my first offering to the HBase community from the Chaos Wrangler project. My explanation is provided as a potential complement to more rigorous and comprehensive expositions of the HBase data model.

===============
The following is a Java-oriented analogy to the structures that comprise the HBase data model. It is presented under the assumption that the reader has a basic mastery of both Java and RDBMS concepts.

The task of coming to an understanding of HBase data structures is unfortunately made much more difficult than it otherwise might be, due to the fact that HBase uses the terms TABLE, COLUMN, and ROW, but the structures that are referred to by these names bear little resemblance to their RDBMS namesakes.

In fact, an HBase ROW is completely misnamed to the extent that the common term "row" is usually associated with a one-dimensional container of singular instances of data, whether it be a row in a spreadsheet or a row in an RDBMS table. In stark contrast, an HBase ROW could be conceived of as a container of an array of arrays of arrays!! (No wonder people run into difficulty understanding HBase data structures!)

The hierarchy of structures in an HBase table can be conceived of as follows. (Note that all relationships in the hierarchy are one-to-many -- e.g., one TABLE contains multiple COLUMN_FAMILY_PREFIX instances, one CELL contains multiple CELL_ENTRY instances, etc.)

A TABLE and its component COLUMN_FAMILY_PREFIXes are immutably defined when the TABLE is created. (While a TABLE could theoretically have an unlimited number of COLUMN_FAMILY_PREFIXes, the official reference says that the physical realities of the HBase architecture enforce a practical maximum of no more than two or three per TABLE!)  All lower-level constructs (from ROW on down) are created and maintained by an application via the HBase "put" method. Thus, crucially, all of these lower-level constructs (including so-called COLUMN_QUALIFIERs) are treated as application-managed DATA, and not as database-managed METADATA!!
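The "container of containers" nature of this model can be sketched in plain Java. The following is my own simplified analogy (not HBase code, and the method names are mine): a table modeled as a sorted map of row keys, each mapping to a sorted map of column families, each mapping to a sorted map of column qualifiers, each mapping to a sorted map of timestamped versions.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class DataModelAnalogy {
    // table: rowKey -> (family -> (qualifier -> (timestamp -> value)))
    static NavigableMap<String, NavigableMap<String,
            NavigableMap<String, NavigableMap<Long, String>>>> table = new TreeMap<>();

    // Analogous to HBase "put": creates every level of the hierarchy on demand,
    // which is why qualifiers behave as application-managed DATA, not metadata.
    static void put(String row, String family, String qualifier, long ts, String value) {
        table.computeIfAbsent(row, k -> new TreeMap<>())
             .computeIfAbsent(family, k -> new TreeMap<>())
             .computeIfAbsent(qualifier, k -> new TreeMap<>())
             .put(ts, value);
    }

    // Analogous to a default HBase "get": the latest version (highest timestamp) wins.
    static String get(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions =
                table.getOrDefault(row, new TreeMap<>())
                     .getOrDefault(family, new TreeMap<>())
                     .getOrDefault(qualifier, new TreeMap<>());
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        put("row1", "cf1", "greeting", 1L, "Hello");
        put("row1", "cf1", "greeting", 2L, "Hello, again");  // a second version of the same cell
        System.out.println(get("row1", "cf1", "greeting"));  // prints "Hello, again"
    }
}
```

Note how nothing in this structure records which qualifiers exist across the table as a whole -- a qualifier springs into existence the moment a put() references it, which is precisely the metadata problem the Chaos Wrangler project is aimed at.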

There are a number of immediately apparent ramifications of this "shifting" of column-name-maintenance (not to mention column-datatype-maintenance) from the metadata-realm to the data-realm: not only does this potentially upend certain basic assumptions about application development and metadata management, but it could also call for a redefining/realignment of classical roles within IT organizations, which may otherwise be accustomed to an explicit separation of duties between DBAs (who are the traditional custodians of metadata) and developers (who are traditionally responsible for building applications that manipulate only data - not metadata!).


Saturday, May 23, 2015

Yes Mr. Bloor, there IS a Hadoop "metadata mess", but it ain't "waiting in the wings" -- IT'S HERE (and here's what I intend to do about it)

I've spent the last week getting a great introduction to Hadoop technologies by studying the excellent book, HADOOP: THE DEFINITIVE GUIDE, by Tom White. At this point on my trip up the learning curve, I'm both impressed and distressed:

  • impressed by the straightforward architecture Hadoop (and the technology stack built upon it) uses to manage both distributed data storage and data processing, but...
  • distressed by the apparent lack of any means of documenting or even passively discovering fundamental information about the data being stored. 

It boils down to an apparent absence of any technology to serve as either an active or passive metadata repository (data dictionary, to use the old parlance).

Unfortunately, it may well be the case that a majority of the individuals who have bothered to read this far don't even know what I'm talking about, perhaps not even being sure what "metadata" means in this context and why anyone would be worried about the lack of a means to manage it. If that's where you find yourself, please take a couple of minutes and read Robin Bloor's concise and clear exposition of the problem in his October 2013 analysis piece "Hadoop: Is There a Metadata Mess Waiting in the Wings?"; then please come on back here and read my remaining few paragraphs -- it's okay, I'll wait.

Given that the majority of my IT career up until 2008 was spent in the creation of tools for metadata extraction, transformation, and mapping between disparate architectures, I am particularly horrified at the state of metadata management in HBase (a distributed database that rides atop Hadoop and HDFS structures, and is somewhat reminiscent of ADABAS in its internal structures). Not only is it apparent that no tools exist to discover and maintain a directory to the fundamental structures of any HBase table (i.e., its columns), but there appears to be no awareness that such a thing is even needed (much less that it is, in my opinion, supremely vital). The only mention of such things that I've been able to find on the Web so far is in a posting to a Hortonworks Q&A page, in which someone very sensibly asks how one might obtain a list of the columns in an HBase table. They are very succinctly informed by one of Hortonworks' experts that it simply CANNOT BE DONE!!

Yes, at this point in the evolution of HBase, apparently you can put anything you want to into it (in huge, exceedingly diverse quantities), but when the next person comes along and wants to utilize the data you've put there, you'd better be sure to hand them all those cocktail napkins upon which you jotted down the names and structures of the columns in your HBase tables. Otherwise you may be looking at a future where all those exabytes of data may just have to sit there as folks idly stare at it, scratch their heads, and wonder "what the hell is all this?".

I say all this by way of explanation of my soon-to-commence R&D project: to build a passive data dictionary for HBase which will discover the names (and where feasible, the structures) of the columns in any given HBase table, storing the results in (of course!) an HBase table. My working name for this open-source project is "Chaos Wrangler".
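To make concrete why the experts say it "cannot be done": there is no metadata call for listing columns, so the only way to discover them is brute force -- scan every row and collect each family:qualifier pair you encounter. A sketch of that brute-force approach against the HBase 1.x client API (a running cluster is required, and the table name here is a hypothetical stand-in):

```java
import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnDiscovery {
    public static void main(String[] args) throws IOException {
        TreeSet<String> columns = new TreeSet<>();
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("some_table"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            // No metadata shortcut exists: every row must be read to find
            // the full set of family:qualifier combinations in use.
            for (Result result : scanner) {
                for (Cell cell : result.rawCells()) {
                    columns.add(Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                              + Bytes.toString(CellUtil.cloneQualifier(cell)));
                }
            }
        }
        columns.forEach(System.out::println);
    }
}
```

In practice one would at least add a KeyOnlyFilter to the Scan to avoid shipping cell values over the wire, but even so, this remains a full-table scan over potentially exabytes of data -- which is exactly why a persistent, passively-maintained dictionary is needed instead.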

EDIT 2016-07-17: The open-source project, ColumnManager for HBase, is now in beta-02, with binaries and documentation available both on GitHub (where the project is housed) and via the Maven Central Repository!