Saturday, May 23, 2015

Yes Mr. Bloor, there IS a Hadoop "metadata mess", but it ain't "waiting in the wings" -- IT'S HERE (and here's what I intend to do about it)

I've spent the last week getting a great introduction to Hadoop technologies by studying the excellent book, HADOOP: THE DEFINITIVE GUIDE, by Tom White. At this point on my trip up the learning curve, I'm both impressed and distressed:

  • impressed by the straightforward architecture Hadoop (and the technology stack built upon it) uses to manage both distributed data storage and data processing, but...
  • distressed by the apparent lack of any means of documenting or even passively discovering fundamental information about the data being stored. 

It boils down to an apparent absence of any technology to serve as either an active or passive metadata repository (data dictionary, to use the old parlance).

Unfortunately, it may well be the case that a majority of the individuals who have bothered to read this far don't even know what I'm talking about, perhaps not even being sure what "metadata" means in this context and why anyone would be worried about the lack of a means to manage it. If that's where you find yourself, please take a couple of minutes and read Robin Bloor's concise and clear exposition of the problem in his October 2013 analysis piece "Hadoop: Is There a Metadata Mess Waiting in the Wings?"; then please come on back here and read my remaining few paragraphs -- it's okay, I'll wait.

Given that the majority of my IT career up until 2008 was spent in the creation of tools for metadata extraction, transformation, and mapping between disparate architectures, I am particularly horrified at the state of metadata management in HBase (a distributed database that rides atop Hadoop and HDFS structures, and is somewhat reminiscent of ADABAS in its internal structures). Not only is it apparent that no tools exist to discover and maintain a directory to the fundamental structures of any HBase table (i.e., its columns), but there appears to be no awareness that such a thing is even needed (much less that it is, in my opinion, supremely vital). The only mention of such things that I've been able to find on the Web so far is in a posting to a Hortonworks Q&A page, in which someone very sensibly asks how one might obtain a list of the columns in an HBase table. They are very succinctly informed by one of Hortonworks' experts that it simply CANNOT BE DONE!!

Yes, at this point in the evolution of HBase, apparently you can put anything you want to into it (in huge, exceedingly diverse quantities), but when the next person comes along and wants to utilize the data you've put there, you'd better be sure to hand them all those cocktail napkins upon which you jotted down the names and structures of the columns in your HBase tables. Otherwise you may be looking at a future where all those exabytes of data may just have to sit there as folks idly stare at it, scratch their heads, and wonder "what the hell is all this?".

I say all this by way of explanation of my soon-to-commence R&D project: to build a passive data dictionary for HBase which will discover the names (and where feasible, the structures) of the columns in any given HBase table, storing the results in (of course!) an HBase table. My working name for this open-source project is "Chaos Wrangler".
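
To make the idea of "passive discovery" a bit more concrete, here is a minimal Python sketch. It is not the project's implementation and does not talk to HBase at all: the rows are modeled as plain in-memory tuples, and the names `discover_columns` and `sample_rows` are hypothetical, invented for this illustration. The assumption it rests on is that column families are declared with the table, while column qualifiers are whatever each writer chose to put in each row, so the only way to list them is to look at every row.

```python
from collections import defaultdict


def discover_columns(rows):
    """Return {column_family: sorted list of qualifiers} seen across all rows.

    `rows` is an iterable of (row_key, cells) pairs, where `cells` maps
    (family, qualifier) tuples to values. This stands in for a full table
    scan; no real HBase client is involved.
    """
    seen = defaultdict(set)
    for _row_key, cells in rows:
        for family, qualifier in cells:
            seen[family].add(qualifier)
    return {family: sorted(qualifiers) for family, qualifiers in seen.items()}


# Hypothetical sample data: two rows that do not share the same columns.
sample_rows = [
    ("row1", {("cf", "name"): b"Ann", ("cf", "email"): b"ann@example.com"}),
    ("row2", {("cf", "name"): b"Bob", ("meta", "source"): b"import"}),
]

catalog = discover_columns(sample_rows)
print(catalog)  # {'cf': ['email', 'name'], 'meta': ['source']}
```

Even this toy version shows why the cocktail napkins matter: nothing in row1 hints that a "meta:source" column exists, so a reader who samples only part of the table gets an incomplete dictionary.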

EDIT 2016-07-17: The open-source project, ColumnManager for HBase, is now in beta-02, with binaries and documentation available both on GitHub (where the project is housed) and via the Maven Central Repository!
