Friday, May 29, 2015

Chaos Wrangler project, Phase 1: Coming up with a simplified picture of the HBase "Data Model"

Public domain image - Wikipedia, Jason Hise
It's only been a few days since I launched my "Chaos Wrangler" project -- to see what I can do to help bring HBase metadata into the realm of controllability (or even just plain knowability) for enterprises that are beginning to use HBase as the back-end repository for their "Big Data" projects!

While my mind is already racing far down the road, envisioning what would be required to create both passive and active metadata discovery mechanisms and repositories to plug into HBase, it is vital that I take some time in these early days to be sure that I thoroughly understand the structures of HBase (both physical and conceptual). While I did find a useful resource or two, I found nothing that boiled things down into terms that I felt I could fully internalize, so I decided to trudge through the main reference guide on the Apache HBase site and derive for myself a simplified picture of what the HBase data model consists of and how it works.

As part of this exploratory process, I've begun writing a Java-based "analogy" of the HBase data model. I'm taking the concepts that are presented and discussed in the reference guide (e.g., "table", "column prefix", "cell", etc.) and building them into a small set of interrelated Java classes. This is proving useful for my understanding, and when I publish the code soon, my hope is that it might help to explain HBase to anybody who has prior understanding of basic Java and RDBMS concepts.

EDIT 2015-06-04: The code snippet mentioned in the preceding sentence is now available on GitHub.

In the comments above my Java code, I began to write a few explanatory paragraphs, which soon included a character-based diagram of my conception of the HBase data model. Here are those comments, including the character-based diagram. The comments stand alone, apart from the Java code, as my first offering to the HBase community from the Chaos Wrangler project. My explanation is provided as a potential complement to more rigorous and comprehensive expositions of the HBase data model.

The following is a Java-oriented analogy to the structures that comprise the HBase data model. It is presented under the assumption that the reader has a basic mastery of both Java and RDBMS concepts.

Coming to an understanding of HBase data structures is unfortunately much more difficult than it might otherwise be, because HBase uses the terms TABLE, COLUMN, and ROW, but the structures that these names refer to bear little resemblance to their RDBMS namesakes.

In fact, the name ROW is a misnomer: the common term "row" usually denotes a one-dimensional container of single data values, whether it be a row in a spreadsheet or a row in an RDBMS table. In stark contrast, an HBase ROW could be conceived of as a container of an array of arrays of arrays!! (No wonder people run into difficulty understanding HBase data structures!)

Here is a conceptual representation of the hierarchy of structures in an HBase table. (Note that all relationships shown below are one-to-many -- e.g., one TABLE contains multiple COLUMN_FAMILY_PREFIX instances, one CELL contains multiple CELL_ENTRY instances, etc.):

A TABLE and its component COLUMN_FAMILY_PREFIXes are defined as part of the table's schema, normally when the TABLE is created; changing them afterward is an administrative schema operation, not something an application's ordinary writes can do. (While a TABLE could theoretically have an unlimited number of COLUMN_FAMILY_PREFIXes, the official reference says that the physical realities of the HBase architecture enforce a practical maximum of no more than two or three per TABLE!) All lower-level constructs (from ROW on down) are created and maintained by an application via the HBase "put" method. Thus, crucially, all of these lower-level constructs (including so-called COLUMN_QUALIFIERs) are treated as application-managed DATA, and not as database-managed METADATA!!
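To make the "qualifiers are data" point concrete, here is a minimal sketch in plain Java. It is only an illustration under my own assumptions: the nesting (row key, then column family, then column qualifier, then timestamped cell entries) is one reading of the hierarchy described above, the class name ToyHBaseTable and its methods are hypothetical, values are simplified to Strings (HBase itself stores byte arrays), and none of this is the HBase client API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

/** Toy in-memory analogy of an HBase table (hypothetical; not the HBase client API). */
class ToyHBaseTable {
    // Declared up front: the only column-level "metadata" the table itself knows about.
    private final Set<String> columnFamilies;

    // Everything below is data: rowKey -> family -> qualifier -> timestamp -> value.
    // Values are Strings here for readability; HBase stores byte arrays.
    private final NavigableMap<String,
            Map<String, NavigableMap<String, NavigableMap<Long, String>>>> rows = new TreeMap<>();

    ToyHBaseTable(String... families) {
        this.columnFamilies =
                Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(families)));
    }

    /** Analogue of a "put": the row, qualifier, and cell entry are created on demand. */
    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        if (!columnFamilies.contains(family)) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    /** Returns the newest entry of one cell, or null if the cell does not exist. */
    String getLatest(String rowKey, String family, String qualifier) {
        Map<String, NavigableMap<String, NavigableMap<Long, String>>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<String, NavigableMap<Long, String>> columns = row.get(family);
        if (columns == null) return null;
        NavigableMap<Long, String> entries = columns.get(qualifier);
        return entries == null ? null : entries.lastEntry().getValue();
    }

    public static void main(String[] args) {
        ToyHBaseTable table = new ToyHBaseTable("profile");
        table.put("user1", "profile", "email", 1L, "old@example.com");
        table.put("user1", "profile", "email", 2L, "new@example.com");
        // A brand-new "column" appears with no schema change at all.
        table.put("user2", "profile", "nickname", 1L, "dv");
        System.out.println(table.getLatest("user1", "profile", "email")); // prints new@example.com
    }
}
```

Note that the only thing the constructor pins down is the set of column families. The qualifier "nickname" exists solely because some application wrote it, which is exactly why no catalog knows about it.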

There are a number of immediately apparent ramifications of this "shifting" of column-name-maintenance (not to mention column-datatype-maintenance) from the metadata-realm to the data-realm: not only does this potentially upend certain basic assumptions about application development and metadata management, but it could also call for a redefining/realignment of classical roles within IT organizations, which may otherwise be accustomed to an explicit separation of duties between DBAs (who are the traditional custodians of metadata) and developers (who are traditionally responsible for building applications that manipulate only data - not metadata!).

Saturday, May 23, 2015

Yes Mr. Bloor, there IS a Hadoop "metadata mess", but it ain't "waiting in the wings" -- IT'S HERE (and here's what I intend to do about it)

I've spent the last week getting a great introduction to Hadoop technologies by studying the excellent book, HADOOP: THE DEFINITIVE GUIDE, by Tom White. At this point on my trip up the learning curve, I'm both impressed and distressed:

  • impressed by the straightforward architecture Hadoop (and the technology stack built upon it) uses to manage both distributed data storage and data processing, but...
  • distressed by the apparent lack of any means of documenting or even passively discovering fundamental information about the data being stored. 

It boils down to an apparent absence of any technology to serve as either an active or passive metadata repository (data dictionary, to use the old parlance).

Unfortunately, it may well be the case that a majority of the individuals who have bothered to read this far don't even know what I'm talking about, perhaps not even being sure what "metadata" means in this context and why anyone would be worried about the lack of a means to manage it. If that's where you find yourself, please take a couple of minutes and read Robin Bloor's concise and clear exposition of the problem in his October 2013 analysis piece "Hadoop: Is There a Metadata Mess Waiting in the Wings?"; then please come on back here and read my remaining few paragraphs -- it's okay, I'll wait.

Given that the majority of my IT career up until 2008 was spent in the creation of tools for metadata extraction, transformation, and mapping between disparate architectures, I am particularly horrified at the state of metadata management in HBase (a distributed database that rides atop Hadoop and HDFS structures, and is somewhat reminiscent of ADABAS in its internal structures). Not only is it apparent that no tools exist to discover and maintain a directory to the fundamental structures of any HBase table (i.e., its columns), but there appears to be no awareness that such a thing is even needed (much less that it is, in my opinion, supremely vital). The only mention of such things that I've been able to find on the Web so far is in a posting to a Hortonworks Q&A page, in which someone very sensibly asks how one might obtain a list of the columns in an HBase table. They are very succinctly informed by one of Hortonworks' experts that it simply CANNOT BE DONE!!

Yes, at this point in the evolution of HBase, apparently you can put anything you want into it (in huge, exceedingly diverse quantities), but when the next person comes along and wants to utilize the data you've put there, you'd better be sure to hand them all those cocktail napkins upon which you jotted down the names and structures of the columns in your HBase tables. Otherwise you may be looking at a future where all those exabytes of data may just have to sit there as folks idly stare at it, scratch their heads, and wonder "what the hell is all this?".

I say all this by way of explanation of my soon-to-commence R&D project: to build a passive data dictionary for HBase which will discover the names (and where feasible, the structures) of the columns in any given HBase table, storing the results in (of course!) an HBase table. My working name for this open-source project is "Chaos Wrangler".
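In rough outline, the passive-discovery idea might look like the sketch below. This is plain Java over an in-memory list, not HBase client code and not the eventual project's design: the CellCoordinate type and the discoverColumns method are hypothetical stand-ins for whatever a real table scan would surface, and the resulting dictionary is kept in an ordinary map rather than stored back into an HBase table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch of passive column discovery over already-stored data (hypothetical names). */
class ColumnDiscoverySketch {

    /** Coordinates of one stored cell, as a full scan might report them. */
    static final class CellCoordinate {
        final String rowKey;
        final String family;
        final String qualifier;

        CellCoordinate(String rowKey, String family, String qualifier) {
            this.rowKey = rowKey;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Builds a dictionary: column family -> sorted set of qualifiers actually seen. */
    static Map<String, SortedSet<String>> discoverColumns(Iterable<CellCoordinate> scannedCells) {
        Map<String, SortedSet<String>> dictionary = new TreeMap<>();
        for (CellCoordinate cell : scannedCells) {
            dictionary.computeIfAbsent(cell.family, f -> new TreeSet<>()).add(cell.qualifier);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        List<CellCoordinate> cells = Arrays.asList(
                new CellCoordinate("row1", "cf", "name"),
                new CellCoordinate("row1", "cf", "email"),
                new CellCoordinate("row2", "cf", "name"),
                new CellCoordinate("row2", "audit", "last_login"));
        System.out.println(discoverColumns(cells)); // prints {audit=[last_login], cf=[email, name]}
    }
}
```

The point of the sketch is only that the dictionary is derived from the data after the fact, since nothing in the table's own definition lists the qualifiers.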

EDIT 2016-07-17: The open-source project, ColumnManager for HBase, is now in beta-02, with binaries and documentation available both on GitHub (where the project is housed) and via the Maven Central Repository!

Wednesday, May 20, 2015


Here is the entry I just submitted to the Khan Academy Talent Search...

What subject do your videos teach, and what level do they target?

Long division, 3-12 math

Video #1

Video #2

Why are you interested in sharing your videos with the world through Khan Academy?

Usually I measure success in teaching on a fairly personal level, when I'm communicating concepts to an individual or a small group, and I can see the glint of recognition in human eyes (sometimes prefaced by the crinkled brow of confusion) right before me. But YouTube and Khan Academy give another way to find success in educating people. While I could just measure it in the raw "number of views" that a tutorial video gets, with YouTube I can also get direct feedback from people, sometimes with students asking for clarification, or requesting that I do another video on another topic that they're learning about. (See the comments section on my "Long Division With a Two-Digit Divisor" video to see what I'm talking about.) My presumption is that getting my content out on Khan Academy as well as YouTube would heighten the possibilities for this kind of interaction.

But here's what I really LOVE about this medium of learning: People come to get it when THEY need it, at the precise moment that they WANT to. Hungry, self-directed minds like these are the most open and receptive to a concise and intriguing explanation that can quickly lead to that "a-ha, now I get it" moment!

Is there anything else you'd like us to know?

The first video that I'm submitting, "Long Division With a Two Digit Divisor," was literally made for a single classroom-full of Grade 4 students that I was then working with. I posted it publicly on YouTube because I figured it might be of benefit to somebody else besides my students. After it started getting lots of views I decided to make two "prequels" explaining the basics of long division (one of which is my second submitted video, "Why Long Division Works").

In further-flung-yet-curiously-related matters:
EDIT 2015-05-26:
Here's a cocktail napkin sketch of some ideas for projects I'd like to work on in the future (with folks like those at Khan Academy, or on my own) --
  • Numeracy project = learning resources for self-tutoring in numeracy skills -- geared toward adults who are at the very lower end of the numeracy spectrum. Would need to be built with an eye to overcoming (or steering around) the fears that drive the innumerate to remain innumerate. (We dare not underestimate the overwhelming, self-defeating power of the "I'm no good at math" mantra.) Goal: elimination of innumeracy in the adult population.
  • Literacy project = same as numeracy project, but with focus on literacy.
  • Object Oriented version of MIT’s Scratch = for learners of all ages, a game-building mechanism like MIT’s Scratch, but one that is built upon valid Object-Oriented (OO) principles and which has a direct path into working with genuine, mature programming languages in a real IDE (integrated development environment). ScratchOO must get junior programmers working with valid OO concepts from day one of playing in the environment. Given that the path leads toward mature programming in a real IDE, perhaps the learning environment should be available as a plug-in to an IDE (like Eclipse or NetBeans). If, for example, ScratchOO were built on top of Java, then all the ScratchOO tools and widgets would be built in Java. The beginner would work with a simple set of these tools to build games, but as they advance, they can get into the actual Java code that lies "behind" the simple tools they've been working with, to customize and extend them, and become fully competent Java programmers.

Thursday, May 14, 2015

Two new open source Java projects launched on GitHub & SourceForge

In the world of Software Engineering, I've heard it said that "code is the new resume". 

If that's so, then I've just burnished up two fresh software engineering "CVs" and posted them to GitHub and SourceForge.

IndexedCollection: A Java class library which extends the standard Java collections framework to provide the IndexedCollection class, which offers simple, automatic, in-memory NoSQL composite-indexing of a standard Collection of objects. In many situations it could be a nice alternative (or a simpler complement) to complex ORM implementations, making for higher efficiency and lower TCO in the development and maintenance of Java applications. 
Gist (simple usage example code): 
Side note -- I've been trying to benchmark my IndexedCollection against another package called CQEngine, but unfortunately that other package's documentation is so sparse, I can't even figure out how to use it for the simplest of queries. The sample code that is published for it won't run against the current release, and both the explanatory text on Google Code and its Javadoc documentation are also out-of-sync with the actual code! So, from a usability perspective, the project seems to have been abandoned, even though the package itself was updated as recently as early 2014. Oh well...
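For readers who have not met the idea before, here is a generic sketch of what in-memory composite indexing of a plain Collection means. To be clear, this is NOT the IndexedCollection API (the class CompositeIndexSketch and its find method are hypothetical, made up for this illustration); it only shows the underlying technique of keying objects by a tuple of their attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Generic composite-key index over a collection (illustration only, hypothetical API). */
class CompositeIndexSketch<T> {
    private final List<Function<T, ?>> keyExtractors;
    // Composite key (one element per extractor) -> all items sharing that key.
    private final Map<List<Object>, List<T>> index = new HashMap<>();

    @SafeVarargs
    CompositeIndexSketch(Collection<T> items, Function<T, ?>... keyExtractors) {
        this.keyExtractors = Arrays.asList(keyExtractors);
        for (T item : items) {
            index.computeIfAbsent(keyOf(item), k -> new ArrayList<>()).add(item);
        }
    }

    /** Builds the composite key for one item by applying each extractor in order. */
    private List<Object> keyOf(T item) {
        List<Object> key = new ArrayList<>();
        for (Function<T, ?> extractor : keyExtractors) {
            key.add(extractor.apply(item));
        }
        return key;
    }

    /** Looks up items by the full composite key; returns an empty list if none match. */
    List<T> find(Object... keyParts) {
        return index.getOrDefault(Arrays.asList(keyParts), Collections.emptyList());
    }

    public static void main(String[] args) {
        List<String[]> books = Arrays.asList(
                new String[] {"Austen", "Emma"},
                new String[] {"Austen", "Persuasion"},
                new String[] {"Dickens", "Bleak House"});
        // Index on the composite key (author, title).
        CompositeIndexSketch<String[]> byAuthorAndTitle =
                new CompositeIndexSketch<>(books, b -> b[0], b -> b[1]);
        System.out.println(byAuthorAndTitle.find("Austen", "Emma").size()); // prints 1
    }
}
```

A lookup on the full key is then a single hash probe instead of a pass over the whole collection, which is the basic payoff of indexing in memory.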

LibriVox Explorer: From a technical point of view, this Windows/Mac/Linux desktop application makes extensive use of the above-mentioned IndexedCollection Java class library, providing the best currently-available example of how IndexedCollections are intended to work. From the end-user viewpoint, LibriVox Explorer is a new way to experience the LibriVox collection of Public Domain audiobooks, and it takes full advantage of the "eye candy" potential of that collection's vast body of intriguing cover-art.
Download it now, and take it for a spin on your desktop: