This is a guest post for the Computer Weekly Open Source Insider blog by James Dixon, CTO at open source Business Intelligence (BI) products company Pentaho.
Four years ago when Pentaho first released Hadoop support, Dixon coined the term ‘Data Lake’ to describe a vessel for holding data from a single source. When selecting it, he thought very carefully about its suitability as both an analogy and a metaphor.
Lexicon of (data) love
In one respect I’m pleased that the term has entered the data architecture lexicon.
Several companies have even designed products and services around the concept. Less pleasing is that since 2010 it's been gradually redefined, then criticised on the basis of those new definitions.
But hey, this kind of thing happens in any modern, digital debate and at least it indicates there’s a healthy interest in the subject matter. However, as one who spends most waking hours conceiving new information architectures to solve modern data problems, I thought it was time to revive the original Data Lake definition and explain its original role and relevance.
Clearing the air… and the water
In 2010, after speaking to many early Hadoop adopters, I learned that:
● 80-90% of companies were dealing with structured or semi-structured data (not unstructured)
● The source of the data was typically a single application or system
● The data was typically sub-transactional or non-transactional
● There were some known questions to ask of the data
● There were even more unknown questions that would arise in the future
● There were multiple user communities that would have questions of the data
● The data was of a scale or daily volume such that it wouldn't fit technically and/or economically into an RDBMS
The Data Lake concept took all of these findings into account, along with the limitations of traditional approaches such as ‘data marts’. A fundamental problem with data marts is that only a subset of data attributes can be examined, so only known, pre-determined questions can be asked. Also, because the data is aggregated, visibility into the lowest levels is lost.
NOTE: A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.
Having said this, a data lake does not replace a database, data mart, or data warehouse. At least not yet. I explain these concepts and more in my initial video on the topic here: Pentaho Hadoop Series Part 1: Big Data Architecture
Not exactly wrong, not exactly right
In their articles and reports, critics such as Devlin and Gartner make statements that are not wrong, yet not really right either.
I agree with Devlin that the idea of putting all enterprise data into Hadoop (or any other data store) is not a viable option (at least right now). You should use the best tool for the job. Use a transactional database for transactional purposes. Use an analytic database for analytic purposes. Use Hadoop or MongoDB when they are the best fit for the situation. For the foreseeable future the IT environment is and will be a hybrid one with many different data stores.
Gartner’s take on Data Lakes says: “By its definition, a data lake accepts any data, without oversight or governance.”
However, as I originally defined it, a Data Lake only accepts data from a single source.
These are just a few examples of conclusions based on an ‘evolved’ version of my original definition. Somewhere in these critiques my main premise for Data Lakes has been lost, which is:
You store raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.
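To make this concrete, here is a minimal sketch in Python using entirely hypothetical event data (the field names, values, and the "page time" question are invented for illustration). It contrasts a lake of raw, granular records with a mart-style aggregate built for one pre-determined question, and shows why only the former can answer a question that arrives later.

```python
from collections import defaultdict

# "Data lake": raw sub-transactional events from a single source,
# kept at their most granular level. (Hypothetical sample data.)
raw_events = [
    {"user": "alice", "page": "/home",     "device": "mobile",  "ms": 120},
    {"user": "alice", "page": "/checkout", "device": "mobile",  "ms": 340},
    {"user": "bob",   "page": "/home",     "device": "desktop", "ms": 95},
    {"user": "bob",   "page": "/home",     "device": "mobile",  "ms": 105},
]

# "Data mart": an aggregate built for one known question
# (total time per page). The user and device attributes are discarded.
mart = defaultdict(int)
for e in raw_events:
    mart[e["page"]] += e["ms"]

# A new, previously unknown question arrives: time per device.
# The lake answers it ad hoc from the same raw records...
by_device = defaultdict(int)
for e in raw_events:
    by_device[e["device"]] += e["ms"]

# ...but the mart cannot: it only retains the page dimension.
print(dict(mart))       # {'/home': 320, '/checkout': 340}
print(dict(by_device))  # {'mobile': 565, 'desktop': 95}
```

The point of the sketch is the asymmetry: any aggregate can be derived from the granular records at any time, but the granular records can never be recovered from the aggregate.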
As I said before, I'm not overly taxed by this because it's all part and parcel of debate. I'd be much more concerned if nobody was interested at all! However, if you're a developer tasked with designing a modern information architecture that incorporates big data, I would certainly encourage you to revisit my original ‘data lake’ concept here and draw your own conclusions.