Pentaho CTO James Dixon has been more than vocal on the subject of data lakes in recent times and has strived to plug (ouch! sorry) this topic before it drains away (double ouch!).
Dixon himself coined the term ‘Data Lake’ to describe a vessel for holding data from a single source and you can read more on this subject here.
Dixon now says that we need to reconsider what data is and what it does for a living on a typical workday — this is because (for many business applications) we can say that the applications are essentially workflow applications or state machines.
This notion of apps as state machines includes:
• CRM systems,
• ERP systems,
• Asset tracking tools,
• Case tracking tools,
• Call center functions, and
• Some financial systems.
The real-world entities (employees, customers, devices, accounts, orders etc.) represented in these systems are stored as a collection of attributes that define their current state, explains Dixon.
EXAMPLE: These attributes include someone’s current address or number of dependents, an account’s current balance, who is in possession of laptop X, which documents for a loan approval have been provided, and the date of Fluffy’s last Feline Distemper vaccination.
What can state machines do?
Dixon says that state machines are “very good at answering questions” about the state of things. They are, after all, machines that handle state.
But what about reporting on trends and changes over the short and long term?
How do we do this?
According to Dixon, “The answer for this is to track changes to the attributes in change logs. These change logs are database tables or text files that list the changes made over time. That way you can (although the data transformation is ugly) rewind the change log of a specific field across all objects in the system and then aggregate those changes to get a view over time. This is not easy to do and assumes that you have a change log. Typically, change logs only exist for the main fields in an application. There might only be change logs on 10-20% of the fields. So if you suddenly have an impulse to see how a lesser attribute has changed over time you are out of luck. It is impossible because that information is lost.”
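To make the "rewind and aggregate" step concrete, here is a minimal Python sketch. It is an illustration only, not Dixon's or Pentaho's implementation: the row layout of `CHANGE_LOG`, the `CURRENT_STATE` dictionary and the function names are all assumptions made for this example.

```python
from datetime import date

# Hypothetical change-log rows: (object_id, field, old_value, new_value, changed_on)
CHANGE_LOG = [
    ("acct-1", "balance", 100, 150, date(2015, 1, 10)),
    ("acct-2", "balance", 200, 180, date(2015, 1, 20)),
    ("acct-1", "balance", 150, 120, date(2015, 2, 5)),
]

# Hypothetical current state, as the application itself would store it.
CURRENT_STATE = {"acct-1": {"balance": 120}, "acct-2": {"balance": 180}}


def rewind_field(current_state, change_log, field, as_of):
    """Return {object_id: value} for `field` as it stood at the end of `as_of`.

    Starts from the current values and undoes, newest first, every logged
    change to `field` that happened after `as_of`.
    """
    values = {
        obj: attrs[field]
        for obj, attrs in current_state.items()
        if field in attrs
    }
    later = [row for row in change_log if row[1] == field and row[4] > as_of]
    for obj, _field, old_value, _new_value, _when in sorted(
        later, key=lambda row: row[4], reverse=True
    ):
        values[obj] = old_value
    return values


def total_over_time(current_state, change_log, field, dates):
    """Aggregate the rewound values into one total per date."""
    return {
        d: sum(rewind_field(current_state, change_log, field, d).values())
        for d in dates
    }


month_ends = [date(2015, 1, 31), date(2015, 2, 28)]
trend = total_over_time(CURRENT_STATE, CHANGE_LOG, "balance", month_ends)
```

The sketch only works for `balance` because that field happens to have log rows. A field that was never logged can be read in its current state but cannot be rewound, which is the limitation Dixon describes.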
The Pentaho CTO continues by asserting that this situation is similar to the way that old school business intelligence and analytic applications were built.
“End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart. This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a Data Lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the Data Lake for new questions,” he said.
The suggestion here is that a Data Lake can also be used to solve the problems of history and trending for workflow applications and state machines.
What if, we say, what if?
Dixon asks: what if these applications write their initial state into the Data Lake and then also write every change to every attribute in there as well?
While we are at it, let’s log all the application events coming from the user interface tier as well. From the application’s perspective this is a low-latency, fire-and-forget scenario.
“Now we have the initial state of the application’s data and the changes to all of the attributes, not just the main/traditional fields. We can apply this approach to more than one application, each with its own Data Lake of state logs, storing every incremental change and event. So now we have the state of every field of (potentially) every business application in an enterprise across time. We have the ‘Union of the State’ today,” says Dixon.
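As a rough picture of what such a state log could look like, the Python sketch below keeps an append-only list of attribute values per application and replays it to rebuild the state at a chosen moment. The `StateEvent` and `StateLog` names, their fields and the in-memory list are hypothetical stand-ins for this illustration; a real Data Lake would sit on distributed storage, not a Python list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class StateEvent:
    """One attribute value written by one application (hypothetical record shape)."""
    app: str
    entity_id: str
    attribute: str
    value: Any
    at: datetime


class StateLog:
    """Append-only log; applications only ever append (fire and forget)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def state_at(self, when):
        """Replay all events up to and including `when`, across every application."""
        state = {}
        for ev in sorted(self._events, key=lambda e: e.at):
            if ev.at > when:
                break
            state.setdefault((ev.app, ev.entity_id), {})[ev.attribute] = ev.value
        return state


log = StateLog()
# Initial state of two applications...
log.append(StateEvent("crm", "cust-1", "address", "1 Old Street", datetime(2015, 1, 1)))
log.append(StateEvent("crm", "cust-1", "dependents", 0, datetime(2015, 1, 1)))
log.append(StateEvent("assets", "laptop-X", "holder", "alice", datetime(2015, 1, 1)))
# ...followed by every later attribute change.
log.append(StateEvent("crm", "cust-1", "dependents", 1, datetime(2015, 3, 1)))
log.append(StateEvent("assets", "laptop-X", "holder", "bob", datetime(2015, 4, 1)))

february = log.state_at(datetime(2015, 2, 1))
may = log.state_at(datetime(2015, 5, 1))
```

Because every attribute is logged rather than a chosen few, any field can be replayed to any date without deciding in advance which questions will be asked.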
With this data we have the ability to rewind the Union of the State to any point in time. What are the potential use cases for the Union of the State?
You can read the complete analysis on Dixon’s own blog here.
Editorial Disclosure: Adrian Bridgwater has worked on eBook materials for Pentaho.