Last year Teradata – the data storage and analysis specialists – acquired Aster Data Systems. I asked Mayank Bawa to explain, in lay terms, the philosophy behind the innovative technology that he and co-founder Tasso Argyros developed in Aster Data.
The role of the product, now known as the Teradata Aster MapReduce Platform, is the storage and querying of both relational and big data. "Big data" covers many sorts of data, but its unifying theme is that it is data that does not fit well within the relational database model.
Relational databases are great for many kinds of data – sales data, for example, sits happily in a relational database. Every sale has the same kind of data – date of sale, customer, value, location and so on. These are all stored in a table with one column for each piece of data.
On the other hand, a jpeg image file is an entirely different proposition. A jpeg's raison d'être is to be an image. The image itself does not break down in the same way into a tidy series of columns. You can, of course, also store data about the image, like the date on which it was generated, its size and so on, but that is metadata – data about the jpeg file, not the image itself. Image files – still and moving – emails, tweets and blogs all fall into the non-relational category and are conventionally embraced within the term "big data".
Until recently there have been two broad approaches to handling big data.
The first is to adapt existing relational database management systems (RDBMS) by creating a new data type to hold the non-relational data. The binary large object (Blob) data type is an example: images (including jpegs) have been stored as Blobs in RDBMSs for years. The good news is that you can store and retrieve the image; the bad news is that you cannot process the data it contains, only the metadata. So by linking the Blob to the metadata you could bring back all of the images created before 15 March 2008, but you could not use the database engine to scan the images themselves and find all of those that are pictures of mountains.
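To make the limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than any particular commercial RDBMS. The table name, column names and sample rows are invented for illustration. The engine can filter on the date metadata, but the Blob column comes back as opaque bytes that it cannot inspect.

```python
import sqlite3

# Hypothetical table: metadata columns plus one Blob column for the image bytes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT, created TEXT, data BLOB)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [
        ("alps.jpg", "2007-06-01", b"\xff\xd8 placeholder bytes 1"),
        ("beach.jpg", "2008-03-14", b"\xff\xd8 placeholder bytes 2"),
        ("office.jpg", "2009-01-20", b"\xff\xd8 placeholder bytes 3"),
    ],
)

# The engine can answer a question about the metadata (ISO dates sort as text)...
before = conn.execute(
    "SELECT name, data FROM images WHERE created < '2008-03-15' ORDER BY created"
).fetchall()
names = [name for name, _ in before]

# ...but each image is returned as raw bytes; SQL has no way to ask
# "is this a picture of a mountain?".
first_blob = before[0][1]
```

Here `names` holds the two images created before 15 March 2008, while `first_blob` is simply a `bytes` value that the database cannot interpret.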
The second approach is to build an entirely new type of database engine able to handle the particular flavour of big data in question. So, for example, you could create a new database engine for jpegs and create a set of functions that can look inside the jpeg and, for example, classify it as a sunset view, a person, a mountain view etc.
In practice, neither approach is satisfactory. The first cannot process the big data effectively and the second means a new database engine must be constructed for each of the many varieties of big data.
One engine for relational and non-relational data
This was the problem that Aster sought to address: how to handle relational data and also, within the same engine, multiple forms of non-relational data elegantly. It also needed to address the problem of how to facilitate querying across all of the different types. There are some natural limitations here, of course; there is no point trying to query a spreadsheet to see if it is a picture of a mountain, but there is huge virtue in being able to use the same query mechanism to find all PDFs, blog entries and Word documents that include a specific phrase.
The route Bawa and Argyros chose was to design a system composed of three elements: a storage engine, a processing layer and a function library.
The storage engine
Data is held by the engine in one of two ways. Relational data is held in traditional tables. All big data is held as de-serialised objects which are inherently very similar to Blobs.
The processing layer
This is where the querying power resides in the form of a SQL engine which has been extended to include MapReduce functions. MapReduce provides a means of combining and managing data from multiple sources and is geared for large volumes of data. The data that is stored in relational tables can be queried with SQL, as usual, and the big data – stored as Blobs – can be queried using the MapReduce functions. MapReduce is not a querying language per se: its task is to make complex querying of vast data sets possible by reducing the problem into smaller steps, which can then be answered in parallel.
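The idea of breaking a large question into small steps that run in parallel can be sketched in a few lines of plain Python. This is a generic illustration of the map and reduce pattern, not the Aster implementation; the document texts and the function name `map_words` are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

# Hypothetical stand-ins for a large collection of stored documents.
documents = [
    "the project codename is bluebird",
    "nothing relevant here",
    "bluebird ships in march; bluebird is late",
]

def map_words(text):
    """Map step: count the words in ONE document, independently of the rest."""
    return Counter(text.split())

# Each document is processed separately, so the map step can run in parallel.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(map_words, documents))

# Reduce step: merge the small partial answers into one overall answer.
totals = reduce(operator.add, partial_counts, Counter())
bluebird_mentions = totals["bluebird"]
```

Because no map call depends on any other, the same pattern scales out across many machines; only the final merge needs to see all of the partial results.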
The function library
This is the critical element. The library exists within the analytical layer where users can write any function they require to query or manipulate any non-relational form of data stored as a Blob. Resulting functions are stored in the function library.
Imagine we were storing spreadsheet files in .xls format: they would be stored as Blobs and be queriable with MapReduce. If we wanted to find any which contained a string of characters that identifies a particular project, we could write that function and store it.
Now suppose, to the same data store, we add an archive of old emails. These will be stored and queriable in the same way as Blobs. We can write further functions to interrogate the emails, maybe to find those with a body of more than 100 words.
At this point we are storing many different flavours of non-relational data and are free to query them as we wish. Job done, you might think, but there is more. All of the functions we write can also be used to cross-query the different types of data – again, as long as this is logically appropriate.
Aster does this by making use of a principle from the relational database model called closure. The principle of closure says that any query run against a table or tables of data must return its answer in the form of a further table. These are often called answer tables and can be used in exactly the same ways as the original tables of data. So, an answer table can have queries run against it – and the answer tables from these too will exhibit the same behaviour. Closure is a fundamental principle for this very reason – it essentially allows the chaining of queries.
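Closure is easy to see in ordinary SQL. The sketch below again uses Python's sqlite3 module with an invented `sales` table: the inner query produces an answer table, and the outer query treats that answer table exactly as if it were a stored one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, region TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Ann", "North", 120.0),
        ("Bob", "North", 80.0),
        ("Cy", "South", 200.0),
        ("Di", "South", 50.0),
    ],
)

# Inner query: an answer table containing only sales worth 100 or more.
# Outer query: runs against that answer table, and itself returns a table.
region_totals = conn.execute(
    """
    SELECT region, SUM(value)
    FROM (SELECT * FROM sales WHERE value >= 100)
    GROUP BY region
    ORDER BY region
    """
).fetchall()
```

The result, `region_totals`, is once more a table of rows, so it could in turn feed a third query.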
In the Teradata Aster MapReduce Platform, every function we write produces a table, regardless of what the function does or the data against which it was designed to run. For example, the function to identify all the long emails that we wrote to query the email archive returns a table. This might, for example, comprise a single column containing the body text of all the long emails. Imagine that we now want to find which of these long emails reference a specific word or phrase. We could run the string-finding function we initially wrote to run against .xls spreadsheet files against the email answer table because, as well as running against the .xls file format, it will also run happily against tables of data.
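The chaining described above can be mimicked in plain Python by modelling a table as a list of rows and making every function return such a table. This is only an analogy under that assumption; the function names `long_emails` and `contains_phrase`, and all of the sample data, are hypothetical and are not part of the Aster function library.

```python
def long_emails(table, min_words=100):
    """Return a one-column table holding the body of every long email."""
    return [
        {"body": row["body"]}
        for row in table
        if len(row["body"].split()) > min_words
    ]

def contains_phrase(table, phrase):
    """Return the rows of ANY table in which some column contains the phrase."""
    return [
        row for row in table
        if any(phrase in str(value) for value in row.values())
    ]

# Hypothetical data: an email archive and the extracted text of spreadsheets.
emails = [
    {"sender": "a@example.com", "body": "short note about bluebird"},
    {"sender": "b@example.com", "body": "bluebird " + "word " * 120},
    {"sender": "c@example.com", "body": "filler " * 150},
]
spreadsheets = [
    {"file": "budget.xls", "cell": "bluebird Q3 forecast"},
    {"file": "rota.xls", "cell": "holiday cover"},
]

# The string finder was written with spreadsheets in mind...
sheet_hits = contains_phrase(spreadsheets, "bluebird")

# ...but because long_emails returns a table, the same function chains onto it.
long_bodies = long_emails(emails)
email_hits = contains_phrase(long_bodies, "bluebird")
```

Neither function knows about the other; they compose only because each one accepts a table and returns a table.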
The net result of this is remarkable; it means that you can run queries across relational and big data. And that is important. Enterprises have realised, over the years, that analysis of relational data is crucial. Enterprises have also become interested in big data and feel the need to analyse that as well. The idea that the two analyses will run independently is clearly a non-starter – in many cases the business people involved will not even differentiate between the types of data they need to analyse. They will just want to know whether, for example, a marketing campaign (relational data) has affected the standing of their company in the websphere (big data). So the ability to cross analyse is crucial. Whether the Aster Data approach is going to be adopted as the de facto solution is open to question; but it is a fascinating answer to the problem.
Mark Whitehorn works as a consultant for national and international companies. He specialises in databases, data analysis, data modeling, data warehousing and business intelligence (BI). He also holds the chair of Analytics at the University of Dundee, where he works as an academic researcher and lecturer and runs a masters programme in business intelligence.