Since their inception in the 1970s, relational databases have come to be the norm for storing corporate data. Typically, data in relational databases is stored in tables that comprise rows and columns. Why am I boring you with these minutiæ? My point is that relational databases have been with us a long time and are fantastically good at storing data that is well-structured: every piece of data in its place and a place for every piece of data.
For many years, this was fine because by far the majority of business data is very highly structured: every one of your existing customers has a date upon which they first interacted with your organisation; every customer has a name and so on.
Recently, however, we have seen an upsurge in data such as email, social network traffic, digital images and output from sensors, radio frequency identification (RFID) tags and devices enabled by the Global Positioning System (GPS) -- and some of this data is much less formally structured than transaction data is. For many people, it is primarily this new surge of data that has come to be known as “big data” (as opposed to “small data,” but no one seems to use that term).
Big data management: It’s not just a snap
While definitions often include other characteristics, such as data variety and complexity, “big” is clearly a reference to the volume of data involved. How many megapixels is your most recent digital camera -- 12? Well, to put that into perspective, if we assume 1 KB to store the details of one customer in a database, then even with compression, a single snap of around 6 MB is the equivalent of 6,000 customers. Oh, I’m sorry, it was a video camera you bought; in that case … You get the idea: “Big” doesn’t just mean twice as large as previously; it means orders of magnitude bigger.
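The arithmetic behind that comparison can be sketched as follows. Only the 12 megapixels and the 1 KB per customer come from the text; the 3 bytes per pixel and the 6:1 compression ratio are assumptions chosen for illustration, not figures from any particular camera.

```python
# Back-of-envelope check: how many 1 KB customer records fit in one photo?
MEGAPIXELS = 12                 # from the text
CUSTOMER_RECORD_BYTES = 1_000   # from the text: 1 KB per customer (decimal KB)
BYTES_PER_PIXEL = 3             # assumption: 8 bits each for red, green, blue
COMPRESSION_RATIO = 6           # assumption: an illustrative JPEG-style ratio


def customers_per_photo(megapixels=MEGAPIXELS):
    """Return how many customer records occupy the same space as one photo."""
    raw_bytes = megapixels * 1_000_000 * BYTES_PER_PIXEL
    compressed_bytes = raw_bytes // COMPRESSION_RATIO
    return compressed_bytes // CUSTOMER_RECORD_BYTES


print(customers_per_photo())
```

Under those assumptions the uncompressed image is 36 MB and the compressed one about 6 MB, which is where a figure of 6,000 customers would come from. A different compression ratio changes the answer, but not the order of magnitude.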
Some big data remains highly structured; take sensor data, for example. A temperature sensor might send a 3-byte value 10 times per second. Over time the number of readings will build up (particularly if you are using a large number of sensors), but each is a well-formed, 3-byte temperature reading.
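To get a feel for how quickly such readings accumulate, here is a minimal sketch. The 3 bytes and 10 readings per second come from the example above; the encoding (a signed count of hundredths of a degree) and the function names are hypothetical, chosen only to show that every reading has exactly the same shape.

```python
READING_BYTES = 3             # from the text
READINGS_PER_SECOND = 10      # from the text
SECONDS_PER_DAY = 24 * 60 * 60


def daily_bytes(sensors=1):
    """Bytes generated per day by the given number of sensors."""
    return READING_BYTES * READINGS_PER_SECOND * SECONDS_PER_DAY * sensors


def encode_reading(centi_degrees):
    """Pack a reading into exactly 3 bytes.

    Hypothetical encoding: signed hundredths of a degree, big-endian.
    """
    return centi_degrees.to_bytes(READING_BYTES, "big", signed=True)


def decode_reading(raw):
    """Recover the reading from its 3-byte form."""
    return int.from_bytes(raw, "big", signed=True)


print(daily_bytes())       # one sensor
print(daily_bytes(1000))   # a thousand sensors
```

One sensor produces about 2.6 MB a day and a thousand sensors about 2.6 GB a day, yet each individual record remains a fixed-size value that a conventional table could hold without difficulty.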
GPS data is larger (in the sense that it is more than 3 bytes per reading), but all the locations sent to the database still share the same structure.
However, much big data is “semi-structured” in nature, which simply means that the data has some elements that are highly structured, others less so. Take an email as an example: The data elements such as the time and date sent and time and date received are reasonably highly structured; however, the body of the email is unstructured free text. The data therein is relatively easy to scan -- you can, for example, find all of the emails that contain the word “Professor” in your system. But the really useful information in the email body is often qualitative, and the meaning has to be teased out by reading it, understanding it and gauging the tone. Is this an angry email? Similarly, are the most recent postings on social networks for or against your new product marketing campaign?
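The split between the structured and unstructured parts of an email can be illustrated with Python's standard `email` package. This is only a sketch: the sample message, the addresses and the helper names are invented for the example, and it covers just the easy part (pulling out headers and scanning for a keyword), not the hard part of gauging tone.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message for illustration only.
RAW_EMAIL = """\
From: customer@example.com
To: support@example.com
Subject: Order query
Date: Mon, 05 Mar 2012 09:30:00 +0000

Dear Professor Smith,
I am still waiting for my order and I am not happy.
"""


def split_email(raw):
    """Separate the structured headers from the free-text body."""
    msg = message_from_string(raw)
    structured = {
        "sender": msg["From"],
        "sent": parsedate_to_datetime(msg["Date"]),  # a proper datetime value
    }
    body = msg.get_payload()  # plain string for a simple, non-multipart message
    return structured, body


def mentions(body, word):
    """Case-insensitive keyword scan of the unstructured text."""
    return word.lower() in body.lower()


structured, body = split_email(RAW_EMAIL)
print(structured["sent"].isoformat(), mentions(body, "Professor"))
```

The headers drop neatly into typed fields, and the keyword scan is a one-liner. Nothing in this sketch can tell you that the sender is unhappy; that judgement needs analysis of the text itself.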
Big data management beyond relational databases?
The challenges of big data are primarily how to store it and how to analyse it, and the answer to both is probably “not in a relational database.” For big data management, the answer often lies in technologies such as columnar databases, NoSQL databases, Hadoop and MapReduce.
These are relatively new technologies and tend not to be supplied as turnkey systems. As a consequence, adoption in the UK tends to be limited to those companies where the return on investment is worth the pain of innovation. In my experience as a consultant, we are talking about telecoms companies rather than banks, and patient data as opposed to sales data.
But where it is being deployed effectively, big data technology is proving wildly successful. And already some people are looking at dual-storage mechanisms in which a relational database engine and a semi-structured data storage engine work in synchrony. I myself am currently working on a scientific system where we are interfacing the two types of database engines.
In the commercial world, a piece of big data might be stored in a NoSQL system. There, it can be analysed and identified as an email from Mr Angry, a tweet supporting a marketing campaign or a patient X-ray showing no potential tumours. The data extracted as part of the analysis process can then be structured and stored in a relational database engine for future review, analysis, reporting or other purposes. So, for example, you can query the relational database to find out how many pro and anti tweets have been received and you can, of course, still view each individual tweet by retrieving it from the NoSQL store.
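A toy version of that dual-storage arrangement might look like the sketch below. Everything in it is an assumption made for illustration: a plain Python dictionary stands in for the NoSQL store, an in-memory SQLite database stands in for the relational engine, the tweets are invented, and the `classify` function is a deliberately crude keyword rule rather than real sentiment analysis.

```python
import sqlite3

# Stand-in for a NoSQL store: raw documents keyed by id (invented data).
document_store = {
    "t1": {"user": "@alice", "text": "Loving the new campaign!"},
    "t2": {"user": "@bob", "text": "This campaign is awful."},
    "t3": {"user": "@carol", "text": "Great campaign, well done."},
}


def classify(text):
    """Toy rule, purely illustrative; real analysis is far harder."""
    positive_words = ("loving", "great")
    return "pro" if any(w in text.lower() for w in positive_words) else "anti"


def build_summary(store):
    """Write the structured result of the analysis to a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweet_sentiment (doc_id TEXT PRIMARY KEY, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO tweet_sentiment VALUES (?, ?)",
        [(doc_id, classify(doc["text"])) for doc_id, doc in store.items()],
    )
    return conn


conn = build_summary(document_store)

# Relational side: how many pro and anti tweets were received?
counts = dict(
    conn.execute(
        "SELECT sentiment, COUNT(*) FROM tweet_sentiment GROUP BY sentiment"
    )
)

# Document side: fetch the original text of each anti tweet by its id.
anti_ids = [
    row[0]
    for row in conn.execute(
        "SELECT doc_id FROM tweet_sentiment WHERE sentiment = 'anti' ORDER BY doc_id"
    )
]
anti_texts = [document_store[doc_id]["text"] for doc_id in anti_ids]

print(counts, anti_texts)
```

The relational table holds only the document id and the extracted fact, so counting and reporting stay simple, while the id links back to the full, unmodified item in the document store.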