As technology becomes more ubiquitous, the amount of electronic data grows with it. Just a few years ago, corporate databases tended to be measured in the range of tens to hundreds of gigabytes. Now, multi-terabyte (TB) or even petabyte (PB) databases are quite normal.
The World Data Center for Climate (WDCC) stores over 6PB of data overall (although all but around 220TB of this is held in archive on tape) and the National Energy Research Scientific Computing Center (NERSC) has over 2.8PB of available data on atomic energy research, physics projects and so on.
Even companies such as Amazon run databases in the tens of terabytes. Companies that few would expect to need such massive systems are dealing with databases in the hundreds of terabytes: ChoicePoint, a US company that tracks data about the whole of the US population, has one of the largest databases in the world.
Less surprising are telecoms companies and service providers, where simply logging all the events that happen across their technology estates can quickly build up database sizes. Social media sites are another case: even those that are text-only or primarily text, such as Twitter or Facebook, have big enough problems, while the likes of YouTube have to deal with massively expanding datasets.
Changing data requirements
Yet the biggest problem is not simply the sheer volume of data, but the fact that the type of data companies must deal with is changing. When it was rows and columns of figures held in a standard database, life was (relatively) simple. It all came down to the speed of the database and the hardware it was running on. Now, more and more binary large objects (BLOBs) are appearing in databases. These require a different approach to identifying and reporting on what the content actually is, to finding patterns in it and to making sense of what it means to the user.
Even worse, a shrinking proportion of information is making it into standard databases at all. The amount of numerical and textual data that resides within a database is still increasing, but it is being outstripped by the amount of information created in a more ad hoc manner, in files that sit directly in a file system.
At the formal data level, suppliers initially used various approaches, such as data warehouses, data marts and data cubes, to provide a fast and effective means of analysing and reporting on very large data sets. When this started to creak, master data management, data federation and other techniques such as in-memory databases and "sip of the ocean" indicative analysis were brought in to try to keep ahead of the curve. What has become apparent, however, is that such approaches were just stopgaps, and database suppliers have struggled to keep up.
Is big data the answer?
To deal with the increasing amount of information being held within databases, an approach termed "big data" has come to the fore. It was originally aimed at companies in markets such as oil and gas exploration, pharmaceuticals and others dealing with massive data sets. Big data looked at how to move away from overly complex and relatively slow systems to ones that could provide much greater visibility of what is happening at a data level, enabling those in highly data-centric environments to deal with massive data sets in the fastest time possible. It has since evolved into an idea presented to the commercial world as a means of dealing with its own complex data systems and also, in some cases, with information held outside formal databases.
The Apache Hadoop system is one such approach. This open source system uses its own distributed file system (HDFS) to create a massively scalable and highly performant platform for dealing with different sorts of data, which can include textual or other data that has been brought into Hadoop through, for example, web crawlers or search engines.
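Hadoop is commonly used with the MapReduce programming model, in which work on a large data set is split into a "map" step that runs independently on each piece of data and a "reduce" step that combines the results. The sketch below illustrates that idea in plain Python with a word count over crawled text. It is not Hadoop's API, it runs on a single machine, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce over a collection of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    crawled_pages = ["big data is big", "data about data"]
    print(word_count(crawled_pages))
```

Because each call to `map_phase` depends only on its own document, a platform of this kind can hand different documents to different machines and merge the results afterwards, which is where the scalability comes from.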
Another approach was demonstrated by IBM with its Watson computer system, which gained fame by winning US quiz programme Jeopardy!. The Watson system uses a mix of database technology and search systems, along with advanced analytic technologies, to enable a computer to appear to be "thinking" in the same way a human does, working backwards from a natural language answer to predict what the question associated with that answer would have been.
Now being developed into a range of applications that can be sold commercially, Watson is not some highly proprietary system built just for one purpose - IBM purposefully designed it on commercially available hardware and software (such as DB2, WebSphere, InfoStreams and so on) so it could be useful to the general user in as short a time as possible.
The problem remains that most organisations still regard "data" as rows and columns of numbers that can be mined and reported on using analytical tools that will end up with a graph of some sort. This is why Quocirca prefers the term "unbounded data" - the capability to pull together data and information from a range of disparate sources and to make sense of it in a way that a user needs.
Therefore, when looking for a solution to "big data", Quocirca recommends that organisations ask the following questions:
- Can this solution deal with different data types, including text, image, video and sound?
- Can this solution deal with disparate data sources, both within and outside of my organisation's environment?
- Will the solution create a new, massive data warehouse that will only make my problems worse, or will it use metadata and pointers to minimise data replication and redundancy?
- How can and will the solution present findings back to me - and will this only be based on what has already happened, or can it predict with some degree of certainty what may happen in the future?
- How will the solution deal with back-up and restore of data? Is it inherently fault tolerant, and can I easily apply more resource to the system as required?
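The question about metadata and pointers can be made concrete with a small sketch. The Python below shows one possible shape for a catalogue that records where each item lives and what type it is, rather than copying the item into a new store, and that uses a checksum to notice when the same content turns up in two places. All names here (`CatalogueEntry`, `Catalogue`, `register`, the example locations) are hypothetical and do not describe any particular product.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogueEntry:
    """Metadata about one item; the item itself stays where it already lives."""
    location: str    # pointer to the source, e.g. a file path or URL
    media_type: str  # e.g. "text/plain" or "video/mp4"
    size_bytes: int
    checksum: str    # SHA-256 of the content, used to spot duplicates


@dataclass
class Catalogue:
    entries: dict = field(default_factory=dict)  # checksum -> first entry seen
    aliases: dict = field(default_factory=dict)  # checksum -> other locations

    def register(self, location, media_type, content):
        """Record metadata for content; duplicates add a pointer, not a copy."""
        checksum = hashlib.sha256(content).hexdigest()
        if checksum in self.entries:
            self.aliases.setdefault(checksum, []).append(location)
            return self.entries[checksum]
        entry = CatalogueEntry(location, media_type, len(content), checksum)
        self.entries[checksum] = entry
        return entry

    def by_media_type(self, media_type):
        """Find catalogued items of one type across every registered source."""
        return [e for e in self.entries.values() if e.media_type == media_type]


if __name__ == "__main__":
    catalogue = Catalogue()
    catalogue.register("crm://notes/42", "text/plain", b"call customer back")
    catalogue.register("file:///shares/sales/notes.txt", "text/plain",
                       b"call customer back")
    print(len(catalogue.entries), catalogue.aliases)
```

In this sketch the second registration stores only an extra location against the existing entry, so the catalogue grows with the number of distinct items rather than with the number of copies.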
With the massive growth in data volumes that is occurring, whatever solution is chosen must be able to deal with such growth for a reasonable amount of time - at least five years. Therefore, flexibility is key, and a pure formal data focus based around rows and columns of data will not provide this.
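To get a feel for what a five-year horizon implies, the compound growth arithmetic can be sketched in a few lines of Python. The starting volume of 50TB and the 40% annual growth rate used below are illustrative assumptions only, not figures from this article, and the function name `projected_volume_tb` is made up for the example.

```python
def projected_volume_tb(current_tb, annual_growth, years):
    """Project a data volume forward assuming constant compound annual growth.

    annual_growth is a fraction, so 0.40 means 40% growth per year.
    """
    return current_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    start_tb = 50.0  # assumed current volume
    growth = 0.40    # assumed annual growth rate
    for year in range(6):
        print(year, round(projected_volume_tb(start_tb, growth, year), 1))
```

Under these assumptions the volume is more than five times larger after five years, which is why a solution sized only for today's data is unlikely to last the period.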
Data measurement table
Multiples of bytes

| SI decimal prefix: name (symbol) | Value | Binary usage | IEC binary prefix: name (symbol) | Value |
|---|---|---|---|---|
| kilobyte (kB) | 10^3 | 2^10 | kibibyte (KiB) | 2^10 |
| megabyte (MB) | 10^6 | 2^20 | mebibyte (MiB) | 2^20 |
| gigabyte (GB) | 10^9 | 2^30 | gibibyte (GiB) | 2^30 |
| terabyte (TB) | 10^12 | 2^40 | tebibyte (TiB) | 2^40 |
| petabyte (PB) | 10^15 | 2^50 | pebibyte (PiB) | 2^50 |
| exabyte (EB) | 10^18 | 2^60 | exbibyte (EiB) | 2^60 |
| zettabyte (ZB) | 10^21 | 2^70 | zebibyte (ZiB) | 2^70 |
| yottabyte (YB) | 10^24 | 2^80 | yobibyte (YiB) | 2^80 |
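The gap between the decimal and binary meanings of each prefix widens as the units get larger, which matters when comparing quoted capacities. The Python sketch below converts between the SI and IEC units listed in the table; the names `SI_UNITS`, `IEC_UNITS`, `to_bytes` and `convert` are illustrative choices.

```python
# Bytes per unit for SI decimal prefixes (powers of 1000)
SI_UNITS = {
    "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}

# Bytes per unit for IEC binary prefixes (powers of 1024)
IEC_UNITS = {
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
    "PiB": 2**50, "EiB": 2**60, "ZiB": 2**70, "YiB": 2**80,
}

ALL_UNITS = {**SI_UNITS, **IEC_UNITS}


def to_bytes(value, unit):
    """Convert a value in the given unit to a number of bytes."""
    return value * ALL_UNITS[unit]


def convert(value, from_unit, to_unit):
    """Convert a value between any two of the units in the table."""
    return to_bytes(value, from_unit) / ALL_UNITS[to_unit]


if __name__ == "__main__":
    # One decimal terabyte is roughly 0.909 of a binary tebibyte
    print(round(convert(1, "TB", "TiB"), 3))
    # One decimal petabyte is roughly 0.888 of a binary pebibyte
    print(round(convert(1, "PB", "PiB"), 3))
```

At the kilobyte level the two definitions differ by 2.4%; by the petabyte level the difference is more than 11%.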