Big data is big news in business IT, but what does big data storage require?
To answer that question we must first look at the nature of big data.
A concise, contemporary Gartner definition describes big data as "high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making".
So, big data can comprise structured and unstructured data; it exists in high volumes and undergoes high rates of change.
The key reason behind the rise of big data is its use to provide actionable insights. Typically, organisations use analytics applications to extract information that would otherwise be invisible, or impossible to derive using existing methods.
Industries such as petrochemicals and financial services have been using data warehousing techniques to process very large data sets for decades, but this is not what most understand as big data today.
The key difference is that today's big data sets include unstructured data and allow for extracting results from a variety of data types, such as emails, log files, social media, business transactions and a host of others.
For example, sales figures of a particular item in a chain of retail stores exist in a database and accessing them is not a big data problem.
But, if the business wants to cross-reference sales of a particular item with weather conditions at time of sale, or with various customer details, and to retrieve that information quickly, this would require intense processing and would be an application of big data technology.
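A cross-reference of that kind is essentially a join between data sets that live in different systems. The sketch below, using invented store, weather and sales data, shows the shape of the operation in miniature; a real big data deployment would run the equivalent at scale across billions of records.

```python
# Hypothetical illustration: cross-referencing sales records with
# weather observations by store and date. All names and data invented.
sales = [
    {"store": "A", "date": "2013-07-01", "item": "umbrella", "units": 40},
    {"store": "A", "date": "2013-07-02", "item": "umbrella", "units": 5},
]
weather = {
    ("A", "2013-07-01"): "rain",
    ("A", "2013-07-02"): "sun",
}

def sales_by_weather(sales, weather):
    """Aggregate units sold per weather condition at time of sale."""
    totals = {}
    for row in sales:
        condition = weather.get((row["store"], row["date"]), "unknown")
        totals[condition] = totals.get(condition, 0) + row["units"]
    return totals

print(sales_by_weather(sales, weather))  # {'rain': 40, 'sun': 5}
```

The interesting part for storage is not the join logic itself but the retrieval pattern it implies: many small, scattered reads across heterogeneous sources, returned quickly enough to be actionable.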
What's different about big data storage?
One of the key characteristics of big data applications is that they demand real-time or near real-time responses. If a police officer stops a car they need data on that car and its occupants as quickly as possible.
Likewise, a financial application needs to pull data from a variety of sources quickly to present traders with correlated information that allows them to make buy or sell decisions ahead of the competition.
Data volumes are growing very quickly - especially unstructured data - typically at around 50% annually. This growth is only likely to accelerate, augmented by data from growing numbers and types of machine sensors, as well as by mobile data, social media and so on.
All of which means that big data infrastructures tend to demand high processing/IOPS performance and very large capacity.
Big data storage choices
The methodology selected to store big data should reflect the application and its usage patterns.
Traditional data warehousing operations mined relatively homogeneous data sets, often on fairly monolithic storage infrastructures that, by today's standards, offered limited ability to add processing or storage capacity incrementally.
By contrast, a contemporary web analytics workload demands low-latency access to very large numbers of small files, where scale-out storage - consisting of a number of compute/storage elements where capacity and performance can be added in relatively small increments - is more appropriate.
That implies a number of storage approaches.
Firstly, there is scale-out NAS.
This is file-level access storage in which storage nodes can be daisy-chained together, with storage capacity or processing power increasing as nodes are added. Meanwhile, parallel file systems that scale to billions of files and petabytes of capacity allow for truly big data sets that can be linked together across locations and interrogated.
Major scale-out NAS products for big data include EMC Isilon with its OneFS distributed file system; Hitachi Data Systems' Cloudera Hadoop Distribution Cluster reference architecture; Data Direct Networks' hScaler Hadoop NAS platform; IBM SONAS; HP X9000; and NetApp, which has now reached version 8.2 of its Data ONTAP scale-out operating system.
Another possible approach, suited to very large data sets, is object storage. This replaces the traditional tree-like file system with a flat data structure in which files are located by unique IDs, rather like the DNS system on the internet. That potentially makes handling very large numbers of objects less taxing than with a hierarchical structure.
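The flat namespace can be sketched as a simple key-value mapping. The class and method names below are invented for illustration, not any vendor's API; the point is that an object is retrieved by a single ID lookup, with no directory tree to traverse.

```python
import hashlib

# Minimal sketch of a flat object store: objects are located by a
# unique ID rather than a directory path. Names here are invented.
class ObjectStore:
    def __init__(self):
        self._objects = {}  # flat namespace: ID -> bytes

    def put(self, data: bytes) -> str:
        # A content-derived ID, as some object stores use; note there
        # is no hierarchy to maintain, just the ID itself.
        oid = hashlib.sha256(data).hexdigest()
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        # One flat lookup, however many objects the store holds.
        return self._objects[oid]

store = ObjectStore()
oid = store.put(b"sensor reading 42")
assert store.get(oid) == b"sensor reading 42"
```

Because every lookup is a single ID resolution, the cost of finding an object stays roughly constant as the store grows, which is what makes the model attractive at billions of objects.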
Object storage products are increasingly able to work with big data analytics environments; products include Scality's RING architecture, Dell DX and EMC's Atmos platform.
Hyperscale, big data and ViPR
Then there are the so-called hyperscale compute/storage architectures that have risen to prominence through their use by the likes of Facebook and Google. These use many relatively simple, often commodity, hardware-based compute nodes with direct-attached storage (DAS), typically deployed to power big data analytics environments such as Hadoop.
Unlike traditional enterprise compute and storage infrastructures, hyperscale builds in redundancy at the level of the entire compute/DAS node. If a component fails, the workload fails over to another node and the entire unit is replaced, rather than just the faulty component within.
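That node-level redundancy model can be sketched in a few lines. This is a deliberately simplified illustration with invented names, not any vendor's failover logic: workloads migrate off an unhealthy node to a healthy one, and the whole node is flagged for replacement.

```python
# Hedged sketch of node-level redundancy in a hyperscale cluster:
# a failed node's workloads move to a healthy peer, and the entire
# node is marked for replacement. All names are invented.
class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.workloads = []

def fail_over(cluster):
    """Move workloads off unhealthy nodes; return node names to replace."""
    to_replace = []
    spares = [n for n in cluster if n.healthy]
    for node in cluster:
        if not node.healthy:
            target = spares[0]            # simplistic placement policy
            target.workloads.extend(node.workloads)
            node.workloads = []
            to_replace.append(node.name)  # replace the whole node
    return to_replace

a, b = Node("node-a"), Node("node-b")
a.workloads = ["hadoop-task-1"]
a.healthy = False
print(fail_over([a, b]))  # ['node-a'], with its work now on node-b
```

The design trade-off is operational simplicity: treating the cheap commodity node as the unit of failure avoids diagnosing and swapping individual components in a fleet of thousands.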
This approach has to date been the preserve of very large-scale users such as the web pioneers mentioned.
But that might be set to change as storage suppliers recognise the opportunity (and the threat to them) from such hyperscale architectures, as well as the likely growth in big data comprised of data from myriad sources.
That appears to be what lies behind EMC's launch of its ViPR software-defined storage environment. Announced at EMC World this year, ViPR places a scale-out object overlay across existing storage assets that allows them – EMC and other suppliers' arrays, DAS and commodity storage – to be managed as a single pool. Added to this is the capability to link via APIs to Hadoop and other big data analytics engines, allowing data to be interrogated where it resides.
Also in this space, one startup combines compute and storage systems in a box, marketing its cluster-capable 2U systems with four CPU sockets apiece as hyperscale nodes for Hadoop users. Its systems use SSDs and spinning media, offer data tiering and compression, and can achieve a claimed throughput of up to 2GBps.
This was first published in July 2013