Big data analysis requires a break from the traditional approach of matching back-end infrastructure to application requirements.
Traditional approaches to building storage infrastructure may be wholly unsuitable for the analysis of large, real-time datasets. Enterprise storage can be very application-focused.
IT deploys storage area network (SAN) storage for transactional systems or network-attached storage (NAS) for file storage. Businesses usually think about their applications first and the back-end storage comes afterwards.
Big data needs a different approach, due to the large volumes of data involved. Ovum senior analyst Tim Stammers warns: “There is no clear consensus in the industry in what to sell customers.”
Some suppliers are offering object storage, others clustered, scalable NAS or block-level SANs. “All have their own advantages but it all depends on your environment,” he adds. Suppliers also sell big data appliances with integrated storage, which improves performance but may cause problems for businesses when the data needs to be shared.
Storage for Hadoop
Hadoop, the Apache open-source implementation of Google’s MapReduce algorithm, takes a different approach to processing data from the relational databases used to power transactional systems. Hadoop processes data in parallel.
Data is effectively split across multiple nodes in a large computer cluster, allowing big data to be analysed across a large number of low-cost computing nodes. The cluster can be on-premise or hosted somewhere such as the Amazon cloud. “It maps data to store on computer nodes in a cluster and reduces the amount of data transferred to the cluster,” says Gartner research director Jie Zhang.
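The map and reduce steps described above can be illustrated with a minimal, single-machine sketch in plain Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample chunks are assumptions made for illustration, and the loop over chunks stands in for work that a real cluster would spread across nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate values by key, ready for reduction."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the results."""
    intermediate = []
    for chunk in chunks:  # on a cluster, each chunk would be mapped on its own node
        intermediate.extend(map_phase(chunk))
    return reduce_phase(shuffle(intermediate))


# Hypothetical input: three chunks, as if held on three different nodes.
chunks = ["big data big cluster", "data node cluster", "big node"]
counts = word_count(chunks)
```

Because each chunk is mapped independently, the map step can run wherever the chunk is stored, and only the much smaller intermediate pairs need to cross the network.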
“Traditionally IT infrastructure is siloed and is very vertical, big data uses a scale-out architecture.”
Server farms for big data
Hadoop effectively splits datasets into smaller pieces known as blocks through its file system, the Hadoop Distributed File System (HDFS). Such a cluster puts a heavy load on the network.
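As a rough illustration of how a file becomes blocks spread over a cluster, the sketch below counts the blocks for a file and assigns replicas to nodes. The 128MB block size and three replicas per block are assumptions for the example, the node names are hypothetical, and the round-robin placement is a simplification rather than the placement policy HDFS actually uses.

```python
import math


def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return how many blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(block_count, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only; assumes
    replication <= len(nodes) so the replicas land on different nodes.
    """
    placement = {}
    for block in range(block_count):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# Hypothetical four-node cluster storing a 1,000MB file.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(1_000)
placement = place_blocks(blocks, nodes)
```

Every block ends up on several nodes, which is what lets the cluster keep working when a disk fails, at the cost of extra traffic between nodes.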
According to IBM, for Hadoop deployments using a SAN or NAS, the extra network communication overhead can cause performance bottlenecks, especially for larger clusters. So NAS and SAN-based storage is out of the question. Ovum principal analyst Tony Baer has been looking at how to extend the performance and enterprise-readiness of Hadoop. Given that Hadoop relies on large numbers of low-cost disks, rather than enterprise-grade disk drives, factors such as the mean time between failures quoted by disk manufacturers become significant. In 2010 Facebook had the largest deployment of Hadoop, with a 30PB database.
Now consider using 30,000 1TB drives for storage. For simplicity, assume the installation was built all in one go. If a typical drive has a mean time between failures (MTBF) of 300,000 hours, in a year each will run 8,766 hours. The total number of drive-hours the 30PB storage system will run in a year is 263 million (8,766 x 30,000). Dividing that by the MTBF means about 877 drives will fail in a year, or 2.4 disk drive failures a day.
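The arithmetic above can be checked with a short calculation. This is a sketch under the same simplifying assumptions as the text: every drive runs all year, failures occur at a constant rate set by the MTBF, and failed drives are replaced immediately. The function and variable names are illustrative.

```python
HOURS_PER_YEAR = 8_766  # 365.25 days x 24 hours, the figure used in the text


def expected_annual_failures(drive_count, mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Expected drive failures per year, assuming a constant failure rate."""
    return drive_count * hours_per_year / mtbf_hours


drives = 30_000        # 30,000 x 1TB drives = 30PB
mtbf_hours = 300_000   # manufacturer-quoted mean time between failures

fleet_hours = drives * HOURS_PER_YEAR                       # total drive-hours per year
failures_per_year = expected_annual_failures(drives, mtbf_hours)
failures_per_day = failures_per_year / 365.25

print(f"{fleet_hours:,} drive-hours per year")
print(f"{failures_per_year:.0f} failures per year")
print(f"{failures_per_day:.1f} failures per day")
```

Changing `drives` or `mtbf_hours` shows how quickly the replacement workload grows with cluster size, or shrinks with more reliable disks.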
Luckily, HDFS has built-in redundancy so disk failure does not result in data loss. But one must feel a little sorry for the technician whose job it is to locate and replace the failed drives, even if it becomes part of a regular maintenance routine.
However, in spite of its redundancy capabilities, Baer notes in his latest research paper, Big Data Storage in Hadoop, that HDFS lacks many data protection, security, access and performance optimisation features associated with mature commercial file and data storage subsystems. Many of these features are not essential for existing Hadoop analytic processing patterns. He says: “Because big data analytics is a moving target, the Hadoop platform may have to accommodate features currently offered by more mature file and storage systems, such as support of snapshots, if it is to become accepted as an enterprise analytics platform.”
Big data in the real world cannot be processed by Hadoop, as Hadoop is batch-based. Bill Ruh, vice-president for the software centre at GE, explains the problem: “The amount of data generated by sensor networks on heavy equipment is astounding. A day’s worth of real-time feeds on Twitter amounts to 80GB. One sensor on a blade of a turbine generates 520GB per day and you have 20 of them.” These sensors produce real-time big data that enables GE to manage the efficiency of each blade of every turbine in a wind farm.
The performance of the blades is influenced not only by the weather, but also by the turbulence caused by the turbines in front of them. GE has developed its own software to take on some of the big data processing, but Hadoop is also used.
Gartner’s Zhang says disk drives will become performance bottlenecks. “The hottest trend is SSD [solid-state disk] drives, to eliminate mechanical disk drives,” she explains. Zhang says Hadoop can be configured to mix and match SSD with hard disk drives: “In your disk array, not all the data is accessed all the time. The really important data at any given moment is not a large dataset. This hot data needs to be quickly accessible, so can be migrated to SSD.” Just as in traditional disk tiering, more historical data will be pushed down to cheaper, mechanical disk drives.
The major suppliers are also addressing the real-time aspects of big data analysis, through vertically integrated appliances and in-memory databases such as Hana from SAP. Clive Longbottom, founder of analyst Quocirca, says: “New systems from the likes of IBM with PureData, Oracle with Exadata, and Teradata are providing architected solutions designed to deal with masses of data in real time.”
Cloud-based big data
Investment catches up with growth
The company needed to invest £500,000 in upgrades. Gidda says the company probably needed to invest a further £500,000 three months later just to keep up with the data growth: “We needed an IT organisation the size of the whole business.” In 2009 Razorfish decided to move to Amazon. “We use Elastic MapReduce on Amazon and Cascading, a tool to upload 500GB per day.” This represents a trillion impressions, clicks and actions per day. Processing this amount of data on its old infrastructure used to take three days. “It now takes four hours,” he adds. However, cloud-based processing of big data is not for everyone.
GE’s Ruh explains: “Cloud computing in its present form is not wholly suitable for machine-to-machine (M2M) interactions at GE. We are seeing more processing running on machines, due to latency.” The technology is based around in-memory database systems. Since the datasets are extremely large, Ruh says GE uses NoSQL and Hadoop. “We have also developed our own database for time series analysis,” he says. But GE is also working with Microsoft Azure and Amazon Web Services to investigate how to offload processing to the cloud.