But what exactly is Hadoop and what are the key points of Hadoop storage strategy?
Hadoop is a highly scalable analytics platform for processing large volumes of structured and unstructured data. By large scale, we mean multiple petabytes of data spread across hundreds or thousands of physical storage servers or nodes.
Hadoop, developed in 2005 and now an open source platform managed under the Apache Software Foundation, uses a concept known as MapReduce that is composed of two separate functions.
The Map step inputs data and breaks it down for processing across nodes within a Hadoop instance. These "worker" nodes may in turn break the data down further for processing. In the Reduce step, the processed data is then collected back together and assembled into a format based on the original query being performed.
To cope with truly massive-scale data analysis, Hadoop's developers implemented a scale-out architecture, based on many low-cost physical servers with distributed processing of data queries during the Map operation. Their logic was to enable a Hadoop system capable of processing many parts of a query in parallel to reduce execution times as much as possible. This can be contrasted with legacy-structured database design that looks to scale up within a single server by using faster processors, more memory and fast shared storage.
Looking at the storage layer, the design aim for Hadoop is to execute the distributed processing with the minimum latency possible. This is achieved by executing Map processing on the node that stores the data, a concept known as data locality. As a result, Hadoop implementations can use SATA drives directly connected to the server, thereby keeping the overall cost of the system as low as possible.
To implement the data storage layer, Hadoop uses a feature known as HDFS or the Hadoop Distributed File System. HDFS is not a file system in the traditional sense and isn't usually directly mounted for a user to view (although there are some tools available to achieve this), which can sometimes make the concept difficult to understand; it's perhaps better to think of it simply as a Hadoop data store.
HDFS instances are divided into two components: the namenode, which maintains metadata to track the placement of physical data across the Hadoop instance and datanodes, which actually store the data.
You can run multiple logical datanodes on a single server, but a typical implementation will run only one per server across an instance. HDFS supports a single file system name space, which stores data in a traditional hierarchical format of directories and files. Across an instance, data is divided into 64MB chunks that are triple-mirrored across the cluster to provide resiliency. Obviously in very large Hadoop clusters, component or even entire server failure will occur so the duplication of data across many servers is a key design requirement of HDFS.
Looking at the core features of Hadoop, how do they translate to storage?
Supporting Hadoop on shared storage doesn't work in the traditional sense, as workload and storage distribution are inherent to Hadoop. However, we are seeing storage supplier products that support Hadoop natively and they point the way to a likely future direction in storage of large-scale distributed architectures.
As already discussed, Hadoop was designed to move compute closer to data and to make use of massive scale-out capabilities. This doesn't fit well with traditional SAN implementations, which have a much higher cost per GB to deploy than can be achieved using local direct-attached storage (DAS).
It certainly isn't practical to consider using Fibre Channel in HDFS deployments due to the sheer cost of implementation in terms of host bus adaptors (HBA) and SAN ports. In addition, HDFS is designed to cater for streaming data, as Hadoop transactions typically write data once across the cluster then read it many times. This works well with directly-attached SATA drives but not so well with shared storage environments where the same underlying physical disk is used to support the Hadoop cluster.
However, there are scenarios where existing storage solutions could be used. For example, Hadoop nodes could be deployed with tiered storage to gain additional performance from flash SSD, PCIe flash and enterprise-class 15,000rpm hard disks.
Tiering can be achieved using open source software such as Flashcache, developed by Facebook, which keeps actively used blocks of data in high performance storage. There would be an associated cost to deploying flash into servers and this would need to be considered against the efficiency of each node but adding flash may make nodes capable of processing more queries.
The other main option is to look at storage suppliers that provide native HDFS support in their existing products, particularly those that are scale-out solutions and so have the same distributed nature in their own design.
Using a HDFS-ready storage solution can provide a number of benefits. Firstly, compute and storage can be scaled independently rather than within the fixed capacity of a node. There is also the option to provide faster data ingest and to view file contents directly in the cluster. Both data and capacity can be shared between multiple Hadoop instances and an increased level of protection around the HDFS metadata can be provided. In a standard Hadoop deployment, the namenode is a single point of failure, but can be manually replicated.
While HDFS provides features for automated data recovery and integrity checking, an HDFS-ready storage solution can offload this work (leaving the Hadoop instance able to process more) and reduce the need to maintain three copies of data across an instance.
Hadoop storage supplier roundup
EMC's Isilon scale-out NAS platform provides native support for Hadoop and adds extra features including a distributed namenode, data protection through snapshots and NDMP backups and multi-protocol support. An Isilon cluster can be used for multiple workloads, making it a good way to evaluate Hadoop solutions without large-scale expensive deployments.
Cleversafe supports Hadoop as a replacement for the HDFS storage layer. Its technology disperses data across multiple storage appliances, providing additional resilience and performance even across geographic boundaries. As the number of storage nodes in a Cleversafe solution increases, resiliency also increases without requiring additional capacity, making it an efficient storage solution.
NetApp's Open Solution for Hadoop is based on the E2660 direct-attached storage appliance and an FAS2040 controller that manages the namenode. The E2660 provides RAID protection to the data, reducing the need to retain multiple copies across a Hadoop instance and connects via 6Gbps SAS to up to four physical servers acting as datanodes.
Hitachi Data Systems (HDS) has a reference architecture for Hadoop based on the Cloudera Hadoop distribution. This uses HDS CR220S server product with integrated 3TB SATA drives, but doesn't use any of HDS's core storage products. This is also the approach taken by HP with their AppSystem for Apache Hadoop solution, which also uses direct-attached storage within the implementation.
There are other solutions in development that look to replace HDFS, including Lustre support from Intel, GPFS from IBM and open source solutions including Ceph and Cassandra. Other than the Intel distribution, these aren't directly supported vendor solutions as yet.
MapReduce being replaced in the eyes of Hadoop systems users