The importance of analytics as a business tool continues to grow, so it is critical to understand the impacts of analytics applications on storage systems.
Analytics includes a number of variants and sub-specialities, including predictive analytics, in-database analytics, advanced analytics, web analytics, and so on.
Instead of short bursts of activity with response times measured in milliseconds that characterise transactional workloads, analytics requests can be fewer but more complex, and involve large volumes of data.
Input/output (I/O) for analytics use cases can be unpredictable and involve large volumes of temporarily staged data. A single table’s rows may number in the billions, tens of billions, or even hundreds of billions, and it is not uncommon for tables to contain tens or hundreds of columns.
In general, analytics is heavy on read I/O and light on write I/O, which suits flash storage that can be read faster than it can be written and suffers fewer wear issues when reading rather than writing.
However, if the analytics load consists largely of small or variable-sized chunks, solid-state arrays that use fixed-size blocks will not perform as well as systems that support variable-sized blocks.
Within analytics there are a number of workload types: big data environments such as Hadoop and Apache Spark; data warehousing with query-intensive workloads, usually emanating from structured data; streaming, such as ingestion technologies that store raw data and make it available for batch or stream processing; NoSQL for non-tabular data storage; and search, such as that deployed by log file analysis companies like Splunk.
Organisations increasingly need a mix of different analytics capabilities. Some small, data-focused systems can easily shoulder the load of an SQL database on a standard platform.
But information-focused analytics will require something like Hadoop, which has a completely different I/O footprint, while a NoSQL approach creates different I/O demands altogether. These can be boiled down to two basic models: synchronous and asynchronous.
Asynchronous analytics involves three main steps: capture, record and analyse. It is used overnight, for example, to analyse data captured from retail points of sale during the day.
Another asynchronous scenario is off-site web analytics, used to measure website effectiveness and traffic.
Such systems use a traditional relational database management system (RDBMS) to convert the data into a structured format, and so pose storage challenges such as scalability, performance and capacity. I/O attributes can vary hugely, from a small number of massive files (as in genome research) to a massive number of small files or blocks (as in data warehousing, customer data and applications).
Synchronous analytics works in real time. A typical scenario is a social media site that tracks users and delivers ads and other preference-based content as the user browses, which is another name for on-site web analytics. These workloads tend to run on NoSQL databases supported by flash storage for speed, perhaps linked to back-end storage capacity.
Storage architecture for analytics
Analytics tools often read and re-read the same data to build up profiles. Data analytics queries are inherently parallelisable, but they are sensitive to latency when it is time to recombine data streams. This is why so many storage architectures stripe data across multiple physical nodes and multiple spindles. Hadoop is a classic example of this.
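The striping idea can be sketched in a few lines of Python. This is a toy illustration only, not how Hadoop or any particular array implements it: the function names `stripe`, `scan_stripe` and `parallel_total` are hypothetical, lists stand in for storage nodes, and threads stand in for per-node scans.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe(records, node_count):
    """Distribute records round-robin across node_count stripes."""
    stripes = [[] for _ in range(node_count)]
    for index, record in enumerate(records):
        stripes[index % node_count].append(record)
    return stripes


def scan_stripe(stripe_records):
    """Compute a partial aggregate on one stripe, independently of the others."""
    return sum(stripe_records)


def parallel_total(records, node_count=4):
    """Scan every stripe in parallel, then recombine the partial results."""
    stripes = stripe(records, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(scan_stripe, stripes))
    # Recombination cannot finish until the slowest stripe has returned,
    # which is why latency on any single node matters for the whole query.
    return sum(partials)
```

Each stripe is scanned independently, but the final sum has to wait for every partial result, so one slow node delays the whole query.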
Reading high volumes of data with minimal latency is usually achieved by deploying flash as cache in front of spinning disk. However, unless the architecture is tuned for each particular use case, such a design could add to latency rather than reduce it. That’s because a cache miss results in a double query, first to the cache and then to the slower media, which can take longer than querying the slower media in the first instance.
For that reason the right balance between cache and storage is critical, which creates a need to understand the I/O profile of the analytics application.
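The cache trade-off can be made concrete with a simple expected-latency model. The sketch below is a simplification under stated assumptions: a hit costs only the cache latency, while a miss costs a lookup overhead plus the disk latency. The function names and the parameter `miss_overhead_ms` are hypothetical, and real arrays will behave less tidily.

```python
def effective_latency_ms(hit_ratio, cache_ms, disk_ms, miss_overhead_ms):
    """Expected read latency for flash cache in front of disk.

    Assumed model: a hit costs cache_ms; a miss costs the wasted cache
    lookup (miss_overhead_ms) plus the disk read (disk_ms).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_cost = miss_overhead_ms + disk_ms
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * miss_cost


def break_even_hit_ratio(cache_ms, disk_ms, miss_overhead_ms):
    """Hit ratio at which the cached design merely matches disk alone.

    Solves h*cache + (1-h)*(overhead + disk) = disk for h.
    Assumes disk_ms is greater than cache_ms.
    """
    return miss_overhead_ms / (miss_overhead_ms + disk_ms - cache_ms)
```

With illustrative figures of 0.1ms for cache, 5ms for disk and 0.5ms of miss overhead, the break-even hit ratio is roughly 9%. Below that, the cached design is slower than reading from disk directly.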
So, you need access to detailed metrics that show performance and resource usage. That way the effects of changes can be assessed and measured against the cost of additional resources.
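As a minimal sketch of what such metrics might look like, the following summarises a recorded I/O trace. The trace format, a sequence of `(operation, size_in_bytes)` pairs, and the function name `io_profile` are assumptions made for illustration. A real monitoring tool would expose far richer data.

```python
from collections import Counter
from statistics import mean


def io_profile(trace):
    """Summarise a trace of (op, size_bytes) pairs, where op is 'read' or 'write'."""
    ops = Counter()
    sizes = []
    for op, size in trace:
        ops[op] += 1
        sizes.append(size)
    total = sum(ops.values())
    if total == 0:
        raise ValueError("trace is empty")
    return {
        "read_fraction": ops["read"] / total,  # read-heavy loads tend to suit flash
        "mean_size_bytes": mean(sizes),        # typical request size
        "distinct_sizes": len(set(sizes)),     # many sizes suggests variable-sized I/O
    }
```

A high read fraction with few distinct request sizes would point towards a flash-friendly profile, while a wide spread of sizes suggests the variable-sized chunk case discussed earlier.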
Given the I/O intensity of analytics workloads, low latency for the active data set is key to the performance of analytics applications.
Looking to the future
As analytics workloads and storage systems become increasingly virtualised, it becomes even more difficult to untangle the specific workloads resulting from, for example, a run of an analytics report against a background of other VM activity.
Mitigating actions are possible, such as identifying temporal hotspots (business reports that are typically run on the last day of the month, for example) and automating reconfiguration to ensure these are prioritised.
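A month-end hotspot of this kind is easy to detect programmatically. The sketch below is purely illustrative: the function names and the priority labels `"standard"` and `"high"` are hypothetical, and hooking the result into a real storage quality-of-service control would depend on the product in use.

```python
import calendar
from datetime import date


def is_month_end(day: date) -> bool:
    """Return True if day is the last calendar day of its month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day == last_day


def storage_priority(day: date, normal="standard", boosted="high") -> str:
    """Pick a (hypothetical) priority label for analytics I/O on a given day."""
    return boosted if is_month_end(day) else normal
```

A scheduler could call `storage_priority(date.today())` before the nightly run and raise the priority of reporting workloads on the days when they are known to peak.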
However, as memory becomes cheaper, in-memory analytics will become more common, making the storage architecture less of a bottleneck for all but the very largest of datasets.
Cisco IT Hadoop Platform & HyperFlex
The compute building block of the Hadoop Platform is the Cisco UCS C240 M3 Rack Server, a 2U server with 256GB RAM and 24TB of local storage, 22TB of which is available for the Hadoop distributed file system (HDFS). The platform consists of four racks, each containing 16 server nodes that support 384TB of raw storage per rack. As with most clustered systems, data and metadata are replicated across nodes for high availability.
Cisco describes its HyperFlex Systems, also built on the company’s UCS platform, as a low-latency, scale-out system. The HX Data Platform controller stripes data across the cluster’s Ethernet-connected, flash and/or mechanical drive nodes into a distributed, multi-tier, object-based data store. Although it is not marketed specifically for large analytics projects, smaller projects may be able to repurpose it.
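The capacity figures can be checked with some quick arithmetic. In the sketch below, the node, rack and per-node storage numbers come from the description above. The replication factor of three is an assumption, chosen because it is the HDFS default, and is not a figure given for this platform.

```python
NODES_PER_RACK = 16
RACKS = 4
RAW_TB_PER_NODE = 24
HDFS_TB_PER_NODE = 22
ASSUMED_REPLICATION_FACTOR = 3  # assumption: HDFS default, not stated for this platform


def cluster_capacity(replication_factor=ASSUMED_REPLICATION_FACTOR):
    """Return raw, HDFS-available and post-replication capacity in TB."""
    raw_per_rack = NODES_PER_RACK * RAW_TB_PER_NODE
    hdfs_per_rack = NODES_PER_RACK * HDFS_TB_PER_NODE
    hdfs_total = hdfs_per_rack * RACKS
    return {
        "raw_per_rack_tb": raw_per_rack,
        "raw_total_tb": raw_per_rack * RACKS,
        "hdfs_total_tb": hdfs_total,
        # Each block is stored replication_factor times, so unique data is smaller.
        "unique_data_tb": hdfs_total / replication_factor,
    }
```

The raw figure works out at 384TB per rack, matching the stated number, and 1,536TB across four racks. Of that, 1,408TB is available to HDFS, which under the assumed three-way replication would hold roughly 469TB of unique data.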
EMC Isilon Scale-out Storage Solutions for Hadoop
EMC Isilon is a scale-out NAS platform that supports multiple instances of Apache Hadoop distributions from different suppliers, and integrates HDFS, which enables in-place analytics and avoids the cost of a separate Hadoop infrastructure.
It includes data deduplication, scales from 18TB to more than 20PB in a single cluster, and includes the high-availability OneFS operating system to eliminate the NameNode as a single point of failure in the Hadoop/HDFS storage layer.
IBM DS8880F
IBM offers the Power8-based IBM DS8880F appliance, which it claims delivers “consistent microsecond application response times and uncompromised availability” and which integrates with IBM z Systems mainframes.
IBM claims the 8880F delivers up to 2.5 million input/output operations per second (IOPS) in random I/O workload environments. The Fibre Channel-connected enclosure supports flash cards from 400GB to 1.23PB of raw capacity in a 4U rack and supports RAID levels 5, 6 and 10.
Hitachi Content Platform
HCP is a multi-node object storage platform that distributes functions such as metadata management and storage placement across all nodes. The system consists of a series of access nodes (HCP G) that sit in front of an HCP cluster of Ethernet-attached HCP S nodes.
The G nodes perform service and management functions, and virtualise and federate back-end capacity supplied by S nodes, which can access block, file, object or cloud storage. The system scales by adding more nodes, and Hitachi Data Systems claims that system capacity is unlimited.
Teradata Appliance for Hadoop and Integrated Big Data Platform
Teradata offers its Appliance for Hadoop in configurations with dual 12-core 2.5GHz Xeon processors or dual eight-core 2.6GHz Xeon processors, with configurable memory options from 128GB to 512GB. Each InfiniBand-connected node includes 12 4TB or 8TB drives, and from 256 to 768GB RAM, and runs an optimised version of Hadoop from Hortonworks and Cloudera.
Teradata also offers its Integrated Big Data Platform as part of its workload-specific family of systems. The system offers up to 2,048 active nodes, scaling up to 341PB on 3TB or 4TB spindles at 168 per cabinet, with up to 512GB RAM per node. All run the Teradata Database, which incorporates an analytics engine and can access data from other systems including the Hadoop appliance.