Big data is becoming an important analytical tool for customer retention and product positioning. But, as IT organisations deal with the challenges of implementing big data analytics, they must also determine their big data storage needs.
Existing storage architectures have developed to meet the needs of traditional enterprise application workloads. So, as firms start working with big data, they may need to re-think their storage environment to best support analytics activity and decide between direct-attached storage (DAS), SAN, scale-out NAS, object storage or dedicated big data appliances.
Two-phase big data storage strategies
To address big data storage needs, firms first need to understand how their data can help them and how they plan to use it.
IBM’s John Easton is a distinguished engineer for advanced analytics infrastructures, based at the IBM Client Centre in Hursley, UK, where many analytics proof-of-concept activities from around Europe are tested.
Easton points to two distinct phases of big data analytics implementation that he has seen in client installations: pilot projects in the lab, and implementation in the production environment.
In the initial phase, big data pilot projects in the IT organisation work with data sets to define models for the use of analytics in the enterprise, with a focus on customer knowledge and retention, one of the key use cases for big data today. Many focus on historic analytics, looking at why things occurred, rather than predictive analytics, because the data sets involved are smaller and more specific in nature.
For these kinds of big data projects, Easton sees IT teams using analytics clusters or grids of industry-standard x86 or IA-64 servers with internal or dedicated storage, along with application software. Often he finds these servers, with direct-attached storage, are used to power big data analytics environments such as Hadoop.
Proof-of-concept pilots seem to prefer a platform with native Hadoop integration. Many see Hadoop as useful for service innovation, including analysis of secondary data sets for modelling of “if-then” scenarios for products and services.
ABN AMRO’s big data lab
A good example of the initial phase of big data analytics work is ABN AMRO’s lab in Amsterdam, which was set up in 2012 to find new ways to analyse internal data enhanced with external data.
According to Rob Wijhenke, manager at ABN AMRO’s big data lab, the aim “is experimenting to learn in order to . . . improve both historic and predictive analytical abilities [for] customer-facing marketing activities”.
In the lab, the bank works with different analytic models on small but complex data sets that can provide new insights with production potential. The work forms part of a project known as ABN AMRO 2020, which aims to create an agile, cost-efficient IT environment. The initial lab infrastructure was supported by existing relationships with suppliers such as IBM and TCS.
Big data storage choices in the lab are “more focused on linear scalability than in a production environment, which is usually more appliance-based. Appliances can be expensive and difficult to scale in a production environment,” says Wijhenke.
He added: “With big data one should focus on linear scalability. Hadoop-like vendor appliances are easy to expand, as you can just buy a new box, but you typically pay a very high price for the next box. A standard Hadoop-based setup as Apache designed it is linearly scalable and therefore cheap.” However, he also noted that in a Hadoop ecosystem you can lose some storage space, depending on how HDFS is configured.
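Wijhenke’s point about lost storage space comes from the way HDFS replicates every block (three copies by default, set by `dfs.replication`). A minimal sketch of the arithmetic, using illustrative figures that are assumptions rather than ABN AMRO’s, looks like this:

```python
# Sketch of HDFS usable-capacity arithmetic. All figures are illustrative.
# HDFS stores each block `dfs.replication` times (default 3), so usable
# space is roughly raw capacity divided by the replication factor, after
# reserving some headroom for temporary/shuffle data and the OS.

def hdfs_usable_tb(raw_tb, replication=3, headroom=0.25):
    """Estimate usable HDFS capacity in TB.

    raw_tb      -- total raw disk capacity across the cluster
    replication -- dfs.replication setting (default 3)
    headroom    -- fraction reserved for non-HDFS use
    """
    return raw_tb * (1 - headroom) / replication

# A hypothetical 12-node cluster with 24 TB of raw disk per node:
raw = 12 * 24  # 288 TB raw
print(hdfs_usable_tb(raw))  # 72.0 TB usable under these assumptions
```

Lowering the replication factor recovers capacity, but at the cost of the redundancy that makes commodity-server Hadoop clusters resilient in the first place.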
The bank’s big data lab uses SPSS and R analytics in an IBM InfoSphere BigInsights (Hadoop) cluster. ABN AMRO’s challenge with big data storage in the financial services environment includes not only the security principles of infrastructural design, but also the fact that the loading and offloading of data is regulated.
Big data storage in the production environment
In production environments, key uses of big data range from knowing the customer better to gaining data efficiencies for operational purposes.
IBM’s Easton has seen big data applications leverage existing storage systems for use cases such as gains in operational efficiency, building better products, and management of risk and security. For these types of use cases, production analytics environments typically have smaller data sets and so can be served by more traditional enterprise storage approaches.
An example is UPC Nederland, part of Liberty Global Services and the second-largest cable operator in the Netherlands, which provides cable television, broadband internet and telephone services to residential and commercial customers.
Van de Aa, a business analyst at UPC, says: “With dynamic IP addressing more traffic is generated, with records that need to be stored and maintained. And when you add roaming and hotspots you have a shorter timeframe and lots of traffic, so the question is how to best store it for analysis and compliance purposes.”
This often involves a trade-off: the most recent data needs to be close to hand, but data records need to be cleaned and archived in a timely manner for future use. In terms of a storage strategy, UPC’s choice is software-centric. “For now we manage data retention in an Oracle 11g database environment. To keep data access within acceptable time frames we partition the data on a weekly basis. So, all data is directly accessible via the Oracle database,” explains van de Aa.
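The weekly partitioning van de Aa describes keeps queries over a recent window cheap, because each query touches only the handful of partitions that cover it. A minimal sketch of the routing idea, not UPC’s actual schema, labels each record with the ISO year and week of its timestamp:

```python
# Illustrative sketch of week-based partition routing (not UPC's schema).
# Each record is assigned to a partition named after the ISO year and week
# of its timestamp; queries over a recent window then only need to scan
# the partitions for those weeks, while older partitions can be archived.

from datetime import date

def week_partition(d):
    """Return a partition label like 'P2014W07' for a record date."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"P{iso_year}W{iso_week:02d}"

print(week_partition(date(2014, 2, 14)))  # P2014W07
```

In Oracle itself this would typically be expressed as range or interval partitioning on the record’s timestamp column, with old partitions dropped or moved to cheaper storage as part of the archiving routine.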
At the moment, UPC’s database contains about 5TB of physical data. On its current platform, it will grow to about 10TB by the end of this year. At that point, UPC may consider a move to a third-party data storage solution.
As the business analyst making this storage forecast, van de Aa finds calculating the cost of storage harder than it used to be. The cost of storage to the company has changed over time, and data growth is such that the yearly increase in volume is no longer as predictable as it once was.
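The shape of such a forecast can be sketched in a few lines. The figures below take the article’s numbers (5TB now, roughly 10TB by year end) and, purely as an assumption, carry that doubling rate forward; the growth rate itself is exactly the uncertain input van de Aa describes:

```python
# Minimal compound-growth capacity forecast. The starting figure comes
# from the article (5 TB, ~10 TB by year end); assuming that doubling
# rate continues is our illustrative assumption, not UPC's plan.

def project_capacity(start_tb, yearly_growth, years):
    """Project capacity after `years` at a constant `yearly_growth` rate."""
    return start_tb * (1 + yearly_growth) ** years

# Doubling in a year corresponds to 100% yearly growth.
print(project_capacity(5, 1.0, 1))  # 10.0 TB after one year
print(project_capacity(5, 1.0, 3))  # 40.0 TB after three years, if growth held
```

The exercise shows why the forecast is hard: a small error in the assumed growth rate compounds quickly, which is what makes the yearly increase in volume, and hence cost, difficult to pin down in advance.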
These Benelux use cases show that firms can use different tools and approaches to lessen the impact of storing ever-growing data sets.
The need for scalability in the production environment may limit the flexibility of the storage solution, so what works in the lab for a pilot project does not necessarily work in normal operations. And once you reach the production environment, the ability to scale storage is constrained by several operational factors, including regulatory compliance requirements for data handling.