In this guest blog post, Naser Ali, head of solutions marketing for EMEA at Hitachi Vantara, explores the use cases and benefits for using cloud bursting in the enterprise.
When NASA needed to process petabytes of data from its Orbiting Carbon Observatory 2 (OCO-2), it expected to wait 100 days and spend $200,000 on new datacentre hardware to make it work.
Instead it used the Amazon Web Services (AWS) cloud to achieve the same thing for $7,000 in less than six days, allowing NASA engineers to gain deeper insights into the Earth’s carbon uptake.
This ephemeral use of cloud computing, known as ‘cloud bursting’, offers the scientific community an effective, affordable solution for processing high-volume data spikes. But can it be applied in the more earthly business world? Indeed it can. In fact, we’re working with two large global banks that are doing just that.
Banks are the perfect use case for cloud bursting because they process large volumes of data about things like interest rate returns and risk-weighted assets. This data isn’t personal, but requires frequent short bursts of compute power to meet compliance reporting requirements.
Most banks can’t justify the capital expense of investing in on-premises capacity to process data that only comes in intermittently, knowing that the hardware will mostly remain idle. On the other hand, running a 24/7 technology stack in the cloud racks up too much operating expense. Cloud bursting not only strikes the perfect balance between storage capacity and cost, it also gives data scientists a space to ‘play’ and gain new insights.
Cloud bursting in practice
Put simply, the basic steps data science teams take when cloud bursting are: move data into a cloud, spin up some computing power, spin up some storage, process a tranche of data, shut it down and take the results home. But how does this look in practice? One banking customer’s cloud bursting approach follows these steps:
- First it uses our data integration platform to automatically ship and load data to object storage in an AWS cloud, processing it in batches.
- Next it runs a script to fire up Carte, which is a simple web server for remotely executing data transformations (converting data from one format or structure into another) using Amazon Elastic MapReduce (EMR) in Hadoop or the Amazon Redshift data warehouse.
- Once the data is cleaned, transformed and loaded into the temporary cloud, the data scientists can then “play” with the data by running ‘what-if’ scenarios on things like risk and interest rate return rates.
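To make the lifecycle concrete, here is a minimal Python sketch of the spin-up, process and shut-down pattern described above. It is an illustration only, not the bank's actual pipeline: `EphemeralCluster`, `cloud_burst` and `run_burst` are hypothetical names, and the "cluster" is an in-memory stand-in rather than a call to any real cloud API. The point it shows is that teardown is tied to the end of the job, so the temporary environment never outlives the work.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class EphemeralCluster:
    """Hypothetical in-memory stand-in for a temporary cloud environment."""
    name: str
    running: bool = False
    storage: dict = field(default_factory=dict)

    def load(self, batch_id, rows):
        # Ship one batch of data into the temporary storage.
        self.storage[batch_id] = list(rows)

    def transform(self, fn):
        # Apply a transformation to every row of every loaded batch.
        return [fn(row) for rows in self.storage.values() for row in rows]


@contextmanager
def cloud_burst(name):
    """Spin up the environment, hand it to the caller, and always tear it down."""
    cluster = EphemeralCluster(name=name, running=True)
    try:
        yield cluster
    finally:
        # Tear down even if processing fails, so nothing is left running.
        cluster.storage.clear()
        cluster.running = False


def run_burst(batches, transform):
    """Load all batches, process them, and take the results home."""
    with cloud_burst("risk-reporting-burst") as cluster:
        for batch_id, rows in batches.items():
            cluster.load(batch_id, rows)
        results = cluster.transform(transform)
    # By this point the cluster has been shut down; only the results remain.
    return results, cluster
```

In a real deployment the body of `cloud_burst` would call whichever provisioning tooling the organisation uses; the context-manager shape is simply one way to guarantee that the shut-down step always runs.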
The greatest benefit of cloud bursting is agility, but there are hard cost savings as well. This bank had been spending 1 million on hardware and software for its risk reporting Hadoop cluster. We took a subset of its data, 12 million rows, and ran it in AWS for $2.20 and in Google for 50 cents. So you can process a pretty large chunk of data for very little cost and with no upfront investment. Another cost benefit is the ability to use ‘spot pricing’ on cloud capacity. This new type of dynamic pricing saves customers money by letting them run some applications only when spot prices fall below a specified price point. It also allows users to quickly run large jobs by outbidding other customers for available capacity.
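The "run only when the spot price falls below a specified price point" idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name `plan_spot_run` is hypothetical, the hourly prices are made-up numbers, and real spot markets differ between providers in how prices are set and how capacity is reclaimed.

```python
def plan_spot_run(hourly_prices, max_price, hours_needed):
    """Pick the hours in which to run, using only hours priced below max_price.

    Returns (chosen_hour_indexes, total_cost), or (None, None) if the window
    does not contain enough qualifying hours.
    """
    chosen = []
    total = 0.0
    for hour, price in enumerate(hourly_prices):
        if price < max_price:
            chosen.append(hour)
            total += price
            if len(chosen) == hours_needed:
                return chosen, round(total, 4)
    return None, None


# Made-up hourly spot prices (in dollars) for illustration only.
prices = [0.30, 0.12, 0.25, 0.10, 0.11]
print(plan_spot_run(prices, max_price=0.15, hours_needed=2))  # ([1, 3], 0.22)
```

A job that can tolerate waiting simply skips the expensive hours; a job that cannot wait would raise `max_price` instead, which is the "outbidding" case.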
Critical success factors
Although I’ve used banks as an example, cloud bursting for data science is suitable for any industry with similar requirements. Here’s what our experience tells us organisations need in order to succeed with this approach:
- Agile data integration – choose a data integration tool that doesn’t need to be installed on every Hadoop cluster within the cloud. This is the key capability that allows you to spin up a cloud and process data in a matter of hours. Conversely, tools that require you to install software on multiple Hadoop nodes cancel out all the agility advantages of cloud bursting.
- Strong DevOps – organisations that are good at DevOps, particularly continuous integration, will have a smoother ride when it comes to preparing data to be processed in cloud bursts.
- Open standards environment – it’s very important that organisations abstract the data processing from the cloud environment and don’t get locked into a single cloud vendor. Open standards and open frameworks ensure that everything is moveable if necessary.
- Plan for repatriation – what happens if an organisation is successful with cloud bursting but later needs to add secure data to its analysis? In this case the whole compute process and its data would need to be repatriated on premises. This requires an analytics platform that supports a hybrid computing environment and ideally supports container technologies like Docker, which make it fast and easy to repatriate processes and data. This is essential in financial services, where regulatory compliance mandates that systems are in place to prepare for this potential situation.
One giant leap
Following in NASA’s lunar footsteps, banks’ first cloud bursting efforts have been encouraging, achieving multimillion-pound savings annually. These have largely come from reduced total cost of ownership and increased agility.
Of course the ‘giant leap’ organisations make by applying such a practical and cost-effective way to process data is that they minimise their exposure to financial risk.
The Financial Conduct Authority (FCA) has mandated that banks significantly raise their cash reserves and understand their risk exposures to avoid repeating the scenario that brought down Lehman Brothers and led to the RBS bail-out.
In a typical scenario where a bank has investments in many different financial institutions and companies, it takes just one entity to go bust and take down the others. In our era of economic turbulence, this ability to play with data in a temporary space and game different scenarios could affect a company’s very survival.