Done well, big data offers businesses the chance to gain a competitive edge by understanding their customers and staying ahead of market trends.
But managing and storing huge volumes of data requires careful planning. Securing data, meeting the requirements of regulators and ensuring critical data is properly backed up are major challenges for the CIO.
But big data doesn’t necessarily mean big infrastructure, a meeting of IT leaders at Computer Weekly’s 500 Club heard. Like space and time, big data is a relative concept, and it does not always mean analysing petabytes of information.
Big data is any data that is too big, moves too fast or doesn’t fit the constraints of your existing databases, says Robert White, executive director for the infrastructure group at investment bank Morgan Stanley (see panel below).
“You only need to move into this paradigm when you are exceeding what you can do with the technology that you have,” he told the meeting. “What is big data to me, may not be big data to you.”
Big data is nothing new
Sean Sadler, head of infrastructure solutions at King’s College London
Robert White, executive director for enterprise infrastructure at Morgan Stanley
Big data may have become a big issue for IT suppliers over the last couple of years, but the truth is that IT departments have been processing large volumes of fast-moving data for far longer. The finance industry got to grips with big data 15 years ago, and the principles learned then apply just as well today.
“Ten or 15 years ago, we were working with time series databases,” says White. “It wasn’t called big data, but you look at what it was doing and it was a kind of fire hydrant of non-stop market data, and we captured it.”
The best place to start is to focus on asking the right questions. If you know what you want to find out from your data, many of the questions about which infrastructure to use become clearer.
Real-time processing is not essential
Often you can get the answers you need without having to process data in real time.
A definition of big data
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Source: O’Reilly Strata
Businesses can save themselves a lot of money and effort by aggregating data, and analysing it in a more leisurely way.
Most organisations don’t need to respond to every data tick, every like on Facebook, or every hashtag on Twitter, says White.
“Don’t get sucked in, thinking I must pull everything off the internet that has anything to do with my organisation, and what we need now is a big data solution. That is going to be a very expensive way forward,” he said.
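As a rough illustration of aggregating rather than reacting to every tick, the sketch below rolls individual price ticks up into one summary per minute, which can then be analysed at leisure. The function name `aggregate_ticks`, the `(epoch_seconds, price)` input format and the one-minute bucket are assumptions made for the example, not anything described at the meeting.

```python
from collections import defaultdict


def aggregate_ticks(ticks, bucket_seconds=60):
    """Roll (epoch_seconds, price) ticks up into one summary per time bucket.

    Hypothetical helper: assumes ticks arrive in time order, so the first
    and last prices seen in a bucket are its open and close.
    """
    buckets = defaultdict(list)
    for epoch_seconds, price in ticks:
        # Truncate the timestamp to the start of its bucket.
        whole = int(epoch_seconds)
        bucket_start = whole - (whole % bucket_seconds)
        buckets[bucket_start].append(price)

    summary = {}
    for bucket_start in sorted(buckets):
        prices = buckets[bucket_start]
        summary[bucket_start] = {
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "count": len(prices),
        }
    return summary
```

Storing one summary row per minute instead of every tick is one way to keep the data within reach of an existing database, at the cost of losing the detail inside each bucket.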
How to manage unstructured data
By its very nature, big data is often unstructured and does not fit neatly into the relational databases used by most organisations. Videos and comments on social media sites such as Twitter are not easy to manage.
There are specialist database technologies that can analyse unstructured data. IBM, for example, offers a database called Optim, which is capable of analysing unstructured data from a wide range of sources. The database creates a dictionary, which is able to pull together data on the same subject from different data streams.
But for many organisations, it may well be easier and more cost-effective to convert unstructured data into a format that will work with their existing systems.
“One of the decisions you have to make in your organisation is do you put the investment into dealing with that unstructured data or do you invest in a conversion process,” says White.
Rather than taking raw feeds from Twitter and Facebook, it might make more sense to process the data, add some structure to it and use your existing infrastructure to process it.
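A conversion step of this kind might look like the following sketch, which pulls hashtags out of free-text posts and loads them into an ordinary relational table. The post format, the `mentions` table and the function names are hypothetical; SQLite stands in for whatever relational database the organisation already runs.

```python
import re
import sqlite3

HASHTAG = re.compile(r"#(\w+)")


def to_rows(posts):
    """Turn raw (author, text) posts into structured (author, hashtag) rows."""
    rows = []
    for author, text in posts:
        for tag in HASHTAG.findall(text):
            rows.append((author, tag.lower()))
    return rows


def load(rows):
    """Load the rows into an in-memory relational table (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (author TEXT, hashtag TEXT)")
    conn.executemany("INSERT INTO mentions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def top_hashtags(conn):
    """Once the data is relational, plain SQL answers the business question."""
    return conn.execute(
        "SELECT hashtag, COUNT(*) AS n FROM mentions "
        "GROUP BY hashtag ORDER BY n DESC, hashtag"
    ).fetchall()
```

The point of the design is that everything after `to_rows` uses tools the organisation already knows; only the conversion step is new.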
Big data suppliers
Most big IT suppliers offer good big data technology and it often makes sense to stick with the suppliers you already have an established relationship with.
Oracle: established big data player, which offers the MySQL database, acquired with Sun Microsystems.
One of the “grand-daddies” of the database world, offering a big range of infrastructure products and integration services.
Microsoft has stuck with its SQL Server roots, but has formed partnerships to offer bolt-on Hadoop capability.
Rapidly growing company
SAP: owner of Sybase and creator of Hana, an in-memory database that is effectively a relational database with a cache layer built on top.
Source: Robert White, Morgan Stanley
“If your organisation is very good at dealing with relational data, and has all the tools for that, maybe you should be looking to convert the data instead into a format that you are used to dealing with and can extract value from,” says White.
If analysing unstructured data is essential, there are specialist tools that will help you. But if your focus is analysing customers’ comments on social media, it may make more sense to hire an agency to do the work for you.
Companies such as Amazon use a mixture of computer algorithms and human analysis to interpret meaning from social media. For example, it still takes a human to work out whether an exclamation mark in a product review indicates sarcasm or a genuine compliment. So outsourcing this work can be an effective solution.
Another option is to use a third party to clean and aggregate your data before you analyse it, says White. “Go to a supplier who has good credence in your world, and a reputation for cleaning data, and who has some understanding of it,” he said.
Managing historic data
Once you know what data you want to analyse, it is worth considering how long you will need to keep it.
Regulators require Morgan Stanley, as a financial services firm, to record data for 7 to 10 years.
The problem is, says White, that you may back up the data from version 7 of a database. But by the time you need to restore it, the supplier has moved on to version 12, which is incompatible with the original data.
There are two ways to deal with the issue. One common approach is to migrate back-ups to the latest version of the database whenever the supplier upgrades.
“So that means you have to actually consciously migrate back-ups,” he says. “It's quite a lot of hassle to deal with.”
Morgan Stanley’s approach is to save data in a generic format, almost a text file, that can be adapted to any future version of the database.
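The idea of archiving in a generic text format can be sketched as below: a table is dumped to CSV, which any future database version should be able to load. This is an illustrative sketch using SQLite and a hypothetical `export_table_as_csv` helper, not a description of Morgan Stanley's actual process.

```python
import csv
import io
import sqlite3


def export_table_as_csv(conn, table):
    """Dump one table to CSV text, with a header row of column names.

    The table name is interpolated into the query, so pass trusted names only.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description holds one tuple per column; element 0 is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()
```

A plain-text archive loses database-specific features such as indexes and types, so in practice the schema would need to be recorded alongside the data.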
Ensuring data does not deteriorate over time is another potential headache.
In practice, uploading historic data into the database will usually give you confidence that data has not been corrupted, says White.
But if you want further assurance, it is possible to run test processes and introduce “checksums”, mathematical functions that allow you to check that data has not changed since it was stored.
Most organisations don’t go quite that far. “I think it's quite a neglected area,” says White. “You always have 101 things to do on the list, and that is probably going to be 102.”
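For readers who do want to go that far, a checksum check can be as small as the sketch below: record a digest when the back-up is written and compare it when the data is restored. SHA-256 is chosen here as an assumption; the function names are illustrative.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest to store alongside the backed-up data."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """Recompute the digest on restore; a mismatch means the data has changed."""
    return checksum(data) == expected
```

A matching digest shows the bytes are unchanged since the back-up was taken; it says nothing about whether the data was correct in the first place.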
De-duplication and compression
Modern databases come with built-in compression technology, saving valuable storage space. They use algorithms that replace values that repeat throughout a database with a symbol that consumes much less space.
For example, the word “London” might be replaced by a single number, which would be translated back into “London” again when the database is processed.
“The beauty of it is you can turn it on, by saying you want to add compression to a table or to a whole file,” says White.
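The “London” example corresponds to what is commonly called dictionary encoding. The sketch below is a simplified illustration of that idea, not how any particular database implements its compression.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            # Assign the next unused code the first time a value appears.
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Translate codes back into the original values."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[code] for code in codes]
```

Each repeated string is stored once in the dictionary, so a column with few distinct values shrinks to a list of small numbers.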
Building a big data capability goes hand in hand with building a big analytics capability and that means having the right people in place to make sense of the data.
There has been much discussion about the growing demand for skilled data scientists to help companies make sense of their data.
Data scientists are in short supply and they are commanding high salaries.
But most organisations will find existing employees more than equal to the work required, says White.
“It's true that, in some professional fields, you are going to have to employ data scientists, but I don’t think that is going to be the case for most,” he says.
Chris in marketing and Joe in accounts will know the right questions to ask, if you give them the tools to look for the answers, he says.
These days, compression technology means it is possible for non-specialists to analyse huge databases in standard office spreadsheets such as Excel. Files of up to 8GB are not uncommon.
“Business users love things like that because it’s a product they are familiar with and they use every day,” says White. “Suddenly it gives them an ability to process the big data volumes we are providing on an infrastructure level, in a tool they can use.”
Locking down the organisation
One of the challenges for any organisation is how to separate personal data from corporate data on the IT infrastructure.
Financial services company Morgan Stanley has sidestepped the issue by banning all personal data from the organisation.
The move is essential for a regulated company that has to guard against market-sensitive data leaking from the company’s trading floors.
“We have different devices for work and personal use so people are not allowed to use their own BlackBerry to log into the firm's systems,” says Robert White, executive director of Morgan Stanley’s infrastructure group.
However, as employees become more used to using their own mobile phones and computers at work, regulated companies will need to remain vigilant.
There has been some talk, for instance, about banning personal mobile phones on trading floors, to ensure market-sensitive data is not passed to third parties.
And Morgan Stanley locks down all PCs, so employees cannot use external social networking sites.
“Definitely regulators are worried about social media, almost as a way of doing insider trading,” says White.
Regulators have the capability to monitor company IP addresses for potential breaches, White revealed.
“Fortunately we can hide behind the regulators a little bit. We can say that even if we wanted to permit personal devices or social media, the regulators would just not be happy about that,” he says.
The challenge with multiple spreadsheets will be to ensure the company has a “single version of the truth”. Companies will need to get a lot smarter about centralising their spreadsheets.
Choice of supplier
Choosing which supplier to go with is always difficult. The good news is that most of the major suppliers have got to grips with big data. So it makes sense to work with the suppliers you have already built relationships with, says White.
Companies such as Oracle, IBM and Microsoft have been around a long time, and they are aware of the pitfalls of data analysis.
“Don’t necessarily think it’s a new world, and jump to a different provider,” he says. “If you are used to dealing with these people, leverage those relationships.”
These companies have strong developer communities and they are thought leaders in their fields. Big data may not be there yet, but it is only a matter of time before non-relational databases reach the state that relational databases are in today.
Keep calm and carry on
Fundamentally, however, big data is no different to any other IT problem. IT has already gone through batch data processing and real-time data processing.
Now it is big data processing, but while the technology is changing, the same principles of common sense and good business practice apply.
“Don’t let the hype suck you in,” says White. “Keep your head up. Don’t panic, and just apply the same logic that you would apply to everything else that you do.”
Enterprise storage strategy checklist
Sean Sadler, head of infrastructure solutions at King’s College London, offers a checklist.
- Optimise your network
Consider the speed and frequency with which you need to access data.
If your network performance is critical, then you will need tiered storage: fast, low-capacity disks for high-performance applications and lower-speed, high-capacity disks for less critical applications.
Consolidating your systems will improve the performance of your network, particularly if your systems are distributed.
If you have high-performance computing requirements, make sure you have the disk capacity necessary to support them.
- Capacity planning
No-one has an ever-expanding budget, so you need to look at how long you really need to keep data for.
As data volumes grow, you need to plan how you are going to manage back-up and restoration. Options include disk-to-disk, tape-to-disk, or off-site storage.
Use de-duplication technology to reduce storage requirements and increase the speed of back-ups. De-duplication will also help you with backing-up data in the cloud, if that is what you decide to do.
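To show why de-duplication saves space, the sketch below stores each distinct block of data once, keyed by its hash, and keeps an ordered list of hashes from which the original can be rebuilt. Fixed blocks supplied by the caller and SHA-256 hashing are assumptions made for the example; commercial products use their own schemes.

```python
import hashlib


def deduplicate(blocks):
    """Store each distinct block once and record how to rebuild the original."""
    store = {}    # digest -> block contents, one entry per unique block
    recipe = []   # ordered digests describing the original sequence
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original byte stream from the stored blocks."""
    return b"".join(store[digest] for digest in recipe)
```

Only blocks not already in the store need to be copied, which is also why de-duplication can speed up back-ups over a network or to the cloud.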
- Using the cloud for scalability
Make sure you can cater for the storage requirements of the organisation two or three years ahead, either by using a private cloud, a private datacentre or hybrid cloud capacity.
Many organisations are concerned about the security of cloud storage, so it makes sense to start with a pilot programme. Concentrate on back-ups and disaster recovery to prove the concept.
When you choose a cloud service, ask how easy it is to migrate from the service with your data intact and in a state that you can manipulate and use.
Be aware of hidden costs. A cloud service might appear cheap but there are likely to be hidden service management costs.
Make sure your cloud provider adheres to your security requirements. The Cloud Security Alliance, an independent body, offers advice on the questions you need to ask.
Consider data protection. Ask your cloud service provider where they store your data. Are you required to store it in the European Union?
Acid versus Base
The world of big data means taking a different approach to thinking about data.
Traditionally, database transactions follow a set of rules, known by the acronym Acid.
But in the world of big data, organisations are turning to a more flexible approach, known by the acronym Base.
Base makes more sense for analysing high-speed data flows, but it requires organisations to be more accepting of approximations rather than definitive answers.
Acid:
- Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
- Consistent: A transaction cannot leave the database in an inconsistent state.
- Isolated: Transactions cannot interfere with each other.
- Durable: Completed transactions persist, even when servers restart.
Base:
- Basically available
- Soft state
- Eventually consistent
For a good explanation of Acid versus Base, read this article by John D Cook.
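The Atomic property in the Acid list can be seen in a few lines of code. In this sketch, which assumes a hypothetical `accounts` table in SQLite, a transfer either applies both updates or neither.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move an amount between accounts; both updates apply or neither does."""
    try:
        # Used as a context manager, the connection commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False
```

A Base-style system would typically relax guarantees like this one in exchange for availability and speed, accepting that different copies of the data may briefly disagree.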