Choosing a platform to manage the big data mix

Big data requires a change in both mindset and technology as businesses seek to analyse various forms of data from myriad sources.

It seems that 2012 was the year of big data – at least as far as the main software and hardware suppliers were concerned.

Nearly every supplier in the market had something that could be marketed as part of a solution for big data.

This raft of misplaced technologies has led to misconceptions about what is really needed for organisations to draw value from escalating volumes of data.

But 2013 could be the year of big data rationality. Systems now being offered by many suppliers appear to have been built to deal with the many facets of the big data problem, rather than being a mish-mash of existing offerings quickly bundled together to hit the market at the same time as everyone else.

Challenges of big data

So, what constitutes a big data issue? What springs to mind for many is “volume”. Surely, if I have lots of data, then this is a big data issue? Well, it could be – or it may not be. It may just be a “lot of data” issue – which only needs a scalable database from the likes of Oracle, IBM or Microsoft with a good business analytics solution from these or independent suppliers, such as the SAS Institute, layered on top of it. No, size is not everything. Big data brings with it a lot more variables, in the guise of a series of “Vs”.

First, the variety of the information sources has to be taken into account. No longer can a stream of information just be regarded as a series of ones and zeros with a belief that all data is equal. Not everything will be held within a formal database. Files held in office automation formats, information scraped from search engines crawling the web, and other data sources all need to be included in what is used as the basic raw materials. IBM, Oracle, Teradata and EMC are all now building systems that incorporate a mix of technologies into single boxes that can deal with mixed data – or can offer matched systems that ensure that each type of data is managed and analysed in an optimised and coherent manner.

The growing use of images, voice, video and non-standard data sources, such as real-time production line data, smart-building and environmental systems, means data has to be dealt with in context. This requires the use of networking equipment that can deal with prioritised data streams and is capable of analysing such streams either at line speed or by copying enough of the stream and dealing with it as required outside of the live stream. The majority of networking equipment from the likes of Cisco, Juniper, Dell and HP will be able to deal with IEEE 802.1p/Q priority and quality of service settings, and externally with packets labelled using multiprotocol label switching (MPLS).
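To make the idea of a prioritised stream concrete, the sketch below reads the priority marking from an Ethernet frame carrying an IEEE 802.1Q tag. It is illustrative only: the function name parse_vlan_tag and the sample frame are assumptions made for this example, and real switches do this in hardware rather than in Python.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def parse_vlan_tag(frame: bytes):
    """Return the 802.1Q tag fields of an Ethernet frame, or None if untagged.

    Hypothetical helper for illustration. Bytes 0-11 hold the destination
    and source MAC addresses; a tagged frame carries the TPID in bytes
    12-13 and the tag control information (TCI) in bytes 14-15.
    """
    if len(frame) < 16:
        return None
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    return {
        "priority": tci >> 13,        # 3-bit priority code point (802.1p)
        "dei": (tci >> 12) & 0x1,     # 1-bit drop eligible indicator
        "vlan_id": tci & 0x0FFF,      # 12-bit VLAN identifier
    }


# Made-up frame: zeroed MAC addresses, priority 5, VLAN 100, IPv4 EtherType.
sample_frame = bytes(12) + struct.pack("!HH", TPID_8021Q, (5 << 13) | 100) + b"\x08\x00"
tag = parse_vlan_tag(sample_frame)
```

A monitoring tool working on a copy of the stream could use the priority field in this way to decide which traffic to analyse first.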

Next is velocity. A retail operation needing to analyse transactions on a daily basis to organise its supply operations for the following day does not need real-time data analysis. However, an investment bank making investment decisions against incoming live data on rapidly changing information such as commodities pricing will need analysis to be carried out as near to real time as possible. Similarly, government security systems, online fraud detection and anti-malware systems need large amounts of data to be analysed in near real time to be effective. This is where the latest systems from the likes of IBM with PureData, Oracle with Exadata, and Teradata are providing solutions designed to deal with masses of data in real time.

Data accuracy

The veracity of the data also has to be considered. This has two components. One is the quality of the data under the control of the organisation. Information that is specific about a person – names, addresses, telephone numbers – can be dealt with through data cleansing from companies such as UKChanges, PCA or Equifax; other information, such as mapping data, can be dealt with through cloud services such as Google or Bing Maps – thus outsourcing the problems of ensuring that data is accurate and up to date to organisations that can afford to be experts in the field.

The other aspect of veracity is around data that is not under the direct control of the organisation. Information drawn from external sources needs to be evaluated to determine the level of trust that can be put in the source. For named sources, such as those mentioned above, trust can be explicit. For other sources, cross-referencing may be required to see how many other people have quoted the source, whether the individual or organisation associated with the information is known and trusted by others, and so on. Here, the likes of Wolfram Alpha, LexisNexis, Reuters and others can provide corroborated information that can be regarded as more trustworthy than just direct internet traffic.

Next is value. Two things need to be considered here – the upstream value and the downstream value.

Whereas most organisations have focused on internal data, there is now a need to reach outside the organisation (upstream) to other data sources. For example, a pharmaceutical company researching a new chemical or molecular entity needs to keep an eye on what its competitors are up to by monitoring the web, but it must also filter out all the crank stuff that has little bearing on what it is doing. The downstream side of the value is explicit – there is little point in providing analysis of data to a person if it is of little use to them.

Merging relational and NoSQL

Overall, big data is not just a change in mindset. It also requires a change in the technologies used to deal with the different types of data in play and the way they are analysed and reported. The archetypal databases from Oracle, IBM, Microsoft and others are currently useful for dealing with structured data held in rows and columns, but struggle with less structured data that they have to hold as binary large objects – or blobs. The rise of the NoSQL, schema-less database market, exemplified through the likes of 10gen with MongoDB, Couchbase, Cassandra and others, is showing how less structured data can be held in a manner that makes it easier for the data to be analysed and reported on. It is in pulling such technologies together where IBM, Teradata, EMC and others are beginning to create systems that are true big data engines.

However, there is still the need to combine the structured database and the less structured systems together and make sure that all the various data ends up in the right place. This is where something like Hadoop tends to fit in – using MapReduce, it can act as a filter against incoming data streams and the outputs can then be placed in either a structured or unstructured data store. For those who are heavily involved with SAP, HANA can be used in much the same manner.
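The filtering role described above can be sketched in a few lines of plain Python. This is not Hadoop code and uses no Hadoop API; it only mimics the map and reduce steps in memory. The REQUIRED_FIELDS schema, the function names and the sample records are all assumptions invented for the illustration.

```python
import json
from collections import defaultdict

# Hypothetical schema: a record with these fields is treated as structured.
REQUIRED_FIELDS = {"id", "amount"}


def map_record(raw: str):
    """Map step: emit a (destination, payload) pair for one incoming record."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Free text that is not valid JSON goes to the unstructured store.
        return ("unstructured", raw)
    if isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed):
        return ("structured", parsed)
    # Valid JSON that does not match the schema is kept in its raw form.
    return ("unstructured", raw)


def reduce_by_destination(pairs):
    """Reduce step: collect the payloads destined for each store."""
    stores = defaultdict(list)
    for destination, payload in pairs:
        stores[destination].append(payload)
    return dict(stores)


# Made-up incoming stream mixing schema-conforming and free-form records.
incoming = [
    '{"id": 1, "amount": 9.99}',
    "customer called about late delivery",
    '{"note": "no id here"}',
]
routed = reduce_by_destination(map_record(record) for record in incoming)
```

In a real deployment the two lists in routed would be written to a relational database and a NoSQL store respectively, and the map step would run in parallel across many nodes.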

Over the top of the data infrastructure has to be the analysis and reporting capability. Although those selling the hardware and databases tend to have their own systems – for example, Oracle has Hyperion, IBM has Cognos and SPSS, and SAP has Business Objects – there are plenty of choices outside of these suppliers. SAS Institute remains the independent 800-pound gorilla, but newer incomers such as QlikTech, Birst, Panopticon, Pentaho and Splunk are showing great promise in being able to provide deep insights across mixed data sources.

Although 2012 was filled with big data hype, that does not mean big data is unimportant. The capability to effectively analyse a broader mix of data in a manner that enables true knowledge to be extracted will be a powerful driver for future success in businesses. It is better to start planning for an effective platform for this now, rather than waiting and watching your competition beat you to it.