It appears to make sense to take a single, well-architected approach to dealing with big data.
A specifically built mix of hardware and software should be better than a hand-built collection of bits cobbled together, surely?
This is the logic that has been used by many of the incumbents in the data management space.
For example, Oracle has built on its Exadata engineered systems to create what it calls the Oracle Big Data Appliance, which combines its Sun hardware with a range of software approaches to deal with different types of data in one appliance.
IBM has taken a similar approach after its acquisition of Netezza, creating a set of appliances it calls PureData.
Elsewhere, Dell also has a range of big data appliances, as does HP; Teradata acquired Aster and then launched its Integrated Big Data Platform; Hitachi Data Systems has its Hyper Scale-Out Platform; EMC has its Data Computing Appliance; and DataDirect Networks has its nattily named SFA12K Big Data Appliances.
There are many different ways to carry out big data analysis – build it yourself and big data as a service are just a couple – but these are full of issues suppliers are trying to help you avoid. It seems that an appliance approach to big data is all the rage, but is it as simple a choice as it seems?
To dig deeper, it is first necessary to understand what big data is really about.
The five “v”s of big data
Too often, big data is still seen as being about volume only. However, that is more an issue of having a lot of data than of big data; volume is just one of the five “v”s of big data.
To understand the issues that big data brings to the fore, it is necessary to look at the other “v”s that create problems and offer opportunity in the big data world.
As mentioned, there is the volume of data being dealt with. However, if all of this is formal, structured data, then a standard database with an adequate scale-out compute, storage and network platform should be sufficient.
The problems really start when you look at data variety – the mix of structured and less structured data types that need to be dealt with. Most data has some level of structure, whether it is the formatting of the container for a Microsoft Word file, the comma delimiting of machine-to-machine data, or the headers for image, video or audio data.
Then there is velocity, and this has two aspects. The first is the speed at which data is presented to the analysis environment. For example, real-time analytics dealing with internet of things data will often need to deal with small packages of data coming through in large numbers, with no human latency to slow things down.
The second is the speed at which the results of the analysis are required. For example, in financial trading, the person acting on the results downstream is looking to shave milliseconds off the time it takes to get those results compared with other traders. Production lines need to be able to identify a problem before it becomes an issue, enabling action to be taken so the line can continue operating, rather than being taken offline.
Veracity is also key. Analysis of poor-quality data will result in poor-quality output. Therefore, any big data system must be able either to check the quality of the data it is analysing, or to trust the upstream data sources.
The last “v” is value. Actually, as this is the business driver behind any big data activity, it should really be the first “v”. The decision to carry out big data analysis has to be built on the value the business will get out of the results. Is it really worth carrying out this analysis? What real impact will this have on the activity and success of the business? In some cases, Quocirca has seen big data analysis being carried out because it “seemed like a good idea” – but there need to be solid business reasons behind why IT resources are being used.
Therefore, any supplier touting a big data system to your organisation should have messaging against each of these “v”s.
Taking all data and pushing it into a relational database, with less structured data being forced in as binary large objects, is not the way to deal with big data. Similarly, those in their ivory towers who say the days of relational databases are over and that everything can now go into either a persistent Hadoop store or a NoSQL database are also – at this point – wrong.
However, taking a disconnected approach of specialised, disparate data systems will also not work. For example, having a non-persistent Hadoop system for data reduction using MapReduce with separate relational and non-relational persistent stores will result in an inability to deal with the requirements of big data velocity.
A single approach to analytics
For true big data analytics to be possible, the “v”s need to be dealt with and the data brought together in such a way that a single approach to the actual business analytics can be applied. This is where the appliance approach comes into its own. By taking a Hadoop environment and mixing it with relational and non-relational data stores in the same appliance, intelligence can be built into the overall system to ensure the right data resides in the right store at the right time. The required layers of analytics can be optimised to ensure that performance is fit for purpose. This is the battleground on which all of the aforementioned suppliers are fighting.
However, there are still areas that anyone considering purchasing a big data appliance needs to be aware of. For most organisations, big data will involve high volumes of data. To provide the desired velocity of analysis, the majority of big data appliances will have large amounts of memory in them, to enable in-memory analytics to take place. Therefore, ensuring there is enough memory in the appliance is a key purchasing consideration. The appliance will also need to be expandable: having too little memory in place on delivery will result in a slower than expected system, as data then has to be swapped in and out from lower-speed storage systems.
Watch out for appliances that are based purely on spinning magnetic disk. With the advent of solid-state storage, the speed of retrieving data from persistent storage has increased massively – but is still well below that of an in-memory system. Systems using solid-state storage will be much faster than those using magnetic disk.
Also, beware of hybrid systems where there is a mix of a top tier of solid state and a lower tier of magnetic disk storage. Unless there is intelligent software managing where the data resides at any one time, there could be major performance issues when the analytics system tries to get data from memory, sees it isn’t there, drops down to solid state, finds it isn’t there either and has to drop down to magnetic disk and pull the data from there into memory.
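The cost of that cascade of misses can be made concrete with a minimal Python sketch. Everything in it is an assumption for illustration: the `TieredStore` class, its method names and the latency figures are hypothetical and do not describe any supplier's appliance. It models memory as a small cache sitting over solid-state and magnetic disk tiers, and adds up the notional cost of each tier a read has to touch.

```python
from collections import OrderedDict

# Assumed, purely illustrative access costs in microseconds (not measurements).
TIER_COST_US = {"memory": 0.1, "ssd": 100.0, "disk": 10_000.0}


class TieredStore:
    """Toy three-tier store: memory cache over solid state over magnetic disk."""

    def __init__(self, memory_slots):
        self.memory = OrderedDict()  # small, fast tier with LRU eviction
        self.memory_slots = memory_slots
        self.ssd = {}                # middle tier
        self.disk = {}               # slowest tier

    def read(self, key):
        """Return (value, cost_us), paying for every tier that is checked."""
        cost = TIER_COST_US["memory"]
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as most recently used
            return self.memory[key], cost
        cost += TIER_COST_US["ssd"]
        if key in self.ssd:
            value = self.ssd[key]
        else:
            cost += TIER_COST_US["disk"]
            value = self.disk[key]        # raises KeyError if stored nowhere
        self._promote(key, value)
        return value, cost

    def _promote(self, key, value):
        """Copy a value into memory, evicting the least recently used entry."""
        self.memory[key] = value
        if len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)


store = TieredStore(memory_slots=2)
store.disk["cold"] = "archived trades"
store.ssd["warm"] = "recent trades"

_, first_cold = store.read("cold")   # misses memory and solid state, hits disk
_, second_cold = store.read("cold")  # now served from memory
print(first_cold, second_cold)
```

With these assumed figures, the first read of a cold item costs several orders of magnitude more than the repeat read. That gap is what data placement software in a hybrid system would need to manage, for instance by promoting data before the analytics engine asks for it.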
Look to the future
Look for systems that bring together Hadoop, NoSQL and a relational approach. However, also look to the future. For a long time, Quocirca advised against using Hadoop as a persistent store, recommending instead that its MapReduce capabilities be used as a data filter to reduce the amount of data being analysed in any environment.
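The data filter role described above can be sketched in a few lines of plain Python. This is an in-process illustration of the map, shuffle and reduce steps only; it does not use Hadoop or its APIs, and the sensor readings, the 80.0 threshold and all function names are hypothetical. The map step discards readings below the threshold and the reduce step keeps one peak value per sensor, so far less data is passed downstream for analysis.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding zero or more (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group mapped values by key, as the shuffle step of MapReduce does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's list of values down to a single value."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def over_threshold(record, threshold=80.0):
    """Hypothetical mapper: emit only readings above an assumed threshold."""
    sensor_id, reading = record
    if reading > threshold:
        yield sensor_id, reading


# Hypothetical raw feed of (sensor_id, reading) pairs.
readings = [
    ("s1", 71.0), ("s1", 93.5), ("s2", 64.2),
    ("s2", 88.1), ("s1", 97.0), ("s3", 40.0),
]

# Six raw readings are reduced to one peak value per sensor that exceeded the threshold.
peaks = map_reduce(readings, over_threshold, max)
print(peaks)
```

In a real deployment the same mapper and reducer logic would run in parallel across a cluster. The sketch only shows why the output is much smaller than the input.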
MapR is leading the Apache Drill initiative and Hortonworks has its Hive Stinger programme, both of which show promise in enabling SQL queries to be run against a Hadoop store. Suppliers such as IBM and Actian – with Vortex – have commercial Hadoop-SQL products that deal with some of the speed issues that are currently a problem with Hadoop as a persistent store. At the NoSQL end of the data stores, Basho is taking a different approach to many others. By enabling a mesh of its Riak NoSQL database nodes, each dealing with different aspects of big data, it is hoping to create the “one ring to rule them all”: a database that can deal with data reduction against the variety of different data types at speed.
Finally, look for systems that do not tie you into a specific way of working. Skills already built up in the use of existing business intelligence (BI) systems should not have to be thrown out and new skills learned – the big data system chosen should enable existing BI tools to be layered over it. The world of big data analysis is still at a relatively immature level. A build-it-yourself approach is unlikely to provide the return on investment required, while a specialised appliance may only solve the problem for a short while. Choose an appliance carefully – ensure that the value to the business is sufficient to warrant the expenditure.
Clive Longbottom is founder of analyst company Quocirca.