Size matters, in the big data cloud at least

As it’s the 4th of July today, I am anticipating a slightly slower news day for those of us in technology. Not that America runs the world, you understand, but I think a day off for Silicon Valley will have some impact.

Given this reality, I want to revisit a subject I recently covered at feature level with some additional comments from companies that did not fit into my first draft.

The subject? Big data in the cloud.

But what is big data, and is it just marketing spin?

Well, it is spin to a degree, but we use the term to refer to datasets that have been collated and collected into large “lumps” of perhaps terabytes (or even petabytes) of potentially dynamic, fast-moving data.

Crucially, it is the tools, processes and procedures around the data itself that define what big data is.

So what does the market think we should do to cope with data management at this level as it now, either logically or inevitably, takes up residence in the cloud?


“Businesses wanting to build big data stacks in the cloud need to make sure that they take the time to assess their options before choosing the technology that they are going to build on. There are a number of proprietary and open source tools out there for the taking, but picking the right one is not necessarily an easy task,” said Jason Reid, CTO of hosted and managed data centre-based company Carrenza.

“Some IT vendors welcome increasing data volumes. But despite what storage vendors may have you believe, you can’t just keep throwing servers at your exponentially expanding data assets. Not all data is born equal and the importance of different types of data is far from constant; whereas today’s data might need to be replicated and recoverable in seconds, the chances are that last week’s data is less critical and can be stored on a cheaper medium,” said Keith Tilley, managing director UK and executive vice president Europe for SunGard Availability Services.
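
Tilley's point about data losing importance as it ages can be illustrated with a minimal Python sketch of age-based storage tiering. The tier names, the age thresholds and the `choose_tier` function are all assumptions made purely for illustration; nothing here reflects a specific vendor's product.

```python
from datetime import datetime, timedelta

# Hypothetical tiers and age limits, chosen only to illustrate the idea.
# Each entry is (maximum age, tier name), ordered from hottest to coldest.
TIERS = [
    (timedelta(days=1), "replicated"),  # today's data: replicated, fast recovery
    (timedelta(days=7), "standard"),    # this week's data: ordinary storage
]
ARCHIVE_TIER = "archive"                # anything older: cheaper medium


def choose_tier(created_at: datetime, now: datetime) -> str:
    """Return the (hypothetical) storage tier for data created at created_at."""
    age = now - created_at
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return ARCHIVE_TIER
```

In practice the thresholds would come from business recovery requirements rather than fixed constants, but the shape of the decision is the same: the older the data, the cheaper the medium it lands on.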

“Big data analytical queries will create a new set of workload management problems for IT. This workload will be small to begin with (users submitting queries to run single reports), but will soon expand into a massive amount of requests (applications generating queries automatically to generate trends or continuously looking for patterns),” said Ken Hertzler, VP, product management & product marketing, Platform Computing.

Hertzler continued, “Whether the data sits in a cloud, or internally in a data center, workload scheduling and management of the MapReduce requests is not a trivial matter to solve. Users will expect results in guaranteed response times, high availability, multiple simultaneous requests on the same data sets, flexibility, and a host of other requirements that ensure the results are accurate and on time.”

“At the scale of big data, organising and arranging masses of information so it’s easy to analyse becomes a herculean task in itself: if you wait for data to be organised any insight you gain could well be out of date. The current generation of BI and analytics tools allow ‘train of thought’ analysis: querying unstructured stores of data in minutes or even seconds as and when needed, rather than in hours or days by appointment. As a result, organisations need to make sure that either their service providers can guarantee this level of access or that their internal cloud projects are using suitable technology. Otherwise, big data will only ever yield old news,” said Roger Llewellyn, CEO of data warehousing, business intelligence and analytics company Kognitio.

I could drop a conclusion in here, but this market is still nascent, as I keep saying, so let’s keep the communication channels open and revisit this topic soon.