HP Autonomy executive: big data transcends analytics

Robert Youngjohns, executive vice-president at HP, speaks to Computer Weekly about why big data does not reduce to analytics

Robert Youngjohns is an executive vice-president at HP, and headed the supplier’s Autonomy business until recently. His career includes stints at Microsoft and IBM, and he is a British transplant to the US, educated at Oxford in Physics and Philosophy in the mid-1970s. On a visit to London he briefed Computer Weekly on HP’s strategy for Autonomy. What follows is an edited version of that interview.

Q. Let me start by asking a very broad question. Imagine you're sitting in front of a British CIO. What's the first thing you want to put across now about HP Autonomy?

A. It would really be what we do. I focus on what we do now and the value we bring to our customers now. Our story is very much around how we help customers with almost every aspect of what we see as the big data problem. For us, big data is far more than just analytics. Many people think the two are one and the same, but they're not.

It's much more about how CIOs cope with the explosion of information that's going on in their enterprise. This has an impact on everything they do, including the way they do backup and recovery. I know that is a highly prosaic subject, and one that if I talked about it in a presentation would send people to sleep, but there you have it.

When you're drowning under information, how do you decide what to back up? How do you decide what you have to keep for regulatory reasons? How do you decide how you're going to find anything once you've put it into one of these great enterprise black holes like Microsoft SharePoint? The whole big data problem for us is far more than just how you do analytics. It's how you manage the totality of that life-cycle of information within the enterprise.

Q. Is the really new capability here Autonomy plus columnar database Vertica, acquired by HP in 2011? Is that the thing that's special?

A. Vertica is part of it. We tend to break the data problem down into three categories. There's what we call "machine data", which is typically highly structured and vast in quantity, such as log files, sensor data or clickstream data.

The second category is what we think of as "business information". This is the classic data that exists in the enterprise – the general ledger, the customer management system, the billing system, and all that sort of stuff. It's a significant source of data in the enterprise.

The third category, which is actually the biggest, is what we think of as "human information", which is really everything from email to text to voice to video and so on. We try to break the big data problem down to those three categories.

Now, if you think about the first category – machine data – when you're trying to read vast files of relatively structured records, traditional relational tools are pretty inefficient. So we have Vertica, a columnar database, which looks at the whole thing in a different way and therefore allows you to do analytics on those highly structured vast datasets. 

Business information is the classic place where Oracle and SAP and others play. That's fine. They'll continue to play there.

Then human information, which we think may account for as much as 90% of the data in some enterprises, needs a completely different approach. That's where the Autonomy product Idol [intelligent data operating layer] fits. So with the combination of Vertica and Idol we can cover the two fastest emerging parts of the big data problem.

Q. Broadly speaking, how do you see enterprise search on one hand and business intelligence and analytics on the other? How do you see those technologies evolving separately or together or in convergence?

A. Enterprise search is one of the most interesting challenges in the enterprise because everybody thinks it's a commodity. Everybody thinks you put in an appliance and somehow you do search. The problem you have in the enterprise is that the search techniques that work on the public internet don't actually translate that well into enterprise search. The reason is twofold.

Firstly, in the public internet, with Google or Bing, you have a natural feedback loop. As you put in your search element, you get responses. People click responses and from the responses they click they can immediately validate the accuracy or otherwise of the search. 

That doesn't exist in the enterprise, because typically the volume isn't enough so you don't have that feedback loop. One of the things you have to think about is how to create a synthetic feedback loop that gives better search relevance when you don't get that constant check and balance on what's going on.

The second thing about public search tools is that if they miss things it's not normally the end of the world. You just put the search in again or you reconstitute the search. In enterprise search you typically want more complete answers and you want two searches to come back the same both times. We've recently been going through an upgrade with a client in the US, and after the upgrade they ran through some of the same searches as before the upgrade.

They came back saying, "In this chain we had 10,463,000 documents, and after the second search we did after the upgrade we had 23 fewer documents. You need to help us find the 23 documents." In the Google world you really wouldn't care. You would just search again.

A lot of people put in a Google search appliance and expect it to work and it's okay, but one of my personal beefs is the lack of adequacy in most enterprise search tools. I find that intranets and SharePoint are places where data often goes never to be seen again, unless someone can actually give you the link. It's been a common problem in every business I've been in.

I use the story from one of the Indiana Jones movies, right at the end of Raiders of the Lost Ark, where the guy is pushing the wooden packing case into the vast door and you know that afterwards the only person who'll know where it went is the person who pushed it in. The chances are that person has probably left the company. Enterprise search, to me, feels like that. People put stuff in and they're the only people who know where it is. Then no-one else can find it.

As for business intelligence (BI), the reality of real BI in the enterprise is it happens at the practitioner level. So it's the person in the finance department, or the person in sales or operations – they're the people who want to run the reports. 

A lot of people, when they think BI, tend to think about what happens in the mythical C-Suite. The CEO with the big dashboards around the office who says, "Wow. Look. There's a problem with my plant in Poland." That's like stuff from the movies in my view. The reality is most BI is practitioner-led and they value things like the ability to iterate quickly and to do queries again quickly, more so than the cosmetics of how it is presented.

Q. Let’s take the underlying approach of meaning-based computing in Autonomy’s technology since the 1990s. How unique is that? You've been around a while, so you must have seen technologies that are quite similar?

A. The approach has evolved a lot since then. I still use the term meaning-based computing, but I don't think of it that way. I think of what we're doing as building a portfolio of products to help our customers manage big data from end to end, and underlying those products is our core technology called Idol. One of Idol's capabilities is that it allows analysis of not just structured information, but more importantly the unstructured information that lies in the enterprise.

Our lead story is as much, or more, about the applications that we have built on top of Idol, with Idol as our underlying technology. Customers care about things that bring immediate value to them.

Q. Can you explain Idol on demand? What does that mean? What form does it take?

A. I'm very pleased with the team in terms of the energy they have put into that and where we've got it to. Idol had about 200-400 built-in analytic functions. They might have been categorisation or sentiment or language detection and so on. To get access to these you had to take the whole Idol thing. It was quite complex to install and quite complex to set up, unsurprisingly.

We decided to look at whether we could break all of these functions out and make them available individually over the web as web services. That's what we set out to do. The team started working on that six or seven months ago, and right now we have what we call an early access program available online, where anyone who is interested can play with some of these functions.

Q. What is your vision for HP Autonomy over the next year and beyond?

A. It's broadening the story and trying to get through to people that big data is not the same as analytics and that big data is a far broader problem than analytics. A lot of the practical issues that CIOs deal with every day from big data have nothing to do with analytics. 

They are associated with things I talked about, like how you decide what to back up.

Q. If you consider it an enduring misperception that big data is the same as analytics, when it is broader than it is conventionally defined as being, what do you think explains that misconstruction in the first place?

A. I think it's because of the compartmentalisation of those functions in the CIO's organisation.

It's that compartmentalisation that has led to a lot of stovepipe decisions. If you step back as a CIO and say, "What's really going on here? I'm drowning. I've got to start working out what I'm going to do with all of this data before it really stops me from having the ability to back this up, or back up my datacentre, or have effective file-sharing, or whatever," then you have to take a different view of it.
