Make sense of unstructured data

With an estimated 80% to 90% of corporate data held in the form of e-mail and Word documents - and increasingly voice and video files - much vital information is beyond traditional methods of data analysis and retrieval.

"As we know, there are known knowns. There are things we know we know. We also know that there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, ones we do not know we do not know."

When former US secretary of defense Donald Rumsfeld gave his famous "known knowns and unknown unknowns" speech, he left out one configuration: unknown knowns.

Yet that, in a nutshell, is the unstructured data problem that many companies are facing these days: the things we do not know we know. With an estimated 80% to 90% of corporate data held in the form of e-mail and Word documents - and increasingly voice and video files - much vital information is beyond traditional methods of data analysis and retrieval.

"In that 80% or 90% is some really important stuff," says Ovum senior analyst Mike Davies. "There is a growing realisation in organisations that the information they hold is either an asset or, more importantly, a liability."

A number of issues have converged to bring unstructured data to the top of the agenda. A key one is the loss of people who have traditionally held the corporate memory, as the baby-boomer generation reaches retirement age.

"In the States they are now having to interview the people who built the nuclear reactors," Davies says. "Much of that information could have been written down, sitting in that unstructured data."

The proliferation of compliance and regulatory regimes such as the Markets in Financial Instruments Directive in Europe and the Federal Real Property Council in the US has also forced organisations to come to terms with the threat that not having a handle on unstructured data poses.

"Not being able to find an e-mail is not a get-out-of-jail-free card any more," says Costi Perricos, a director in Deloitte's consulting practice.

"You need to be able to categorically prove it was not sent. Lots of people file their inboxes, but how many people file their outbox? Yet in a client-service environment, your outbox is more important - if we give advice to a client, that is the advice we need to find."

These problems multiply in organisations where even the structured data is scattered among different systems. Take BAE Systems, Europe's largest aerospace company and the product of years of mergers and acquisitions.

"We have inherited a whole bunch of information systems and network drives that will not talk to each other and never will," says Richard West, head of organisational and e-learning at BAE.

"There are no common tagging or data-retention or versioning policies - how do we find key information from a knowledge-management and compliance perspective among hundreds of terabytes of unstructured information?"

BAE has adopted enterprise search software from Autonomy. The software not only sits on top of multiple information systems but is also able to burrow beneath the incompatible metadata attached to documents to uncover their meaning. For BAE this was vital, as it seeks to share best practices across the organisation without limiting people to their personal networks in their search for advice and expertise.

"People finding people is the key to me," says West. "I do not care so much about the documents themselves, but a document tells you who its owner is." Because the search is enterprise-wide, it can also pull in competency information from human-resources systems and search across discussion forums and wikis to find people who are the "hotspots" of thinking on particular topics.

Another feature of the system borrows the product recommendation and profiling techniques of online retailers to create "learner profiles", which can link individuals performing similar searches.

The increasing familiarity of users with sites such as Facebook has helped West sell the system.

"When you talk about knowledge management, it goes over people's heads," he says. "A few years ago, if you talked to a bunch of engineers about this, they might have been worried it would be used against them. Now they are all doing this stuff outside on the internet and we can explain that we can do something similar internally."

Sharing best practices

Benchmarking the implementation of enterprise search suggests it has achieved a 90% improvement in access to information. But the prize for West is the ability to re-use and share best practice across projects.

"There have been a lot of false dawns of this sort of stuff," he says. "But the organisations that have got a handle on it are achieving huge competitive advantage."

That intuition is backed up by formal research such as consultancy firm Accenture's "High Performance" project.

"The message from clients is clear," says Stephen Gallagher, global director of analytics in Accenture's Information Management Services practice. "High performing companies stand out from other companies in their use of analytics to a higher level to be competitive."

Unfortunately, matching this is not simply a question of acquiring the right technology. "There are two problems," says Gallagher. "One is that the software is quite immature and relatively difficult to use. The other is to find enough people to understand the analytics."

On the first point, Gallagher points out that smaller suppliers of unstructured data analysis tools are rapidly being bought up and integrated by big players such as IBM and Oracle.

The people issue is thornier, but Gallagher notes Gartner's predictions that companies will increasingly form business intelligence competency centres to bring together scarce analytic skills scattered across their organisations.

"Clients are recognising a market need to consolidate their skills," Gallagher says. "The guy in the accounts department can use the same analytics skills he uses to detect fraud to do customer analysis."

Customer analysis using structured data has been a staple of business intelligence for years. Unstructured data contains its own gems, but it is when the two are combined that real benefits flow.

"It is not only about extracting keywords, or even concepts, from unstructured data; it is about extracting sentiments, either positive or negative," says Olivier Jouve, vice-president of product marketing at data and text-mining specialist SPSS.

"You always call a call centre to complain about something, but it may not mean you are about to churn. You could call every week to complain, so what really makes sense is to combine structured and unstructured data so we know who you are."

This approach allows companies to tackle unhappy customers on two levels.

For example, Swiss cable operator Cablecom used data and text mining to analyse the free-text responses of customers who said in a survey that they were unlikely to recommend the company to others.

The output not only allowed those customers to be tackled - and turned round - on a case-by-case basis, but also provided the basis for more wide-ranging changes that would affect every customer.

But increasingly companies are looking outside their own stores of data for customer insight.

"More and more people use data and text mining to look at external sources of data, such as forums, blogs and wikis, where people talk very freely," Jouve says. "They use it to understand better how they are seen in the market."

The key difference between this approach and traditional business intelligence is the predictive nature of the information, as Mike Lynch, Autonomy's CEO, explains.

"It is about trying to pick up what is 'over the horizon', the things that are about to become very important," he says. "For example, trying to work out that the US sub-prime mortgage market was about to become important before everybody else realised."
