In search of a better way to look for data

Search engine Google has enjoyed a successful flotation and its recently released desktop search tool is receiving lots of attention. But the success of the tool is not purely down to the significance of the Google brand.

The software catalogues files and e-mails, integrating them into its internet search so that a user can find local data through the traditional Google browser interface. It addresses a basic design flaw in the hierarchical file storage system: searching it is not intuitive.
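Cataloguing files so that they can be found by content rather than by folder is commonly done with an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch of that general technique, not a description of how Google's tool works internally; the file paths and contents are hypothetical sample data.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample data standing in for files and e-mails on a PC.
documents = {
    "C:/projects/wing/report.txt": "Quarterly wing assembly report",
    "C:/mail/inbox/0042.eml": "Please review the wing report before Friday",
    "C:/personal/holiday.txt": "Flights booked for Friday",
}

index = build_index(documents)
print(sorted(search(index, "wing report")))
```

The query matches the report and the e-mail even though they sit in unrelated folders, which is the point of searching by content instead of walking the hierarchy.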

Hierarchical file storage was fine 20 years ago, when PCs could only hold minimal amounts of textual data. Now, with the amount of stored information growing by about 30% each year, according to the School of Information Management and Systems at the University of California, Berkeley, information is becoming more difficult to store. Users can buy more than a terabyte of storage for £600, and the diversity of file types it can hold has grown dramatically. Trawling through this information folder by folder is becoming less feasible.

"No one has come up with a good alternative yet," said Angela Ashenden, senior analyst at Ovum. "The file system has the same problem that you have with any legacy system. Everything is on it, and moving it into a new system is very difficult."

Microsoft, which has been working on object-oriented storage for the past decade, introduced the idea of the WinFS storage system a year ago as part of Longhorn, the next Windows release. In October it removed WinFS from the system, promising to ship it as a separate component at a later date. WinFS will use XML to create schemas that hook different files together according to different criteria. Microsoft hopes it will deliver Bill Gates' dream of a unified storage system in which access to files is more "intelligent".

Being able to link files together by, say, person, event, or subject at the file system level will make computers easier to use. It is part of a major shift in the way we store files which is long overdue, according to Hamish Macarthur, founder of IT analyst firm Macarthur Stroud. "The question up until now has been the extent to which you look at a file system as simply recording a file, versus the file system being a database environment," he said.

"Now, we are looking at a situation where users will not differentiate between them in terms of the infrastructure they are using."

Closing the gap between flat file storage and database-like file services within the file system is what Microsoft has been trying to do with WinFS, which will share some of the same underlying source code as the SQL Server database. But with the first commercially usable version of WinFS unlikely to appear before 2007, users will have to grapple with traditional hierarchical storage until then.
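The idea of a file system that behaves like a database, linking files by person, event or subject, can be illustrated with an ordinary relational store. The sketch below uses Python's built-in sqlite3 module; the table names, columns and sample rows are hypothetical and are not WinFS's schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE links (
        file_id INTEGER NOT NULL REFERENCES files(id),
        kind    TEXT NOT NULL,   -- e.g. 'person', 'event', 'subject'
        value   TEXT NOT NULL
    );
""")

# Hypothetical files, each stored in a different folder.
files = [
    (1, "/docs/2004/budget.xls"),
    (2, "/mail/sent/0017.eml"),
    (3, "/photos/launch/img_01.jpg"),
]
# Hypothetical relationships attached to those files.
links = [
    (1, "person", "A. Smith"),
    (1, "subject", "budget"),
    (2, "person", "A. Smith"),
    (2, "event", "product launch"),
    (3, "event", "product launch"),
]
conn.executemany("INSERT INTO files VALUES (?, ?)", files)
conn.executemany("INSERT INTO links VALUES (?, ?, ?)", links)


def find_files(kind, value):
    """Return the paths of all files linked to the given person, event or subject."""
    rows = conn.execute(
        "SELECT f.path FROM files f JOIN links l ON l.file_id = f.id "
        "WHERE l.kind = ? AND l.value = ? ORDER BY f.path",
        (kind, value),
    )
    return [path for (path,) in rows]


print(find_files("person", "A. Smith"))
print(find_files("event", "product launch"))
```

Each query cuts across the folder hierarchy: the same e-mail turns up under both a person and an event without being copied or moved.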

Hence the immediate grassroots success of the Google desktop system, which joins other similar tools, such as Microsoft's Lookout, X1 and Blinkx. However, all of these tools have one drawback in common - they are restricted to searching the desktop, which limits their usefulness in an enterprise environment.

Alan Paul, head of ICT and security at Marshall Aerospace, said the server level was where an alternative to the hierarchical file storage system was really needed. He maintains thousands of hierarchical folders for different projects, with different access rights for more than 1,000 users.

"We need to maintain ordered and disciplined file systems on the network, but the more complex it is, the more difficult it is to find anything," he said. "We are seriously having to consider an enterprise search engine to sit on our intranet to look for files that we need."

Paul, like many IT directors, is realising that with no alternative file systems available, he needs to overlay something on top of the file system to make it more navigable. Various enterprise search tools exist for this purpose, produced by companies such as Verity, Autonomy, and Convera. ISDD, which has set itself up as a cheap competitor to Autonomy, uses Bayesian statistical analysis to mine large quantities of server-based data and return search results based on probability.

One downside is that the product is not designed to index data in real time, said ISDD managing director Sukhbir Sidhu. Instead, it uses batch indexing.
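Returning results "based on probability" can be illustrated with a simple application of Bayes' rule. The sketch below assumes a uniform prior over documents and add-one smoothed word frequencies, so the probability of a document given the query is proportional to the probability of the query given the document. It is a textbook simplification, not ISDD's or Autonomy's algorithm, and the document names and contents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(documents, query):
    """Rank documents by P(doc | query) under a uniform prior.

    Word probabilities use add-one smoothing, so a document that lacks
    a query word is penalised rather than ruled out entirely.
    """
    if not documents:
        return []
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    vocabulary = {word for c in counts.values() for word in c}
    query_words = tokenize(query)

    log_likelihoods = {}
    for name, c in counts.items():
        total = sum(c.values())
        log_likelihoods[name] = sum(
            math.log((c[word] + 1) / (total + len(vocabulary)))
            for word in query_words
        )

    # Normalise so the scores form a probability distribution over documents.
    peak = max(log_likelihoods.values())
    weights = {n: math.exp(v - peak) for n, v in log_likelihoods.items()}
    norm = sum(weights.values())
    return sorted(
        ((name, w / norm) for name, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical server-based documents.
documents = {
    "wing_report.txt": "wing assembly report wing tolerances",
    "canteen_menu.txt": "lunch menu for the canteen",
    "wing_memo.txt": "memo about the assembly schedule",
}

for name, probability in rank(documents, "wing assembly"):
    print(f"{name}: {probability:.2f}")
```

Every document receives a score, so the user sees a ranked list of likely matches instead of a strict yes-or-no filter. In this sketch all counts are recomputed on each call; a batch-indexed product would presumably compute them ahead of time.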

Iain Fletcher, European alliances manager at Convera, prefers other methods. "As data volumes get bigger, Bayesian analysis does not necessarily get better," he said, claiming that statistical analysis is better at returning generalised results. Instead, Fletcher relies on taxonomies, which can be likened to dictionaries of definitions that can be used to better reference data by exploring its meaning.
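One common way a taxonomy helps search is query expansion: a search for a broad term also matches documents that mention only its narrower terms. The sketch below shows that general idea under simple assumptions; the taxonomy, the document names and the function names are hypothetical and do not represent Convera's product.

```python
# Hypothetical taxonomy: each term maps to its narrower terms.
TAXONOMY = {
    "aircraft": ["aeroplane", "helicopter"],
    "aeroplane": ["airliner", "fighter"],
}


def expand(term, taxonomy):
    """Return the term together with every narrower term beneath it."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(taxonomy.get(current, []))
    return seen


def search(documents, term, taxonomy):
    """Return the names of documents that mention the term or any narrower term."""
    terms = expand(term.lower(), taxonomy)
    return sorted(
        name
        for name, text in documents.items()
        if terms & set(text.lower().split())
    )


# Hypothetical documents; none of them contains the word "aircraft".
documents = {
    "a.txt": "Helicopter maintenance schedule",
    "b.txt": "Airliner seating plan",
    "c.txt": "Canteen lunch menu",
}

print(search(documents, "aircraft", TAXONOMY))
```

A plain keyword search for "aircraft" would find nothing here, while the expanded search finds the helicopter and airliner documents. The quality of the results depends on the taxonomy being kept up to date.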

Taxonomies are key to semantic web technology, which is web inventor Tim Berners-Lee's brainchild. But Ian Black, managing director of Aungate, an Autonomy subsidiary that caters for the compliance market, is not convinced about the idea of attaching predefined meaning to data. He leans towards Bayesian-based pattern analysis, arguing that as key concepts change their meaning, taxonomies can become outdated and irrelevant.

"On 10 September 2001, the term 'ground zero' meant the centre of a nuclear disaster. A few days later, it took on a new meaning," he pointed out.

Software such as ISDD's allows the user to search both the enterprise and the local desktop at the same time, returning the information in a single screen. Although individual users can search their own desktops for their own data, this still leaves IT departments with a wealth of valuable data sitting on client desktops that they cannot see. Getting at server data is relatively easy, antiquated file structures notwithstanding. Getting at that desktop data over the network is much harder.

One method is simply to encourage all users to save their data to a network drive, said Des Lekerman, managing director of IT services company Eurodata Systems, which advised Marshall Aerospace on its strategy.

Another approach is agentless back-up, where server-based software grabs data from desktops, synchronising it without needing a software client at the desktop level. Companies such as Asigra, with its "televaulting" software, offer this facility.

Companies that do not want to pull desktop-based data into a central server still have some options. BEA and IBM have developed products designed to poll diverse data sources, pulling the data into a centralised view. BEA's Liquid Data will query desktop-based data that you point it at, and IBM's DB2 Information Integrator does the same.

IBM's director of information integration Nelson Mattos explained that IT managers could remotely define which folders are included in searches on a desktop-by-desktop basis.

It will be a long time before the majority of desktops are using object-oriented file systems in which users are shielded from the underlying hierarchical storage model by database-like features. It will be even longer before the majority of applications take advantage of such features. In the meantime, the industry will have to find alternative ways of retrieving data from ever larger file stores.

Users can only hope that, when Microsoft finally ships its unified storage system, it will deliver on the promises that it has been making for the past 10 years. For many, it will be long overdue.
