Search engines get cleverer

Researchers are poised to revolutionise internet search technologies over the next few years.

The most common thrust is to personalise search engines so that they know, for example, that an IT professional who searches for "mouse" is more likely to want information about a PC device than about a rodent.

Adele Howe, a computer science professor at Colorado State University, and Gabriel Somlo, a CSU graduate student, have built a proof of concept called QueryTracker, a software agent which sits between a user and a conventional search engine and looks for information of recurring interest, such as the latest news about a user's chronic illness.

QueryTracker submits a user's query to the search engine once a day and returns results from new web pages and pages that have changed since the previous search.
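
The daily "new or changed" filter described above can be illustrated with a short Python sketch. This is not QueryTracker's actual code, which the article does not describe; it assumes that a change is detected by comparing a hash of each page's text with the hash stored on the previous run, and the function names and example URLs are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Return a stable digest of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def new_or_changed(results, seen):
    """Return URLs that are new or whose text differs from the last run.

    results: dict mapping URL -> page text for today's search results.
    seen:    dict mapping URL -> digest from earlier runs (updated in place).
    """
    fresh = []
    for url, text in results.items():
        digest = fingerprint(text)
        if seen.get(url) != digest:  # never seen, or content has changed
            fresh.append(url)
        seen[url] = digest
    return fresh

# Hypothetical results for two consecutive daily runs of the same query.
seen = {}
day1 = {"http://example.org/a": "first version",
        "http://example.org/b": "stable"}
day2 = {"http://example.org/a": "second version",
        "http://example.org/b": "stable",
        "http://example.org/c": "brand new"}

first_run = new_or_changed(day1, seen)   # everything is new on day one
second_run = new_or_changed(day2, seen)  # only the changed and the new page
```

On the second run the unchanged page is dropped, so the user sees only the page that changed and the page that appeared since the previous search.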

The magic in QueryTracker comes from its automatic generation of an additional daily query, Howe said. That query, which is based on what the agent learns about the user's interests and priorities over time, is often superior to the user's original. QueryTracker filters the results of both queries for relevance and sends them to the user.

Because QueryTracker can generate its own searches, it can compensate for the poorly formed queries that many users write.

"Even people knowledgeable about the web are often either lazy or they are just not informed about how to write good queries," she said. The most common mistake, she said, was that queries were too short.

Jeannette Jenssen, a mathematics professor at Dalhousie University in Halifax, Nova Scotia, is taking search personalisation techniques a step further, to the "crawlers" that index web content before it can be searched.

She said the popular search engines have three drawbacks: they are increasingly charging corporate users for their services, they skew results in favour of advertisers, and they often retrieve huge amounts of irrelevant information.

But Jenssen's "focused crawler" indexes only pages related to prespecified topics and then tailors the rankings to the interests of the user.
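
The idea of indexing only on-topic pages can be sketched in a few lines of Python. This is a toy illustration rather than Jenssen's crawler: the in-memory `PAGES` "web", the keyword-overlap test and the decision not to follow links from off-topic pages are all assumptions made for the example, and the user-specific ranking step is left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "seed": ("search engines and web crawlers", ["a", "b"]),
    "a": ("crawlers index web pages for search", ["c"]),
    "b": ("recipes for apple pie", ["d"]),
    "c": ("ranking search results", []),
    "d": ("more pie recipes", []),
}

def on_topic(text, topic_terms, threshold=1):
    """A page is on topic if it shares at least `threshold` topic terms."""
    words = set(text.lower().split())
    return len(words & topic_terms) >= threshold

def focused_crawl(seed, topic_terms):
    """Breadth-first crawl that indexes only on-topic pages."""
    index = []
    queue = deque([seed])
    visited = {seed}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        if not on_topic(text, topic_terms):
            continue  # skip the page and do not follow its links
        index.append(url)
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return index

crawled = focused_crawl("seed", {"search", "crawlers", "index"})
```

With these topic terms the crawl indexes `seed`, `a` and `c`; the recipe page `b` is skipped, and `d`, which is reachable only through `b`, is never fetched.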

Filippo Menczer, a computer science professor at Indiana University in Bloomington, said conventional search engines determine a document's relevance by considering various things in isolation. They may first select a document because it contains the keywords in the query. Then, to rank the results, they may consider how many links point to the document.

Better results could be obtained from considering many such "measures of relevance" - including user preferences - in combination, and in considering combinations of pages rather than single pages, said Menczer.
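
A minimal Python sketch shows how several measures of relevance might be combined into one score. The article gives no formula, so the weighted sum, the weights and the sample documents below are illustrative assumptions, not Menczer's method.

```python
def keyword_score(text, query):
    """Fraction of the query's terms that appear in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0

def combined_score(doc, query, interests, max_links, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of keyword match, link popularity and user preference."""
    w_kw, w_links, w_pref = weights
    kw = keyword_score(doc["text"], query)
    links = doc["inlinks"] / max_links if max_links else 0.0
    pref = keyword_score(doc["text"], " ".join(interests))
    return w_kw * kw + w_links * links + w_pref * pref

# Hypothetical documents and a hypothetical IT professional's interests.
docs = [
    {"url": "rodent-page", "text": "the field mouse is a small rodent", "inlinks": 80},
    {"url": "hardware-page", "text": "install the usb mouse driver", "inlinks": 40},
]
interests = ["usb", "driver"]
max_links = max(d["inlinks"] for d in docs)

# Ranking on link count alone versus on the combined score.
by_links = sorted(docs, key=lambda d: d["inlinks"], reverse=True)
by_combined = sorted(
    docs,
    key=lambda d: combined_score(d, "mouse", interests, max_links),
    reverse=True,
)
```

Ranked on link count alone, the rodent page comes first. Once the user's interests are weighed in alongside the other measures, the hardware page moves to the top for the query "mouse".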

Such complex searches will become practical in three to five years, when computers are more powerful, he said.

"We'll do brute-force, large-scale data mining over the whole web - over many terabytes of information," he said.

IBM's WebFountain is a huge Linux cluster which runs 9,000 programs continuously and crawls 50 million new pages every day. It applies natural-language analysis concepts to extract meaning from unstructured text.

For example, it determines whether an entity is a person's name, company name, location, product, price and so on, and then it attaches searchable XML metadata tags to it.
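
The tagging step can be pictured with a small Python sketch. WebFountain's tag vocabulary and analysis methods are not given in the article, so the `entity` element, its `type` attribute and the lookup table that stands in for natural-language analysis are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup table standing in for real natural-language analysis.
GAZETTEER = {"Dan Gruhl": "person", "IBM": "company", "Almaden": "location"}

def tag_entities(text):
    """Wrap each known entity in an <entity type="..."> element."""
    # Try longer names first so multi-word entities are matched whole.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    root = ET.Element("doc")
    pos, last = 0, None
    for match in pattern.finditer(text):
        chunk = text[pos:match.start()]
        if last is None:
            root.text = chunk   # plain text before the first entity
        else:
            last.tail = chunk   # plain text between two entities
        last = ET.SubElement(root, "entity", type=GAZETTEER[match.group()])
        last.text = match.group()
        pos = match.end()
    rest = text[pos:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return ET.tostring(root, encoding="unicode")

sentence = "Dan Gruhl works at IBM in Almaden."
tagged = tag_entities(sentence)
```

Because each entity carries its type as an attribute, a search over the tagged text could ask for, say, every company mentioned in a document rather than matching keywords alone.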

"We are tagging the entire web, all of Usenet news, all the wire services and so on," said Dan Gruhl, WebFountain's chief architect at IBM's Almaden Research Centre.

The software can extract and tag the semantic meaning of unstructured text, but Gruhl said more research is needed to do reliable "sentiment analysis", which, for example, would let companies automatically monitor the reputations of their products.

Gary H Anthes writes for Computerworld
