Researchers are poised to revolutionise internet search
technologies over the next few years.
The most common thrust is to personalise search engines so that
they know, for example, that if the user is an IT professional and
searches for "mouse", the user is more likely to want information
about a PC device rather than about a rodent.
Adele Howe, a computer science professor at Colorado State
University, and Gabriel Somlo, a CSU graduate student, have built a
proof of concept called QueryTracker, a software agent which sits
between a user and a conventional search engine and looks for
information of recurring interest, such as the latest news about a
user's chronic illness.
QueryTracker submits a user's query to the search engine once a
day and returns results from new web pages and pages that have
changed since the previous search.
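The daily "new or changed" filter can be sketched roughly as follows. This is only an illustration, not QueryTracker's actual code: the function names, the use of a SHA-256 content hash as the change test and the in-memory `seen` dictionary are all assumptions made for the example.

```python
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest of a page's text (an assumed change test)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_or_changed(results, seen):
    """Return the URLs that are new or whose content differs from the last run.

    results: dict mapping URL -> page text from today's search (hypothetical).
    seen:    dict mapping URL -> fingerprint from earlier runs; updated in place.
    """
    fresh = []
    for url, text in results.items():
        digest = fingerprint(text)
        if seen.get(url) != digest:  # unseen URL, or content has changed
            fresh.append(url)
        seen[url] = digest
    return fresh
```

Calling `new_or_changed` once a day with the same `seen` dictionary would report only pages that did not appear, or looked different, the day before.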
The magic in QueryTracker comes from its automatic generation of
an additional daily query, Howe said. This second query is based on
what the agent learns about the user's interests and priorities
over time, and it is often superior to the user's original query.
QueryTracker filters the results of both queries for relevance and
sends them to the user.
Because QueryTracker can generate its own searches, it can
compensate for the poorly formed queries that many users write.
"Even people knowledgeable about the web are often either lazy
or they are just not informed about how to write good queries," she
said. The most common mistake, she said, was that queries were too
short.
Jeannette Jenssen, a mathematics professor at Dalhousie
University in Halifax, Nova Scotia, is taking search
personalisation techniques a step further, to the "crawlers" that
index web content before it can be searched.
She said the popular search engines have three drawbacks: they
increasingly charge corporate users for their services, they skew
results in favour of advertisers, and they often retrieve huge
amounts of irrelevant information.
But Jenssen's "focused crawler" indexes only pages related to
prespecified topics and then tailors the rankings to the interests
of the user.
Filippo Menczer, a computer science professor at Indiana
University in Bloomington, said conventional search engines
determine a document's relevance by considering various things in
isolation. They may first select a document because it contains the
keywords in the query. Then, to rank the results, they may consider
how many links point to the document.
Better results could be obtained from considering many such
"measures of relevance" - including user preferences - in
combination, and in considering combinations of pages rather than
single pages, said Menczer.
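One way to picture combining several measures of relevance is a weighted sum. The sketch below is an assumption-laden illustration, not Menczer's method: the document fields (`text`, `inlinks`, `topics`), the weights and the saturating link formula are all hypothetical, and it scores single pages only, not combinations of pages.

```python
def combined_score(doc, query_terms, user_topics, weights=(0.5, 0.3, 0.2)):
    """Blend keyword match, link popularity and user preference into one score."""
    words = set(doc["text"].lower().split())
    # Fraction of query terms that appear in the document text.
    keyword = sum(t.lower() in words for t in query_terms) / max(len(query_terms), 1)
    # Link count squashed into the range 0..1 (the constant 10 is arbitrary).
    links = doc["inlinks"] / (doc["inlinks"] + 10)
    # Fraction of the document's topics that match the user's interests.
    topics = set(doc["topics"])
    preference = len(topics & set(user_topics)) / max(len(topics), 1)
    w_kw, w_link, w_pref = weights
    return w_kw * keyword + w_link * links + w_pref * preference


def rank(docs, query_terms, user_topics):
    """Return docs sorted from highest to lowest combined score."""
    return sorted(
        docs,
        key=lambda d: combined_score(d, query_terms, user_topics),
        reverse=True,
    )
```

With these made-up weights, a page about PC mice with fewer incoming links can still outrank a heavily linked page about rodents when the user's topics include hardware, because the preference term is scored together with the other two.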
Such complex and powerful searches should become practical in
three to five years, when computers are more powerful, he predicted.
"We'll do brute-force, large-scale data mining over the whole
web - over many terabytes of information," he said.
IBM's WebFountain is a huge Linux cluster which runs 9,000
programs continuously and crawls 50 million new pages every day. It
applies natural-language analysis concepts to extract meaning from
unstructured text.
For example, it determines whether an entity is a person's name,
company name, location, product, price and so on, and then it
attaches searchable XML metadata tags to it.
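The tagging step can be illustrated with a toy example. It is a sketch under stated assumptions and does not reflect WebFountain's real pipeline or tag vocabulary: a small hand-made lookup table stands in for natural-language analysis, and the `entity` element and its `type` attribute are invented for the illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical lookup table standing in for real natural-language analysis.
ENTITIES = {
    "IBM": "company",
    "Dan Gruhl": "person",
    "Almaden": "location",
}


def tag_entities(text, entities=ENTITIES):
    """Wrap each known entity in an assumed <entity type="..."> XML element."""
    if not entities:
        return escape(text)
    # Try longer names first so multi-word entities win over shorter ones.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[pos:m.start()]))  # plain text before the match
        out.append(
            '<entity type="%s">%s</entity>' % (entities[m.group()], escape(m.group()))
        )
        pos = m.end()
    out.append(escape(text[pos:]))  # trailing plain text
    return "".join(out)
```

Once text carries tags like these, a search can be restricted to, say, only the company names in a document instead of every occurrence of a word.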
"We are tagging the entire web, all of Usenet news, all the wire
services and so on," said Dan Gruhl, WebFountain's chief architect
at IBM's Almaden Research Centre.
The software can extract and tag the semantic meaning of
unstructured text, but Gruhl said more research is needed to do
reliable "sentiment analysis", which, for example, would let
companies automatically monitor the reputations of their
products.
Gary H Anthes writes for Computerworld