
Search beyond search engines

All-purpose search engines can have drawbacks for professional searchers. Technologies are emerging to take consultants, journalists, academics and others beyond search engines

All-purpose search engines allow billions of people immediate access to immense volumes of information. They have been valued at a median of US$17,530 a year (about £14,000) by 80,000 participants in a study published this year by researchers from the Massachusetts Institute of Technology (MIT) and the University of Groningen.

Respondents were asked how much compensation they would need to give up various digital services. Search engines were valued more than twice as much as email, nearly five times as much as online maps, and more than 50 times as much as social media.

The research set out to measure “consumer surplus” for such services, the difference between what people would be willing to pay and what they actually pay – in most cases, nothing. The authors argue that such valuable but free services undermine economic measures, including gross domestic product.

But search engines provide even greater benefits for those who need to find information professionally. A few years ago, a journalist half-remembering this research several months after its publication could have spent hours searching for it. Now, entering “search engines consumer surplus” into DuckDuckGo.com generates a list topped with an MIT press release about the study.

But all-purpose search engines have their drawbacks. These can include loss of privacy, particularly with the dominant provider Google, although there are privacy-focused rivals such as DuckDuckGo. For professional users, there are more pressing problems, including difficulty finding certain types of material, imprecise handling of search terms, an inability to set professionally important parameters, and limited methods for handling non-verbal material. Addressing these can mean going beyond the all-purpose search engine to find more focused services.

Jisc provides technology services to its members, the UK’s universities, colleges and research organisations, including their libraries. In July, it completed development of the national bibliographic knowledgebase (NBK), a database of 41 million records created from 133 institutional library catalogues, which, as well as universities, include national libraries, charities such as the National Trust, museums such as the V&A and research institutions including Wellcome. It aims to cover 200 organisations by summer 2020.

NBK has replaced earlier “union” catalogues Copac and Suncat, but as well as incorporating more types of material and organisations, it adds search engine-style services. Library Hub Discover, which anyone can use, has a single search box as its main interface, although targeted searches, such as by author, subject and institution, are also available. Results can include links to a growing number of online resources, such as digitised versions of books managed by libraries or publishers’ websites.

“Google has defined ease of search for everybody,” says Neil Grindley, Jisc’s head of resource discovery, noting that students are used to single search boxes. But in this case, all the results are items held by academic and research libraries, he says. “You are searching in a very large but banded territory.”

This matters because all-purpose search engines do a poor job of including such material, says Grindley. “Libraries haven’t been able to make themselves felt on the open web,” he adds. “We want to do something about that.”

As well as providing its own services based on NBK, Jisc is publishing the underlying data, so that search engines can use it to link directly to institutional library catalogues.

Pay-walled search

While Jisc is trying to increase and improve the quality of what can be found openly, commercial providers generally build pay-walled search and discovery systems focused on making knowledge-based professionals more efficient. This can include access to subscriber-only information, such as publications for lawyers and accountants, and can also involve using machine learning to improve keyword-based searches.

London-based Signal AI applies tens of thousands of “classifiers” to both the open and pay-walled material it manages for clients. These are equivalent to tags used by publications and blogs covering brands, countries, people and topics. But rather than being chosen by humans, they are applied by a machine learning-trained system, typically in less than a minute. It also calculates other measures, including salience.

Earlier this year, Signal AI announced a deal with Deloitte that allows the consultancy to offer clients a service that monitors regulatory sources in more than 100 jurisdictions as well as media sources. This uses classifiers trained with Deloitte’s help to group material on specific types of taxation, even when different countries use different terminology. The results can be passed on through regular email newsletters.

Signal AI started in media monitoring, but has expanded into serving those working in compliance, risk and senior management. Amy Collins, the company’s vice-president for product, sees further potential in sales, product management and engineering. Although it is possible to build sophisticated queries in all-purpose search engines, this can be complicated and unreliable, she says.

“We’ve solved that through machine learning,” says Collins. “We’ve made the search problem very simple.”

The company also allows users to train their own classifiers through a system officially named Vulcan, which Collins calls “Tinder for AI”: users refine a search by accepting or rejecting its output.
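The accept-or-reject refinement Collins describes can be illustrated with a toy sketch. This is not Signal AI's Vulcan system, whose internals are not described here; the class `FeedbackClassifier`, its methods, the step size and the sample headlines are all hypothetical, and a simple keyword-weight update stands in for whatever machine learning the company actually uses.

```python
class FeedbackClassifier:
    """Toy keyword-weight classifier refined by accept/reject feedback.

    Hypothetical illustration only; not Signal AI's implementation.
    """

    def __init__(self, seed_terms):
        # Start from the user's initial search terms, each with weight 1.0.
        self.weights = {term.lower(): 1.0 for term in seed_terms}

    def score(self, text):
        # Sum the weights of the distinct words that appear in the text.
        words = set(text.lower().split())
        return sum(self.weights.get(word, 0.0) for word in words)

    def matches(self, text, threshold=1.0):
        return self.score(text) >= threshold

    def feedback(self, text, accepted, step=0.5):
        # Accepting a result raises the weight of its words;
        # rejecting it lowers them.
        delta = step if accepted else -step
        for word in set(text.lower().split()):
            self.weights[word] = self.weights.get(word, 0.0) + delta


classifier = FeedbackClassifier(["tax"])
relevant = "new corporate tax rules announced"
irrelevant = "tax the rich says pop star"

# Before any feedback, both headlines match on the seed term alone.
print(classifier.matches(relevant), classifier.matches(irrelevant))

# The user accepts one result and rejects the other.
classifier.feedback(relevant, accepted=True)
classifier.feedback(irrelevant, accepted=False)

# After feedback, only the accepted style of headline still matches.
print(classifier.matches(relevant), classifier.matches(irrelevant))
```

Each swipe nudges the weights, so the search narrows without the user having to write a complicated query by hand.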

Krzana, another London-based company, takes a similar approach in serving its media clients. These include Reach, which runs local newsrooms such as those of the Manchester Evening News and Birmingham Mail, as well as a national broadcaster and fact-checking services.

Journalists are supposed to focus on the “five Ws” when writing stories – who, what, when, where and why – and Krzana helps them with the first by using machine learning to detect the people and organisations mentioned in material.

On “when”, founder and chief technology officer Toby Abel says the system’s architecture includes a “changelog” model focused on what has appeared recently. “That’s a focus that doesn’t exist in average search,” he says. For journalists covering geographically defined areas, “where” is important, so Krzana geolocates material so journalists in the West Midlands’ biggest city are not distracted by stories on Birmingham in Alabama.

The system can also help journalists to apply institutional practices, such as what a certain kind of story usually includes. “There’s a great deal of creativity in what they do, but there is also a great deal of pattern,” says Abel. “A tailor-made search engine can encode some of that.”

For example, the system can suggest that a story on a festival might include comments from local people, businesses, those attending and those negatively affected.

The Inject Project, a service being developed for journalists with European Union funding, aims to boost journalists’ creativity by using artificial intelligence to provide related but different material. It draws on 380 news sources and more than 16 million articles in six languages, with partners including German news agency Deutsche Presse-Agentur.

“We are not going to make journalists more creative,” says Neil Maiden, professor of digital creativity at Cass Business School, City, University of London. “What we think we can do is make them as creative but more quickly than they are at the moment.”

Leads and ideas

The system suggests leads and ideas, so, for example, a search on the resignation in May of Cyprus’ justice minister, Ionas Nicolaou, over the murder of foreign women generates links to the disappearance of foreigners in Greece and other Mediterranean countries. Maiden says the system aims to provide suggestions in four areas – evidence, human interest, quirky or humorous angles, and future ramifications.

“These angles aren’t particularly novel,” he says. “Our job has been to try to codify them in terms of manipulating existing news to nudge journalists towards new stories.”

All these services focus on words, but some people work mainly with images or data. New York-based image library Shutterstock says that well over 90% of users use keyword searches to find images – a process that it works to enhance by suggesting popular keywords to contributing photographers and image-makers when they write descriptions. The company plans to extend its use of natural language processing so that contributors can write in any of the 21 languages in which it already allows users to search.

Shutterstock has also launched ways to search using images, based on factors such as colours and objects shown, with technology that it calls “computer vision”. Reveal, a one-to-many search, aims to return images similar to those identified through a Chrome browser extension, with a just-launched version that can also return video footage. Refine, a many-to-many search, allows users to train the search facility with pictures they like.

The company says computer vision search pages are involved in 12% of search page views and 26% of downloads.

“We are just approaching 300 million images,” says Peter Silvio, senior vice-president for engineering and architecture. “The challenge of putting the right image in front of the person at the right time becomes an exponentially difficult problem to solve. Providing these additional discoverability channels really allows the user to dive deep into what exactly they’re looking for.”

On data, Google and others offer free online graphing services, known as data visualisation. However, paid-for services can offer lots of extras. Seattle-based Tableau has recently added Ask Data, which generates visualisations based on queries entered in normal language, and Explain Data, which uses statistical methods to suggest reasons for unexpected values in a set of data.

Ease of use

Paul Heather, director of public sector at Tableau, says the overall aim is ease of use, allowing data scientists and others in public services, such as healthcare, to visualise data more quickly. “It’s around saving lives or higher-quality treatment,” he adds.

Tableau has a number of NHS users, including the Greater Manchester Health and Social Care Partnership, which uses the software to generate dashboards that help decide the best hospital for ambulance-borne patients.

Cambridge-based GeoSpock focuses on making vast amounts of machine-generated data searchable by space and time, with the intention of handling internet of things sensor data. In August, it announced a partnership with the Baltic Exchange, a maritime market information specialist, to develop a global spatial database for the industry with an initial focus on air emissions, given new regulations in this area.
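What searching “by space and time” means can be shown with a minimal sketch. It is not GeoSpock's technology: the `Reading` record, the `search` function and the sample data are hypothetical, and a simple scan over a short list stands in for the indexing a system handling vast volumes of sensor data would need.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Reading:
    """One hypothetical sensor reading with a position and a timestamp."""
    sensor_id: str
    lat: float
    lon: float
    taken_at: datetime
    value: float


def search(readings, lat_range, lon_range, time_range):
    """Return readings inside a lat/lon box and a half-open time window."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    start, end = time_range
    return [
        r for r in readings
        if lat_min <= r.lat <= lat_max
        and lon_min <= r.lon <= lon_max
        and start <= r.taken_at < end
    ]


# Invented sample data: two readings near London, one near New York.
readings = [
    Reading("ship-a", 51.5, 0.1, datetime(2019, 8, 1, 9, 0), 4.2),
    Reading("ship-b", 51.6, 0.2, datetime(2019, 8, 2, 9, 0), 3.9),
    Reading("ship-c", 40.7, -74.0, datetime(2019, 8, 1, 9, 30), 5.1),
]

# Everything recorded in the box around London on 1 August 2019.
hits = search(
    readings,
    lat_range=(51.0, 52.0),
    lon_range=(-1.0, 1.0),
    time_range=(datetime(2019, 8, 1), datetime(2019, 8, 2)),
)
print([r.sensor_id for r in hits])
```

The query combines a place and a period rather than keywords. This sketch ignores boxes that cross the 180th meridian, which a real maritime system would have to handle.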

Geospatial maritime data is increasing rapidly because of moves towards autonomous shipping and increasing demands for cargo tracking. GeoSpock chief executive Richard Baker says there is also potential from other types of logistics, local governments establishing smart sensors in physical infrastructure, mobile operators and data-focused advertisers. “What Google did for the web, we want to do for physical infrastructure,” he says.

The difference between all-purpose search engines and many of the organisations seeking to go beyond them is that the latter want to charge for their services. But if they help professionals discover and exploit material more quickly and effectively, they may be worth paying for – even if the consumer surpluses are lower.
