Security Zone: can you prevent scraping or data harvesting?

We have all innocently copied and pasted text and images from websites, but scripts and bots have made it possible to lift website content on an industrial scale, a practice known as 'data scraping', writes Marino Zini, managing director at Sentor Managed Security Services UK.

Scrapers take for free what a company has spent large sums to develop, resulting in loss of revenue and loss of customer confidence in the brand. This is theft of digital property and an attack on the uniqueness of online brands. It is akin to the 'shrinkage' of goods in supermarkets. But can it be detected? Can it be stopped? And can scrapers be prosecuted?

The problem has grown tenfold in the past year alone, while the tools have become more sophisticated and the methods more anonymous and stealthy. One only has to type 'scraping' into a search engine to find dozens of such tools, offering various levels of sophistication and even guaranteed anonymity.

Sectors affected include online directories; the travel industry, with recent legal cases involving Ryanair and easyJet; online insurance companies; property listing sites; and B2B portals. In fact, any organisation with content-rich web listings is exposed. UK cases reported in the national press have included the scraping of Royal Mail postcode data and the copying of the National Gallery's digital images.

Data scraping covers a number of different methods of obtaining data from a website or database. It is referred to variously as 'screen scraping', 'web-scraping', 'web-harvesting' and even, worryingly, as 'rate-raping' by the insurance industry. Scrapers use scripts, 'bots', 'webots', 'crawlers', 'harvesters' or 'spiders', many of which are the same tools used by the likes of Google and Yahoo for searching and indexing. This makes it even more difficult to differentiate between good and bad bots when trying to identify scrapers. Furthermore, scrapers use anonymous proxies and Tor networks to avoid being tracked down.
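One partial way to separate good bots from impostors is a forward-confirmed reverse DNS check on any client that claims to be a search engine crawler. The sketch below is illustrative only: the function name, the injectable lookup parameters and the list of trusted host suffixes are assumptions made for this example, and a scraper that does not pretend to be a crawler, or that hides behind an anonymous proxy, would pass through it untouched.

```python
import socket

# Assumption: host name suffixes that a genuine Google crawler resolves to.
# Other search engines would need their own entries.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` passes a forward-confirmed reverse DNS check.

    The lookups are injectable (a choice made for this sketch) so that the
    logic can be exercised without touching the network.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: the IP address must map back to a trusted host name.
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
            return False
        # Step 2: that host name must resolve forward to the same IP address,
        # otherwise the reverse record could simply have been forged.
        return ip in forward_lookup(host)
    except OSError:
        # Failed lookups are treated as "not verified".
        return False
```

A client that announces itself as a search engine crawler but fails this check is a reasonable candidate for closer inspection; passing it says nothing about the many scrapers that simply imitate ordinary browsers.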

How can it be stopped? It is impossible for traditional network security devices such as firewalls, intrusion detection and prevention systems, or even application-layer firewalls to detect or block scrapers, because sophisticated scraping tools mimic user search patterns. However, technical countermeasures for detecting the practice are being developed. According to Nigel Ridgeon, head of analysis and information, real-time user pattern analysis has been very effective in keeping Yell ahead of the scrapers.
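To give a feel for what user pattern analysis can mean in practice, here is a minimal sketch, not the method used by Yell or any other company named in this article. It assumes each request can be attributed to a client identifier (for example a session or an IP address) and flags a client on two simple signals: too many requests inside a sliding time window, or request timing that is too regular to be human. The class name, thresholds and parameters are all hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class RequestPatternMonitor:
    """Flags clients whose request timing looks automated (illustrative only)."""

    def __init__(self, window_seconds=60.0, max_requests=30, min_jitter=0.1):
        self.window_seconds = window_seconds  # length of the sliding window
        self.max_requests = max_requests      # request budget per window
        self.min_jitter = min_jitter          # lowest "human" timing variation
        self._hits = defaultdict(deque)       # client id -> recent timestamps

    def record(self, client_id, timestamp):
        """Store one request and report whether the client now looks suspicious."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return self.is_suspicious(client_id)

    def is_suspicious(self, client_id):
        times = list(self._hits[client_id])
        # Signal 1: raw volume inside the window.
        if len(times) > self.max_requests:
            return True
        # Too few samples to judge regularity.
        if len(times) < 5:
            return False
        # Signal 2: near-constant gaps between requests. The coefficient of
        # variation (standard deviation / mean) is close to 0 for a script
        # firing on a timer and much larger for a person browsing.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        average_gap = mean(gaps)
        if average_gap == 0:
            return True
        return pstdev(gaps) / average_gap < self.min_jitter
```

In use, a web application would call `record(client_id, time.time())` on every request and challenge or throttle clients for which it returns `True`. A scraper that randomises its delays and spreads requests across many proxies would evade both signals, which is why real deployments combine far more behavioural features than this sketch does.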

Some civil legal actions have been conclusive, but cases are expensive and time-consuming, and the evidence is difficult to gather because of the anonymity of the methods used in scraping. A recent article called 'Scrapping Over Data' by London law firm DMH Stallard gives a very good account of the civil cases.

The stakes are high. Effects may include loss of revenue, system overload as a result of massive bot activity, loss of advertising revenue, and loss of control of content and its subsequent devaluation. Scraping also has implications for search engine rankings. Perversely, scrapers can rank above the brands they harvested from as they draw potential customers away to other sites. These implications are drawing increasing attention to what amounts to a growing area of industrial espionage.


This was last published in November 2009
