Protecting web sites from web scrapers

If you sell something, it makes sense that news of what you are selling spreads far and wide to attract as many buyers as possible. To this end, e-commerce sites want to be found near the top of relevant search lists and to be included on price comparison sites. It is accepted that automated software robots (bots) must access web sites to achieve this, including the web crawlers used by search engines and web scrapers used by price comparison sites; these are so-called good-bots.

However, not all bots are good and, as the digital commerce platform provider Datalex discovered, some bots can be very bad indeed. The Ireland-based company provides a unified e-commerce platform for travel operators, combining pricing, shopping, order management and analysis for journey bookings across the variety of access channels that would-be travellers wish to use. Its European customers include Virgin Atlantic, Brussels Airlines, Swiss International Air Lines and Aer Lingus, and it has many more across the globe.

Datalex enables its customers, which are mainly travel operators, to manage complex personalised bookings for travellers. As well as the actual ticket for a journey, this might include increased baggage allowances, seat upgrades, lounge access, in-flight meals, car hire, flights, hotels, travel insurance, ground transportation and so on.

The trouble is that such information is not just of interest to legitimate travellers planning their journeys and benign good-bots. Unscrupulous competitors use web scrapers to steal content from travel sites and re-post it on their own sites (which can negatively impact search engine optimisation), and to monitor and undercut prices.

Web scraping activity can be persistent. It hurts performance and drives up back-end costs, because every request, whether it comes from a legitimate user or a bad bot, can run up charges for call-outs to other services. Aggregated across the Datalex platform, this can become a problem for all the customers it hosts, even the ones that are not being directly targeted.

Mitigating web scrapers is tricky as you do not want to block the good ones. In a recent e-book, The Ultimate Guide to Preventing Web Scraping, Quocirca looked at the problem of distinguishing good-bots from bad-bots and controlling their activity.

There is a protocol called the robots exclusion standard (or simply robots.txt), which good-bots use to check which areas of a web site they are welcome to visit; however, this relies on etiquette and bad-bots will just ignore it. Manually blocking the IP addresses that host bad-bots is tiresome, as it is easy for the perpetrators to just move their web scrapers to new locations. As most bad-bots mimic legitimate user behaviours, it is hard for web application firewalls, which focus on anomalies and vulnerabilities, to detect them. Login enforcement, strong authentication and “are you a human?” tests are all distractions for legitimate users and good-bots.
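To make the etiquette point concrete, here is a minimal sketch in Python of how a well-behaved bot might consult robots.txt before fetching a page, using the standard library's urllib.robotparser. The rules, the bot names (ExamplePriceBot, GoodCrawler) and the shop.example URLs are invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/

User-agent: ExamplePriceBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser a polite bot can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("GoodCrawler", "https://shop.example/fares/"),
        ("GoodCrawler", "https://shop.example/checkout/pay"),
        ("ExamplePriceBot", "https://shop.example/fares/"),
    ]:
        print(agent, url, may_fetch(rules, agent, url))
```

The check runs entirely on the bot's side: nothing in the protocol stops a scraper from skipping it and requesting the disallowed pages anyway.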

The answer for Datalex in the end was specialist bot detection and mitigation technology from a vendor called Distil Networks. The aim is to use a reverse web proxy to detect bots directly through a range of techniques including behavioural analysis, digital fingerprinting and machine learning. Bots can then be classified and policies applied; good bots can be white-listed and bad-bots, including unwanted web-scrapers, blocked. Datalex said it eliminated the unwanted hits against its customers’ sites, making them more stable and reducing backend infrastructure costs. On average, eliminating bad bots decreased traffic to Datalex customer sites by 20-30% with no impact on real human users.
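The classify-then-apply-policy idea can be sketched in a few lines of Python. This is a toy illustration, not Distil Networks' method: it swaps the behavioural analysis, fingerprinting and machine learning described above for a single crude signal, the request rate per client in a sliding time window. The allow-list entry, thresholds and class name are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical allow-list of good-bot user agents (lower-cased).
# User-agent strings are easy to spoof, so a real deployment would
# need to verify a bot's identity by other means.
ALLOWED_BOTS = {"examplesearchbot"}


class BotPolicy:
    """Toy policy: allow-listed bots pass; any other client that sends
    more than max_requests within window seconds is blocked."""

    def __init__(self, max_requests: int = 5, window: float = 10.0):
        self.max_requests = max_requests
        self.window = window
        self._hits = defaultdict(deque)  # client id -> recent timestamps

    def decide(self, client_id: str, user_agent: str, now: float) -> str:
        if user_agent.lower() in ALLOWED_BOTS:
            return "allow"
        hits = self._hits[client_id]
        hits.append(now)
        # Discard timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        return "block" if len(hits) > self.max_requests else "allow"


if __name__ == "__main__":
    policy = BotPolicy(max_requests=3, window=10.0)
    for t in (0.0, 1.0, 2.0, 3.0, 20.0):
        print(t, policy.decide("client-1", "Mozilla/5.0", t))
```

A rate threshold alone would miss a scraper that paces itself like a human visitor, which is why the approach described above combines several detection techniques.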

Quocirca’s e-book, which was sponsored by Distil Networks, can be accessed here. Other bot control products are available from vendors such as Akamai, Imperva and Shape Security.

More information on the Datalex story can be seen here.