As anyone who’s spent any time in the web scraping field will know, there are plenty of anti-scraping techniques on the market. And since I regularly get asked what the best way is to prevent someone from scraping a site, I thought I’d do a post rounding up some of the most popular methods. If you think I’ve missed any out, please let me know in the comments below!
[box class="note grey"]If you are interested in finding out whether your site is being scraped, see this post: How to detect if your site is being scraped?[/box]
Type A. Server side filtering based on web requests
1. Blocking suspicious IPs
Blocking suspicious IPs works quite well: you ban IPs that send too many requests from the same address or geolocation. But in today’s world, most scraping is done through proxies, so if you ban one IP address, the scraper will simply move on to another, staying effectively IP-independent and undetected. As a long-term plan, then, IP blocking isn’t a particularly good one.
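The idea can be sketched as follows. This is a minimal in-memory example, not production code; the thresholds and names (`MAX_REQUESTS`, `WINDOW_SECONDS`) are illustrative and would need tuning for real traffic:

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds -- tune for your own traffic patterns.
MAX_REQUESTS = 100   # requests allowed per window
WINDOW_SECONDS = 60  # sliding window length in seconds

request_log = defaultdict(deque)  # ip -> timestamps of recent requests
banned_ips = set()

def should_block(ip, now=None):
    """Return True if this IP is banned or has exceeded the threshold."""
    now = time.time() if now is None else now
    if ip in banned_ips:
        return True
    log = request_log[ip]
    log.append(now)
    # Drop timestamps that have fallen outside the sliding window.
    while log and log[0] < now - WINDOW_SECONDS:
        log.popleft()
    if len(log) > MAX_REQUESTS:
        banned_ips.add(ip)
        return True
    return False
```

In a real deployment this state would live in something shared like Redis rather than process memory, and bans would expire rather than last forever.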
2. Using DNS level filtering
Using a DNS firewall is another popular anti-scraping measure. Essentially, you connect to a web service — a private domain name server (DNS) network — that filters and drops bad requests before they ever reach your server. Some companies provide this sophisticated measure as part of a complete website protection package. Here’s an example of such a service.
3. Have a custom script to track user statistics and drop troublesome requests
In some cases it is possible to detect the algorithm a scraper uses to crawl a site’s URLs. You’d need a custom script that tracks request URLs and, based on the pattern it finds, turns on protection measures. For this you have to run a [shell] script on your server. An unfortunate side effect is that system response time will increase, slowing down your services. It’s also possible that the algorithm you’ve detected will later change, invalidating the measure.
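As a toy illustration of pattern detection, the sketch below flags a client that walks numeric URL IDs in strict ascending order — a common signature of a naive crawler. The URL shape (`/item/<id>`) and the history length are hypothetical:

```python
import re
from collections import defaultdict, deque

HISTORY = 5  # how many strictly sequential IDs count as suspicious

recent_ids = defaultdict(deque)  # client_ip -> last few numeric IDs seen

def looks_like_crawler(client_ip, path):
    """Flag clients that request numeric URL IDs in ascending sequence."""
    m = re.search(r"/item/(\d+)$", path)
    if not m:
        return False
    ids = recent_ids[client_ip]
    ids.append(int(m.group(1)))
    if len(ids) > HISTORY:
        ids.popleft()
    # A full history of consecutive ascending IDs -> likely enumeration.
    return len(ids) == HISTORY and all(
        b == a + 1 for a, b in zip(ids, list(ids)[1:])
    )
```

A human browsing a catalogue jumps around; a script enumerating `/item/1`, `/item/2`, `/item/3`… trips the check after a handful of requests.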
4. Limit requests frequency
You might limit the frequency of requests or the amount of downloadable data. The restrictions must be applied with the usability for a normal user in mind. Against a scraper’s insistent requests, you may be able to configure your web service rules to drop or delay the unwanted activity. But if the scraper is reconfigured to imitate common user behaviour (through now well-known tools such as Selenium, Mechanize or iMacros), this measure will fail.
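Frequency limiting is commonly implemented as a token bucket: requests spend tokens, tokens refill at a steady rate, and a small burst allowance keeps normal users unaffected. A minimal sketch, with illustrative rate and burst values:

```python
import time

class TokenBucket:
    """Token-bucket limiter: steady refill rate plus a burst allowance."""

    def __init__(self, rate=2.0, burst=5, now=None):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum bucket size
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Spend one token if available; otherwise the caller drops or delays."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Production setups usually do this at the edge (e.g. a rate-limiting module in the reverse proxy) rather than in application code, but the mechanism is the same.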
5. Setting maximum session length
This measure is similar to the previous one: you kill any activity that runs longer than a certain time frame. But modern scrapers usually perform session authentication and simply start a new session, so cutting off session time is not that effective.
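A hard session cap can be sketched like this; the 30-minute limit and the in-memory session store are illustrative assumptions:

```python
import time

MAX_SESSION_SECONDS = 30 * 60  # illustrative: 30-minute hard cap

sessions = {}  # session_id -> creation timestamp

def session_valid(session_id, now=None):
    """Reject sessions older than the hard cap, forcing re-authentication."""
    now = time.time() if now is None else now
    created = sessions.get(session_id)
    if created is None:
        sessions[session_id] = now  # first sighting: start a new session
        return True
    if now - created > MAX_SESSION_SECONDS:
        del sessions[session_id]    # expired: drop it and make the client log in again
        return False
    return True
```

As the text notes, a scraper that handles authentication will just re-login when its session dies, so this mostly adds friction rather than real protection.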
Type B. Browser based identification and prevention
1. Set CAPTCHAs for target pages
This is an age-old technique that, for the most part, does solve the scraping issue. But using it means that anyone who visits your site will have to fill in one of those annoying boxes, which could severely cut down on your site traffic. Additionally, if your scraping opponent knows how to leverage one of the anti-captcha services, this form of protection will most likely not work. You can read more on captcha solvers here.
Type C. Content based protection
1. Disguising important data as images
This method of content protection is widely used today, and it does prevent scrapers from collecting data. The side effect is that data obfuscated as images is hidden from search engine indexing, which downgrades your site’s SEO. And if scrapers manage to leverage an OCR system, this kind of protection can again be bypassed.
2. Frequent page structure change
This is a very effective way to prevent scraping. It works by changing not just element ids and classes, but the entire hierarchy. It does of course mean that you will have to do a complete style restructuring every time you change it, which will cost you. But on the plus side, the scraper must adapt to the new structure if it wants to keep scraping your content.
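A lightweight variant of this idea is to randomize class names on every deploy, so a scraper’s hard-coded CSS selectors break while your own templates and stylesheets, rendered from the same mapping, keep working. A minimal sketch (the `c-` prefix and token length are arbitrary choices):

```python
import secrets
import string

def random_token(n=8):
    """Generate a short random lowercase token."""
    alphabet = string.ascii_lowercase
    return "".join(secrets.choice(alphabet) for _ in range(n))

def build_class_map(logical_names):
    """Map stable logical names to fresh random CSS class names.

    Run once per deploy; both templates and stylesheets are rendered
    from the same map, so the site keeps working while scrapers'
    hard-coded selectors stop matching.
    """
    return {name: f"c-{random_token()}" for name in logical_names}
```

This only raises the cost for the scraper: someone determined will switch to matching on page text or element position instead of class names, which is why the author recommends changing the hierarchy itself, not just the identifiers.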