
My site is being scraped, how can I prevent it?

As anyone who has spent any time in the field of scraping will know, there are plenty of anti-scraping techniques on the market. And since I regularly get asked what the best way to prevent someone from scraping a site is, I thought I'd do a post rounding up some of the most popular methods. If you think we've missed any out, please let me know in the comments below!

If you are interested in how to find out whether your site is being scraped, see this post: How to detect your site is being scraped?

Type A. Server-side filtering based on web requests

1. Blocking suspicious IPs

Blocking suspicious IPs works quite well: you ban IPs that send too many requests from the same geo location. But in today's world, most scraping is done through proxies, so if you ban one IP address the scraper will simply move on to another, staying IP-independent and undetected. This means that, as a long-term plan, IP blocking isn't a particularly good one.
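
For illustration, here is a minimal sketch of this kind of per-IP blocking as a Flask before-request hook. The one-minute window, the request threshold and the in-memory counters are assumptions made for the example; a real setup would keep the counters in a shared store and also factor in geo-location data.

```python
# Minimal sketch of per-IP blocking (assumed thresholds, in-memory counters).
import time
from collections import defaultdict, deque
from flask import Flask, request, abort

app = Flask(__name__)

WINDOW_SECONDS = 60               # look at the last minute of traffic
MAX_REQUESTS_PER_WINDOW = 120     # arbitrary example threshold
banned_ips = set()
recent_hits = defaultdict(deque)  # ip -> timestamps of recent requests

@app.before_request
def block_suspicious_ips():
    ip = request.remote_addr
    if ip in banned_ips:
        abort(403)
    hits = recent_hits[ip]
    now = time.time()
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()            # drop timestamps that fell out of the window
    if len(hits) > MAX_REQUESTS_PER_WINDOW:
        banned_ips.add(ip)
        abort(403)
```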

2. Using DNS-level filtering

Using a DNS firewall is another popular anti-scraping measure. Basically, you connect to a web service, a private domain name server (DNS) network, that filters and blocks bad requests before they ever reach your server. Some companies provide this sophisticated measure for complex website protection. Here's an example of such a service.

3. Have a custom script to track users' statistics and drop troublesome requests

It is possible in some cases to detect the algorithm a scraper is using to crawl site URLs. You'd need a custom script that tracks request URLs and, based on this, turns on protection measures. For this you have to run a [shell] script on your server. An unfortunate side effect is that system response times will increase, slowing down your services. It's also possible that the algorithm you've detected will be changed, invalidating this measure.
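
As a toy example, here is a sketch of such a tracker that flags a client walking item pages in strictly sequential ID order. The /items/<id> URL pattern and the ten-request threshold are purely illustrative assumptions; a real script would be tailored to the pattern you actually observe.

```python
# Sketch: flag a client that requests item pages in strictly sequential ID order.
import re
from collections import defaultdict

ITEM_URL = re.compile(r"^/items/(\d+)$")   # assumed URL scheme
SEQUENTIAL_LIMIT = 10                      # assumed threshold

last_id = defaultdict(lambda: None)        # client -> last item id requested
sequential_hits = defaultdict(int)         # client -> consecutive sequential hits

def looks_like_sequential_crawl(client_ip, path):
    match = ITEM_URL.match(path)
    if not match:
        return False
    item_id = int(match.group(1))
    if last_id[client_ip] is not None and item_id == last_id[client_ip] + 1:
        sequential_hits[client_ip] += 1
    else:
        sequential_hits[client_ip] = 0
    last_id[client_ip] = item_id
    return sequential_hits[client_ip] >= SEQUENTIAL_LIMIT
```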

4. Limit request frequency

You might set a limit on request frequency or on the amount of downloadable data. The restrictions must be applied with the usability for a normal user in mind; compared with a scraper's insistent requests, you might be able to set your web service rules to drop or delay the unwanted activity. But if the scraper is reconfigured to imitate common user behaviour (through some nowadays well-known tools: Selenium, Mechanize, iMacros), this measure will fail.
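
One common way to implement such a limit is a token bucket per client. The rate and capacity below are made-up example values; the idea is just that sustained bursts of requests run out of tokens while a normal user never notices.

```python
# Sketch of a per-client token-bucket limit (example rate and capacity).
import time

class TokenBucket:
    def __init__(self, rate_per_second, capacity):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to the time elapsed since the last request
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def request_allowed(client_ip):
    bucket = buckets.setdefault(client_ip, TokenBucket(rate_per_second=2, capacity=20))
    return bucket.allow()
```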

5. Setting maximum session length

This measure is similar to the previous one: you kill any session that lasts longer than a certain time frame. But modern scrapers usually do perform session authentication, so cutting sessions off after a fixed time is not that effective.
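
A minimal sketch of the idea, assuming a Flask app with cookie-based sessions and an arbitrary 30-minute cut-off:

```python
# Sketch: clear any session older than a fixed maximum age (assumed values).
import time
from flask import Flask, session, abort

app = Flask(__name__)
app.secret_key = "change-me"          # placeholder secret for the session cookie

MAX_SESSION_SECONDS = 30 * 60         # arbitrary 30-minute cut-off

@app.before_request
def enforce_session_length():
    started = session.setdefault("started_at", time.time())
    if time.time() - started > MAX_SESSION_SECONDS:
        session.clear()
        abort(403)
```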

Type B. Browser-based identification and prevention

1. Set CAPTCHAs for target pages

This is an age-old technique that for the most part does solve the scraping issue. But using it means that anyone who visits your site will have to fill in one of those annoying boxes, which could severely cut down your site traffic. Additionally, if your scraping opponent knows how to leverage one of the anti-CAPTCHA services, this form of protection will most likely not work. You can read more on CAPTCHA solvers here.
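
If you go with a hosted CAPTCHA such as Google reCAPTCHA, the server-side part mostly comes down to verifying the token the widget produced before serving the protected page. A minimal sketch (the secret key is a placeholder):

```python
# Sketch: verify a reCAPTCHA token on the server before serving protected content.
import requests

RECAPTCHA_SECRET = "your-secret-key"   # placeholder

def captcha_passed(token, client_ip=None):
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": client_ip},
        timeout=5,
    )
    return resp.json().get("success", False)
```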

2. Injecting JavaScript logic into the web service response

JavaScript code arrives at the client's browser (or scraping server) prior to, or along with, the requested HTML content. This code computes and returns a certain value to the target server. Based on this test, the HTML content might be malformed or might not be sent to the requester at all, thus locking malicious scrapers out. The logic can be placed in one or more JavaScript-loadable files, and it can be applied not just to the whole content but also to only certain parts of the site's content (e.g. prices). To bypass this measure, scrapers need to turn to even more complex scraping logic (usually JavaScript-capable) that is highly customizable and thus costly. There are also libraries available for evaluating injected JavaScript that let a scraper reach the site content anyway.
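
A deliberately simplified sketch of the idea, assuming a Flask server: the real HTML is withheld until the client has executed a small piece of JavaScript and sent back the value it computed (here via a cookie). The 6 * 7 challenge is of course a stand-in for something far less predictable.

```python
# Sketch: withhold the real HTML until the client proves it can execute JavaScript.
from flask import Flask, request, make_response

app = Flask(__name__)

CHALLENGE_PAGE = """
<html><body>
<script>
  // A real deployment would compute something far less predictable than 6 * 7.
  document.cookie = "js_token=" + (6 * 7) + "; path=/";
  location.reload();
</script>
</body></html>
"""

@app.route("/products")
def products():
    if request.cookies.get("js_token") != "42":
        return make_response(CHALLENGE_PAGE)   # client has not run the JS yet
    return "<html><body>Real product listing goes here.</body></html>"
```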

Type C. Content-based protection

1. Disguising important data as images

This method of content protection is widely used today, and it does prevent scrapers from collecting data. The side effect is that data obfuscated as images is hidden from search engine indexing, which downgrades your site's SEO. And if scrapers manage to leverage an OCR system, this kind of protection can again be bypassed.
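
As an illustration of the mechanics, here is a small Pillow sketch that renders a price as a PNG so it can be served as an image instead of machine-readable text (the size, font and price are arbitrary examples):

```python
# Sketch: render a sensitive value (e.g. a price) as a PNG with Pillow.
from io import BytesIO
from PIL import Image, ImageDraw

def price_as_png(price_text):
    image = Image.new("RGB", (120, 32), "white")
    draw = ImageDraw.Draw(image)
    draw.text((4, 8), price_text, fill="black")   # default bitmap font
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()

png_bytes = price_as_png("$19.99")   # serve with Content-Type: image/png
```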

2. Frequent page structure change

This is a very effective way to prevent scraping. It works by changing not just element ids and classes but the entire page hierarchy. It does of course mean that you will have to do a complete style restructuring every time you change it, which will cost you. But on the plus side, the scraper must adapt to the new structure if it wants to keep scraping your content.
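
Restructuring the whole hierarchy is template work rather than something a snippet can show, but the simpler half of the idea, regenerating class names on every release, can be sketched like this (the class names and seed are examples):

```python
# Sketch: generate a fresh class-name mapping per deployment so the markup
# and the matching CSS change between releases.
import random
import string

STABLE_CLASSES = ["product-card", "price", "title"]   # example names

def build_class_map(seed):
    rng = random.Random(seed)                          # seed per release
    return {
        name: "c" + "".join(rng.choices(string.ascii_lowercase, k=8))
        for name in STABLE_CLASSES
    }

def obfuscate_markup(html, class_map):
    # swap stable class names for this release's randomized ones
    for stable, randomized in class_map.items():
        html = html.replace(stable, randomized)
    return html

class_map = build_class_map(seed="release-2024-01")
print(obfuscate_markup('<span class="price">$19.99</span>', class_map))
```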
