7 Ways to Protect Website from Scraping and How to Bypass this Protection

In this article I’d like to review a few well-known methods of protecting website content from automated scraping. Each has its advantages and disadvantages, so the choice depends on your particular situation. None of these methods is foolproof, and each has workarounds, which I mention below.

If you are interested in how to find out whether your site is being scraped, see this post: How to detect your site is being scraped?

1. IP-address ban

The easiest and most common way to detect scraping attempts is to analyze the frequency of requests to the server. If requests from a certain IP address arrive too often or in too great a number, the address may be blocked, and the visitor is often asked to solve a CAPTCHA to get unblocked.

The most important thing in this protection method is to find the boundary between normal request frequency and volume and scraping attempts, so that ordinary users are not blocked. This threshold is usually determined by analyzing the behavior of ordinary users.
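As a rough illustration of the frequency check described above, here is a minimal sliding-window counter in Python. The class name `SlidingWindowLimiter` and the default limits are assumptions made for this sketch; real thresholds should come from observing your own users.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Flags an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        # Both defaults are illustrative, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Record a request and return False if the IP is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # candidate for a CAPTCHA challenge or a block
        hits.append(now)
        return True
```

A request handler would call `allow(client_ip)` on every request and show a CAPTCHA or return an error when it gets `False`. Keeping the state in memory only works for a single process; a shared store would be needed behind a load balancer.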

Example: Google is a good example of this method: it tracks the number of requests from a given IP address, warns that the address may be blocked, and prompts you to enter a CAPTCHA.

Tools: Some services (like distilnetworks.com) automate the tracking of suspicious activity on your site and can even check with a CAPTCHA that the visitor is a real user.

Bypassing: This protection can be bypassed by using multiple proxies to hide the scraper’s real IP address. Examples are BestProxyAndVPN, which provides affordable proxies, and SwitchProxy, which is more expensive but is specially designed for automated scrapers and withstands heavy loads. Another option is to use rotating proxy services.
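To show what rotating through a proxy pool means in practice, here is a small Python sketch. The addresses are placeholders from a documentation-only IP range, and `proxy_rotation` is a hypothetical helper name, not part of any library.

```python
from itertools import cycle

# Placeholder addresses from the documentation range (RFC 5737), not real proxies.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(proxies):
    """Yield a requests-style `proxies` mapping, cycling through the pool."""
    for address in cycle(proxies):
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXIES)
first = next(rotation)
# With the third-party `requests` library this mapping would be passed as
# requests.get(url, proxies=first, timeout=10).
```

Because each request leaves from a different address, a per-IP counter on the server sees only a fraction of the real request rate, which is why this method is usually combined with the others below.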

2. Using different accounts

With this protection method the data can be accessed by authorized users only. This makes it easier to monitor user behavior and to block suspicious accounts regardless of the IP address the client is working from.

Example: Facebook is a good example, as it constantly monitors user activity and blocks suspicious accounts.

Bypassing: This protection can be bypassed by creating a set of accounts, including automatically created ones. Certain services (like bulkaccounts.com) sell accounts on well-known social networks. Phone verification of accounts (so-called PVA, Phone Verified Accounts) makes automatic account creation considerably harder, although it can be bypassed using disposable SIM cards.

3. Usage of CAPTCHA

This is another popular way of protecting data from web scraping. The user is asked to type the CAPTCHA text to get access to the website. The significant disadvantage of this method is the inconvenience to regular users who are forced to enter CAPTCHAs. It is therefore mostly applicable to systems where data is accessed infrequently and on individual request.

Example: Services that check a website’s position in the SERP (e.g. http://smallseotools.com/keyword-position/) are a good example of using CAPTCHA to prevent automated querying.

Bypassing: CAPTCHA can be bypassed using CAPTCHA-recognition software and services. They fall into two main categories: automatic recognition without human labor (OCR, such as GSA Captcha Breaker) and recognition by human operators who process image-recognition requests online (for example, the Bypass CAPTCHA service). The human-based option is usually more effective, but you pay per CAPTCHA recognized, compared with a one-time payment when purchasing software.

4. Usage of complex JavaScript logic

In this case the browser sends a special code (or several codes) in its request to the server, and the codes are generated by complex logic written in JavaScript. The code is often obfuscated, and the logic is placed in one or more loadable JavaScript files.
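The server side of such a scheme can be sketched as follows. This is only an assumed, simplified design: the server embeds a one-time nonce in the page, the page’s obfuscated JavaScript is expected to derive a code from it, and the server recomputes the same value. The hashing rule and the function names are invented for illustration and do not describe how any particular site works.

```python
import hashlib
import hmac
import secrets


def issue_nonce():
    """Create a one-time value to embed in the served page."""
    return secrets.token_hex(16)


def expected_code(nonce, path):
    """The value the page's (obfuscated) JavaScript is supposed to compute."""
    return hashlib.sha256(f"{nonce}:{path}".encode()).hexdigest()


def is_valid(nonce, path, submitted_code):
    """Compare the submitted code with the expected one in constant time."""
    return hmac.compare_digest(
        expected_code(nonce, path).encode(), submitted_code.encode()
    )
```

A client that fetches the HTML without running the script never produces the code, so its follow-up requests fail `is_valid`. The protection rests entirely on how hard the JavaScript is to reverse-engineer, since the derivation rule itself is shipped to every visitor.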

Example: Facebook is a good example of this way of protection from web scraping.

Bypassing: It can be bypassed by scraping with a real browser (for example, one driven by Selenium; note that Mechanize-style libraries do not execute JavaScript, so they are not sufficient on their own). This method has an additional advantage for the site owner, though: the scraper will show up in website traffic analytics (e.g. Google Analytics) when executing JavaScript, which allows the webmaster to notice immediately that something is going on.

EDIT: As Us0r noted in the comments, this can also be bypassed.

5. Frequent update of the page structure

One of the most effective ways to protect a website against automated scraping is to change its structure frequently. This can apply not only to the names of HTML element identifiers and classes, but even to the entire hierarchy. It makes writing a scraper very complicated, although it also complicates the website code and, sometimes, the entire system.

On the other hand, these changes can be made manually once a month (or once every several months). Even that makes scrapers’ lives tough.
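One way to automate the renaming of classes and identifiers, sketched here under the assumption that templates and stylesheets are generated from the same mapping, is to derive the public names from a per-release salt. The helper names `rotated_class` and `class_map` are hypothetical.

```python
import hashlib


def rotated_class(logical_name, release_salt):
    """Derive an opaque class name that changes whenever the salt changes."""
    digest = hashlib.sha256(f"{release_salt}:{logical_name}".encode()).hexdigest()
    return "c" + digest[:8]  # leading letter keeps it a valid CSS identifier


def class_map(logical_names, release_salt):
    """Map the names used in templates to the names emitted in HTML and CSS."""
    return {name: rotated_class(name, release_salt) for name in logical_names}
```

Calling `class_map(["price", "title"], "2024-05")` at build time and switching the salt with each release changes every selector a scraper relies on, while the templates keep using the stable logical names.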

Bypassing: To bypass protection like this, a more flexible and “intelligent” scraper is required, or the scraper simply has to be corrected manually whenever these changes occur.
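A simple form of such flexibility is to look for the data by its shape rather than by its position in the markup. The sketch below uses only the Python standard library and assumes prices are written with a dollar sign; the pattern and the names `TextCollector` and `find_prices` are illustrative.

```python
import re
from html.parser import HTMLParser

# Assumed price format: "$19.99" or "$ 20"; adjust for the target site.
PRICE = re.compile(r"\$\s?\d+(?:\.\d{2})?")


class TextCollector(HTMLParser):
    """Collects visible text chunks, ignoring tags, ids and class names."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def find_prices(html):
    """Return every price-looking string found in the page text."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return [m.group(0) for chunk in collector.chunks for m in PRICE.finditer(chunk)]
```

Because nothing here depends on tag names or classes, the same function keeps working after the markup is reshuffled, at the cost of more false positives that have to be filtered afterwards.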

6. Limitation of the frequency of requests and downloadable data allowance

This makes scraping large amounts of data very slow and therefore impractical. At the same time, the restrictions must take the needs of ordinary users into account, so that they do not reduce the overall usability of the site.

Bypassing: It can be bypassed by accessing the website from different IP addresses or accounts (simulating multiple users).
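A download allowance can be sketched as a per-client daily byte budget. The class below is a minimal in-memory illustration; the name `DailyQuota`, the choice of bytes as the unit, and passing the day in as a string are all assumptions of this example.

```python
from collections import defaultdict


class DailyQuota:
    """Tracks how many bytes each client has been served per day."""

    def __init__(self, max_bytes_per_day):
        self.max_bytes_per_day = max_bytes_per_day
        self._used = defaultdict(int)  # (client_id, day) -> bytes served

    def try_consume(self, client_id, day, size):
        """Reserve `size` bytes for the client, or refuse if over budget."""
        key = (client_id, day)
        if self._used[key] + size > self.max_bytes_per_day:
            return False
        self._used[key] += size
        return True
```

The `client_id` could be an account name or an IP address, and `day` something like `datetime.date.today().isoformat()`. The budget should sit comfortably above what a heavy but legitimate user consumes.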

7. Mapping the important data as images

This method of content protection makes automatic data collection more complicated while keeping the content visible to ordinary users. Images often replace prices, e-mail addresses and phone numbers, and some websites even manage to replace random letters in the text. Although nothing prevents a website from displaying all of its content in graphic form (e.g. using Flash or HTML5), doing so can significantly hurt its indexability by search engines.

Drawbacks: The negative effects of this method are that not all of the content is indexed by search engines and that users cannot copy the data to the clipboard.

Bypassing: This protection is hard to bypass, as it requires some automatic or manual image recognition, similar to that used for CAPTCHAs.

You may find a more detailed and structured overview of anti-scraping measures in this post.

7 replies on “7 Ways to Protect Website from Scraping and How to Bypass this Protection”

“the scraper will show up in website traffic analytics (eg Google Analytics) when executing JavaScript, which allows webmaster immediately notice that something is going on.”

I fixed this by adding hosts entries to 127.0.0.1 or a server I control.

127.0.0.1 munchkin.marketo.net
127.0.0.1 stats.g.doubleclick.net
127.0.0.1 www.google-analytics.com
127.0.0.1 google-analytics.com
127.0.0.1 ssl.google-analytics.com

Yes, this is a way. Thanks.

Though the steps mentioned look good at first glance, I would strongly advise using a professional web security platform. It would be more pocket-friendly and less time-consuming, and the website would get far better security.

Here is a detailed analysis of why, which I have answered in depth on Quora.

[Full disclosure: I am the co-founder of InfiSecure, a web scraping and brute force protection platform]

Nice and instructive article;
However, if you don’t mind, it seems that some websites now use more advanced techniques, so the article should be updated.
For example, I’m now trying to make an Eastbay scraper for a client, and it seems that the site cannot be opened with either cURL or file_get_contents in PHP. I tried several User-Agents but still nothing: the site does not respond at all. At the same time, it opens perfectly well in the browser.
Any ideas ?
Thank you !

Hi Marc,
It seems to me this site is JS-stuffed to protect itself from intrusive visitors. So, I’d recommend the following posts:

That’s great. Also, if you are on nginx, you can use the geo module to block scraping sites from your website. I recently wrote an article about this; if possible, please check it out: https://nucuta.com/how-to-prevent-web-scraping-with-nginx/

Wow, I thought the webmasters are defenseless, we will see if it actually works.
