Here we come to the next anti-scrape tool, called ScrapeShield.
The ScrapeShield app was developed by CloudFlare to guard a site’s content. Its features are limited in number, but it’s still an interesting tool to look at for anyone interested in web scraping.
In a nutshell, the ScrapeShield app includes anti-scrape measures such as:
- content tracking
- Pinterest blocking
- email obfuscation
- hotlink protection
Let’s consider each of these features more closely, along with how they might be bypassed through scraping techniques.
Pointing your DNS to CloudFlare’s name servers
First of all, we need to set up our website on CloudFlare’s domain name servers. This allows CloudFlare to route traffic to your site through its network filters and stop bad requests before they reach your server. A byproduct of this is faster content delivery. CloudFlare now becomes your DNS provider, but you don’t have to change your current hosting provider or registrar.
To activate it, you have to change your site’s name servers to those of CloudFlare. The change can take between 1 and 48 hours to take effect. Then, from your account, you need to verify that you changed your name servers to the right location. You can find the path name by clicking ‘Details’ next to the particular website domain under protection.
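To double-check the name server switch from your own side, a small helper like the one below can be handy. It is only a sketch: it assumes that CloudFlare’s name servers are hosts under ns.cloudflare.com (check the exact names shown in your account), and the function name and sample records are made up for illustration. The records themselves can be copied from any NS lookup tool.

```python
def uses_cloudflare_ns(ns_records):
    """Return True if every NS record is a host under ns.cloudflare.com.

    Assumption: CloudFlare name servers look like <name>.ns.cloudflare.com.
    """
    # Normalise: drop whitespace, the trailing dot of absolute names, and case.
    hosts = [r.strip().rstrip(".").lower() for r in ns_records if r.strip()]
    return bool(hosts) and all(h.endswith(".ns.cloudflare.com") for h in hosts)


# Made-up records, as they might be copied from an NS lookup.
print(uses_cloudflare_ns(["amy.ns.cloudflare.com.", "bob.ns.cloudflare.com."]))
print(uses_cloudflare_ns(["ns1.oldhost.example.", "amy.ns.cloudflare.com."]))
```

The second call returns False because one record still points at the old provider, which is the typical state while the change is propagating.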
ScrapeShield Content Control
*On the free plan, though, you can set up only one domain.
Since web traffic is directed through the service’s DNS, ScrapeShield is able to monitor and sift requests and block suspicious web activity. It blocks attacks and malicious traffic of all kinds, including abusive scrapers looking to get your text, images and email addresses.
CloudFlare also leverages its globally distributed cloud servers to increase your site’s performance.
If duplicate content appears elsewhere, ScrapeShield alerts the site’s owner. This content tracking works automatically on every page of your website.
There is also an option to track specific HTML elements more closely, for example by inserting HTML markers like this one: <!--scrape_shield-->
ScrapeShield produces reports daily for free users, and up to every 15 minutes for Pro users.
Blocking image repinning, hiding emails and hotlink protection
The service can block people from pinning your images to Pinterest by inserting the meta tag:
<meta name="pinterest" content="nopin">
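From the scraper’s side, it is easy to see whether a page carries this tag. The sketch below uses only Python’s standard html.parser module; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class NoPinFinder(HTMLParser):
    """Remembers whether a <meta name="pinterest" content="nopin"> tag was seen."""

    def __init__(self):
        super().__init__()
        self.nopin = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (bare attributes), so default to "".
        a = {k.lower(): (v or "").lower() for k, v in attrs}
        if a.get("name") == "pinterest" and a.get("content") == "nopin":
            self.nopin = True


def page_blocks_pinning(html):
    finder = NoPinFinder()
    finder.feed(html)
    return finder.nopin


sample = '<html><head><meta name="pinterest" content="nopin"></head></html>'
print(page_blocks_pinning(sample))
```

Note that the tag only asks Pinterest not to pin; it does nothing to stop a scraper from downloading the image itself.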
You can also protect email addresses by obfuscating them so that malicious bots cannot read them.
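How strong is such obfuscation? This article does not go into the encoding itself, so the sketch below rests on an assumption: a scheme commonly reported for CloudFlare-protected pages, where the address is stored as a hex string whose first byte is a key and every following byte is the original byte XOR-ed with that key. Both function names are made up, and the encoder exists only so the example is self-contained.

```python
def xor_hex_encode(email, key=0x42):
    """Encode an address as hex: key byte first, then each byte XOR key.

    Assumption: this mirrors the commonly reported obfuscation scheme.
    """
    data = email.encode("utf-8")
    return bytes([key] + [b ^ key for b in data]).hex()


def xor_hex_decode(encoded):
    """Reverse the encoding: read the key from the first byte, XOR the rest."""
    data = bytes.fromhex(encoded)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


hidden = xor_hex_encode("info@example.com")
print(hidden)
print(xor_hex_decode(hidden))
```

If the scheme really is this simple, it stops only bots that look for plain addresses in the HTML. A scraper written for this purpose can reverse it in a few lines.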
Another possibility: if you do not want other sites to reference your resources, you can protect your links. To do this, turn the “Hotlink protection” option on.
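Hotlink protection in general is usually built on the Referer header of the request. The sketch below shows that general idea only; it is not CloudFlare’s implementation, and the function name and host names are made up for illustration.

```python
from urllib.parse import urlparse


def is_hotlink(referer, own_host):
    """Return True if the request was referred by a page on a foreign host."""
    if not referer:
        # No Referer at all: a direct visit or a stripped header, so allow it.
        return False
    host = (urlparse(referer).hostname or "").lower()
    own = own_host.lower()
    # Same host or one of its subdomains counts as our own site.
    return not (host == own or host.endswith("." + own))


print(is_hotlink("https://other.example/gallery", "mysite.example"))
print(is_hotlink("https://www.mysite.example/post", "mysite.example"))
print(is_hotlink(None, "mysite.example"))
```

A check like this also shows the weak spot: a scraper that sends no Referer, or one pointing at the protected site itself, is not treated as a hotlink.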
The security level is adjustable from high to low:
- the high security level imposes a CAPTCHA on any visitor that has been witnessed engaging in malicious behavior on other websites.
- the medium security level imposes a CAPTCHA on visitors that have been witnessed frequently engaging in malicious behavior on other websites. This level still provides a strong defence while trying to avoid false positives.
- the low security level challenges site visitors (again by CAPTCHA) only if they regularly and repeatedly engage in spam, hacking or DoS attacks on other sites.
- the “essentially off” level acts only on the most severe abusers; it is best if you only want the performance gain from CDN usage.
The service firewall works by dynamically throttling bandwidth and speed when a malicious bot is detected; the connection to the scraper is held open so that its resources are tied up.
Learning from other abusive scraper bots
CloudFlare learns from abuse targeted against one site to raise the protection level for all sites under CloudFlare’s ScrapeShield guard.
How to bypass these anti-scrape and anti-referencing measures?
Ways to get around ScrapeShield protection
Since the anti-scrape protection works at the domain name system level, I’d use all caution not to be detected. The system checks whether a visitor (bot) has also been detected on other websites in the CloudFlare community. So, if you don’t get caught, you won’t get blocked: keep behaving like a real web visitor. The best measure is to route your scraper through proxies and regularly change the set of proxy IP addresses, so that a block on some addresses does not stop you. As for the blocking of picture repinning, you can’t do much, unless you download the picture itself and repost it.
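The proxy rotation described above might look like the sketch below, which uses only Python’s standard library. The proxy addresses are placeholders and the function name is made up; a real scraper would also add delays and error handling.

```python
import itertools
import urllib.request


def make_proxy_rotator(proxies):
    """Return a function that yields (proxy, opener) pairs in round-robin order."""
    pool = itertools.cycle(proxies)

    def next_opener():
        proxy = next(pool)
        # Route both plain and TLS requests through the chosen proxy.
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)

    return next_opener


# Placeholder addresses; replace them with proxies you are allowed to use.
rotate = make_proxy_rotator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
for _ in range(3):
    proxy, opener = rotate()
    print("next request would go through", proxy)
    # opener.open("http://example.com/", timeout=10)  # the actual fetch
```

Swapping the list passed to make_proxy_rotator from time to time gives the regular change of proxy set mentioned above.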
CloudFlare’s ScrapeShield is a good example of anti-scrape-bot features that can be applied to protect a site’s content. You deploy ScrapeShield by changing the site’s current authoritative name servers to CloudFlare’s domain name servers. This gives a limited scrape defence (content tracking across the web, Pinterest blocking, email obfuscation, hotlink protection). Do note, though, that if your site has a large traffic volume, it’ll be hard for ScrapeShield to recognize whether a site visitor is a high-frequency scraping bot or not.
As far as content protection goes, the service can check whether your own content appears on other sites on the web (which could benefit publishers), yet it’s not effective if someone steals your proprietary info, for example prices or other data, and stores it in a private database.