We recently ran into web scraping detection issues in some of our projects. After consulting with the Sequentum developers, we put together the following notes on how websites detect web scraping and how Visual Web Ripper can help deal with the problem.
1. Detection by IP Address
If a website tries to detect web scraping activity, it will nearly always try to use your IP address. The website will simply count the number of requests from a single IP address and block your IP address or display a CAPTCHA if you are requesting too many webpages from a single IP address. Some websites may take it a step further and look at the frequency of requests, since web scrapers will normally request pages much faster than a human.
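To make the counting concrete, here is a minimal sketch of how a website might track requests per IP address over a sliding time window. The class name `RequestCounter` and its parameters are hypothetical, chosen for illustration; real sites may use quite different logic.

```python
from collections import defaultdict, deque


class RequestCounter:
    """Flags an IP address that exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip: str, now: float) -> bool:
        """Record a request at time `now` and report whether the IP is over the limit."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

A site using something like this would call `is_suspicious(ip, time.time())` on each request and respond with a block or a CAPTCHA once it returns `True`. Because the window slides, the same structure also captures request frequency: a burst of fast requests trips the limit, while the same number spread over a long period does not.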
The best way to prevent detection by such websites is to use a proxy rotation service. Such a service has a large pool of IP addresses and rotates them every time you request a webpage, so the target website sees very few requests from any one IP address, and those requests are separated by long delays.
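A rotation service does this work on its own servers, but the idea can be sketched on the client side in a few lines. The `ProxyRotator` class below is a hypothetical illustration, not part of Visual Web Ripper or any proxy service, and the addresses in the usage note are placeholders from a documentation-only range.

```python
import random


class ProxyRotator:
    """Picks a proxy for each request, avoiding the one used immediately before."""

    def __init__(self, proxies, seed=None):
        if len(set(proxies)) < 2:
            raise ValueError("rotation needs at least two distinct proxies")
        self._proxies = list(proxies)
        self._rng = random.Random(seed)
        self._last = None

    def next_proxy(self) -> str:
        """Return a randomly chosen proxy that differs from the previous pick."""
        candidates = [p for p in self._proxies if p != self._last]
        self._last = self._rng.choice(candidates)
        return self._last
```

For example, `ProxyRotator(["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"])` would hand out a different address on each consecutive call to `next_proxy()`. The larger the pool, the fewer requests the target website sees from each address.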
We offer a FREE account at Private Proxy Switch for our customers and trial users to use with Visual Web Ripper. Private Proxy Switch is a high-performance proxy server with a large pool of elite anonymous proxies. A new IP address is randomly assigned to you each time you request a new webpage, which makes it very difficult to identify you or to detect and block web scraping activity. The free account is restricted to 500MB per month.
Visual Web Ripper also offers semi-automatic or fully automatic processing of CAPTCHA pages. Fully automatic processing of CAPTCHA requires a subscription to a CAPTCHA recognition service such as Decaptcher.
2. Detection by Request Signatures
Most advanced web scrapers use an embedded version of a normal web browser to extract data, so the request signatures will be exactly the same as when using a normal version of the web browser, and it will therefore be very difficult to block access based on the request signatures.
You can use software such as Fiddler to monitor all web requests in order to confirm the web requests are in fact the same.
When Visual Web Ripper runs in Web Crawler mode, it does not use an embedded version of IE and works like a simple web scraper, so in this case the request signatures are likely to differ from those of IE. Websites can easily block such web scraping sessions by delivering some information asynchronously, for example by loading it with JavaScript after the initial page has arrived, which a scraper without a browser engine never executes.
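As a rough illustration of what comparing request signatures can mean in practice, the sketch below diffs two sets of HTTP request headers, such as those captured with Fiddler from a normal browser session and from a scraper session. The function name `signature_diff` is hypothetical, and headers are only one part of a request signature.

```python
def signature_diff(browser_headers: dict, scraper_headers: dict) -> dict:
    """Map each differing header name to a (browser value, scraper value) pair."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    browser = {k.lower(): v for k, v in browser_headers.items()}
    scraper = {k.lower(): v for k, v in scraper_headers.items()}
    diff = {}
    for name in sorted(browser.keys() | scraper.keys()):
        b, s = browser.get(name), scraper.get(name)
        if b != s:
            # None marks a header that one side did not send at all.
            diff[name] = (b, s)
    return diff
```

An empty result means the two sets of headers match. Any entry in the result is a difference that a website could, in principle, use to tell the scraper apart from a normal browser.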
3. Scraping Data from Restricted Areas of a Website
A website may restrict some areas by a login or by only allowing specific IP addresses to access the area. If you need to extract data from such a website you need to be very careful if you don’t want your activity to be detected.
Obviously, you cannot rely on a proxy rotation service here: the website can simply count page requests per login instead of per IP address, and if access is restricted by IP address you may not get in at all.
First of all, you need to accept that you cannot process more pages per day than a human could reasonably process manually. You may still want to use a web scraping tool, since it would be very tedious to process all these pages manually even if it were reasonably possible.
The following recommendations are for users who are very paranoid about being detected when scraping data from restricted areas of a website. Most websites will not have advanced techniques in place to detect web scraping, except for counting the number of page requests per user.
1. Always use a web scraping tool that uses an embedded web browser, so the request signatures are the same as a normal web browser. In Visual Web Ripper you need to process the website in Web Browser mode.
2. Set a random time delay between page requests, so the requests are consistent with how a normal human browses a website. Visual Web Ripper allows you to set a random time delay with a minimum and maximum delay.
3. Do not schedule your data extraction project to start at a fixed time every day.
4. Consider building multiple versions of your data extraction project that each extract a separate chunk of data, and then running the projects at random times during the day.
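Visual Web Ripper handles delays and scheduling through its own settings, but for readers who script their own runs, recommendations 2 to 4 can be sketched as follows. The function names `random_delay` and `random_start_times` are hypothetical and are not part of any Visual Web Ripper API.

```python
import random
from datetime import datetime, timedelta


def random_delay(min_seconds: float, max_seconds: float, rng=random) -> float:
    """Pick a delay, in seconds, uniformly between the two bounds."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return rng.uniform(min_seconds, max_seconds)


def random_start_times(day: datetime, runs: int, rng=random) -> list:
    """Pick `runs` start times at random within the calendar day of `day`."""
    midnight = day.replace(hour=0, minute=0, second=0, microsecond=0)
    # One random second-of-day per run, sorted so the runs are in time order.
    offsets = sorted(rng.randrange(24 * 60 * 60) for _ in range(runs))
    return [midnight + timedelta(seconds=s) for s in offsets]
```

Between page requests a script could call `time.sleep(random_delay(5, 20))`, and each morning it could call `random_start_times(datetime.now(), 3)` to decide when that day's three partial projects should run. The bounds of 5 and 20 seconds are only an example; pick values that match how a person would actually browse the site in question.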