Recently we’ve ancounted a highly protected site — govets.com. Since the number of target brand items of the site was not big (under 3K) so I desided to get target data using the handy tools for a fast manual scrape.
Tag: scrape protection
Amazon scrape tip
Recently we’ve met requirements to scrape Amazon data in big quantities. So, first of all I’ve tested the data aggregator for being bot-proof or anti-bot protection. For that I used the Discord server Scraping Enthusiasts, namely Anti-bot channel.
Since Amazon is a hige data aggregator we recommend readers to get acquainted with the post Tips & Tricks for Scraping Business Directories.
Today, I’ll share of a Dicord server 1 and server 2 that accomodate a bot able to detect multiple modern scrape-protection and scrape-detection means. The server’s channels with the bot are #antibot-test
and #antibot-scan
respectively
Bot protected websites
In the post we summarize how to detect the headless Chrome browser and how to bypass the detection. The headless browser testing should be a very important part of todays web 2.0. If we look at some of the site’s JS, we find them to checking on many fields of a browser. They are similar to those collected by fingerprintjs2.
So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.
See the test results of disguising the browser automation for both Selenium and Puppeteer extra.
Given: a webpage to scrape.
If you inspect the DOM tree of that page you will find that quite a few tags are having the keyword dist. As an example:
<link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
<link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
Imperva (that includes the former Distil anti-bot management) is a service providing many kinds of website protections. The present Imperva services include the following ones:
- Cloud Web Application Firewall (WAF)
- Bot Protection service (formerly Distil Networks)
- IP Reputation Intelligence
- Content Delivery Network (CDN)
- Attack Analytics solution (eg. DDoS)
As to the protection of the bot scraping activities we mention the following.
- Online marketplaces
In the marketplaces people offer their products for sale. Similar to garage sales, but online. (eg. eCrater, www.1188.no).
Easy to scrape since they are usually free and do not tend to protect their data. - Business directories
The usually huge online directories targeted at the general audience. (eg. Yellow Pages). They do protect their data to avoid duplication and loss of audience. See some posts on this.
Bypass Distil
The Distil scrape protection is a prominent one in the modern anti-scrape techniques. So, now we want to share with you some tips of how to bypass it. If you are interested, please make an inquiry to the following email: igor[dot]savinkin[at]gmail[dot]com