Amazon scrape tip

Recently we were asked to scrape Amazon data in large quantities. So, first of all, I tested the data aggregator for anti-bot protection. For that I used the Discord server Scraping Enthusiasts, namely its Anti-bot channel.

Since Amazon is a huge data aggregator, we recommend that readers get acquainted with the post Tips & Tricks for Scraping Business Directories.

Challenge Development

Node.js & Privacy Pass application for Cloudflare scrape solution

Over 7.59 million websites use Cloudflare protection, and 26% of them are among the top 100K websites worldwide. As Cloudflare establishes itself as the norm for service protection, chances are that the site you want to scrape is more likely than not to use it.

When it comes to scraping websites, captchas and other types of protection have always been the main obstacle to providing reliable data collection solutions. Most often this leads to considering bypass services, which aren’t always free.

Undetected ChromeDriver in Python Selenium

Selenium comes with a default WebDriver that often fails to get past anti-bot systems. Yet you can complement it with Undetected ChromeDriver, a third-party WebDriver tool that does a better job.

In this tutorial, you’ll learn how to use Undetected ChromeDriver with Selenium in Python and solve the most common errors.
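
A minimal sketch of the setup described here, assuming the third-party package is installed with `pip install undetected-chromedriver` and a local Chrome browser is available. The function name `fetch_page_title` is illustrative and does not come from the tutorial.

```python
def fetch_page_title(url: str) -> str:
    """Open `url` with Undetected ChromeDriver and return the page title."""
    # Imported lazily so this sketch can be loaded even where the
    # third-party package is not installed.
    import undetected_chromedriver as uc

    driver = uc.Chrome()  # drop-in replacement for selenium's webdriver.Chrome
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always release the browser process
```

Calling `fetch_page_title("https://example.com")` launches a real browser window, so it only works on a machine with Chrome installed; the URL is a placeholder.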

How to bypass PerimeterX

You’ve found the website you need to scrape, set up your scraper and fired it up, only to sadly realize PerimeterX has blocked you.

PerimeterX’s dynamically complex bot detection system relies on server-side and client-side checks to distinguish humans from bots. It deploys several layers of protection and, for the most part, manages to do its job without interrupting the user experience.

But don’t fall into despair! There are a couple of things you can try to bypass PerimeterX (called HUMAN now) before giving up on your goal of scraping that delicious data.

Discord Bot to detect on-site anti-scrape & scrape-proof tools

Today, I’ll share two Discord servers, server 1 and server 2, that host a bot able to detect multiple modern scrape-protection and scrape-detection tools. The servers’ channels with the bot are #antibot-test and #antibot-scan, respectively.


Bot protected websites

We share here some bot-protected sites.

Website: govets.com
Protection tool: Recaptcha, CloudFlare
Notes: The following anti-bots got detected:
  • Headers: cf-chl-gen, cf-ray, cf-mitigated
  • Server header: cloudflare
  • Script loaded: recaptcha/api.js
  • JavaScript Properties: window.grecaptcha, window.recaptcha
Detected on 1 urls: 22.02.2024 at 8:26 PM EET

Protection tool: PerimeterX, Imperva
Notes: The following anti-bots got detected:
  • Recaptcha -- JavaScript Properties: window.grecaptcha, window.recaptcha
  • PerimeterX -- Script loaded: init.js
Detected on 1 urls: 22.02.2024 at 7:15 PM EET (detection tool)

Website: app.impact.com
Protection tool: CloudFlare ?
Notes: No title. ⚠ Error ⚠ Network related / Timeout / Bad status code.
23.02.2024 at 12:15 PM

Website: zoominfo.com
Protection tool: CloudFlare, PerimeterX
Notes: The following anti-bots got detected:
  • Headers: cf-chl-gen, cf-ray, cf-mitigated
  • Server header: cloudflare
  • Cookies: _px3, _pxhd, _px_vid
23.02.2024 at 12:39 PM
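
The signals in the bot reports above (response headers, cookies, loaded scripts) can be checked with a few lines of code. The following is a rough sketch, not the Discord bot's actual logic: the function name `detect_antibots` and its parameters are assumptions, and the signal lists contain only the values quoted in the reports.

```python
from typing import Dict, Iterable, List

# Signals copied from the bot reports above; real deployments may differ.
CLOUDFLARE_HEADERS = {"cf-chl-gen", "cf-ray", "cf-mitigated"}
PERIMETERX_COOKIES = {"_px3", "_pxhd", "_px_vid"}
RECAPTCHA_SCRIPT = "recaptcha/api.js"


def detect_antibots(headers: Dict[str, str],
                    cookies: Iterable[str],
                    script_urls: Iterable[str]) -> List[str]:
    """Return the names of anti-bot tools hinted at by the given signals."""
    found = []
    # Header names are case-insensitive, so normalize before comparing.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # CloudFlare: any cf-* header from the list, or "Server: cloudflare".
    if CLOUDFLARE_HEADERS.intersection(lowered) or lowered.get("server") == "cloudflare":
        found.append("CloudFlare")

    # PerimeterX: any of its _px* cookies is set.
    if PERIMETERX_COOKIES.intersection(cookies):
        found.append("PerimeterX")

    # Recaptcha: the API script is loaded by the page.
    if any(RECAPTCHA_SCRIPT in url for url in script_urls):
        found.append("Recaptcha")

    return found
```

The inputs would come from whatever HTTP client or browser automation you already use; a match is only a hint that the tool is present, not proof that it will block a scraper.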

Bypass GoDaddy Firewall thru VPN & browser automation

Recently we encountered a website that worked as usual, yet once we composed and ran a scraping script/agent against it, it put up blocking measures.

In this post we’ll take a look at how the scraping process went and the measures we took to overcome that.


Headless Chrome detection and anti-detection

In this post we summarize how to detect the headless Chrome browser and how to bypass that detection. Headless browser testing is a very important part of today’s web 2.0. If we look at some sites’ JS, we find that it checks many fields of a browser. They are similar to those collected by fingerprintjs2.

So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.

See the test results of disguising the browser automation for both Selenium and Puppeteer extra.
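
As a rough illustration of attribute-based detection, the sketch below evaluates a fingerprint that a page script has already collected and sent to the server. The key names (`userAgent`, `webdriver`, `pluginsLength`, `languages`) and the function `headless_signals` are assumptions made for this example; the checks are commonly cited ones and are not necessarily the set the post covers.

```python
from typing import Any, Dict, List


def headless_signals(fingerprint: Dict[str, Any]) -> List[str]:
    """List the collected attributes that look like headless Chrome."""
    signals = []

    # Headless Chrome announces itself in the default user agent string.
    if "HeadlessChrome" in fingerprint.get("userAgent", ""):
        signals.append("userAgent")

    # navigator.webdriver is true when the browser is driven by automation.
    if fingerprint.get("webdriver") is True:
        signals.append("webdriver")

    # An empty plugin list is unusual for a desktop Chrome profile.
    if fingerprint.get("pluginsLength") == 0:
        signals.append("pluginsLength")

    # An empty navigator.languages value is another commonly checked field.
    if "languages" in fingerprint and not fingerprint["languages"]:
        signals.append("languages")

    return signals
```

Spoofing works against exactly this kind of check: once the automated browser reports a regular user agent, `webdriver` as false, a non-empty plugin list and real languages, the function returns an empty list.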


How to find out that a website is Distil protected?

Given: a webpage to scrape.
If you inspect the DOM tree of that page, you will find that quite a few tags contain the keyword dist. As an example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
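
The inspection described above can be automated. This is a minimal sketch using only the standard library; the names `DistAttributeFinder` and `find_dist_tags` are hypothetical, and matching `dist` as a whole path segment in `href`/`src` values is an assumption about what counts as a hit.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class DistAttributeFinder(HTMLParser):
    """Collect (tag, attribute, value) triples whose URL has a 'dist' path segment."""

    def __init__(self) -> None:
        super().__init__()
        self.hits: List[Tuple[str, str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only look at URL-carrying attributes; skip valueless attributes.
            if name in ("href", "src") and value and "dist" in value.split("/"):
                self.hits.append((tag, name, value))


def find_dist_tags(html: str) -> List[Tuple[str, str, str]]:
    """Return every tag/attribute pair in `html` that references a dist path."""
    finder = DistAttributeFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Because `dist` is also a common name for a build-output directory, a hit is best treated as a hint to investigate further rather than a confirmation.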

How Imperva protects against scraping bots

Imperva (which includes the former Distil anti-bot management) is a service providing many kinds of website protection. The current Imperva services include the following:

  1. Cloud Web Application Firewall (WAF)
  2. Bot Protection service (formerly Distil Networks)
  3. IP Reputation Intelligence
  4. Content Delivery Network (CDN)
  5. Attack Analytics solution (e.g. DDoS)

As for protection against bot scraping activities, we mention the following.