Categories
Development

Amazon scrape tip

Recently we needed to scrape Amazon data in large quantities. So, first of all, I tested how well this data aggregator is protected against bots. For that I used the Anti-bot channel of the Scraping Enthusiasts Discord server.

Since Amazon is a huge data aggregator, we recommend that readers get acquainted with the post Tips & Tricks for Scraping Business Directories.

Categories
Challenge Development

Discord Bot to detect on-site anti-scrape & scrape-proof tools

Today I’ll share Discord server 1 and server 2, which accommodate a bot able to detect multiple modern scrape-protection and scrape-detection tools. The channels with the bot are #antibot-test and #antibot-scan respectively.

Categories
Challenge

Bot protected websites

We share here some bot-protected sites.

Website: badmintonhub.in
Protection tool: Cloudflare

Website: govets.com
Protection tool: Recaptcha, Cloudflare
Notes: The following anti-bots got detected:

Cloudflare
  • Headers: cf-chl-gen, cf-ray, cf-mitigated
  • Server header: cloudflare

https://www.govets.com/
https://www.govets.com/static/version1708067196/_cache/merged/ba2734da0d740dd8fa764a0ea52576d8.min.js
https://www.govets.com/static/version1708067196/frontend/Amasty/JetTheme/en_US/Magento_Theme/js/utils/svg-sprite.min.js
https://www.govets.com/static/version1708067196/frontend/Amasty/JetTheme/en_US/Magento_PageCache/js/form-key-provider.min.js
https://www.govets.com/static/version1708067196/frontend/Amasty/JetTheme/en_US/js/bundle/cms.min.js
https://www.govets.com/static/version1708067196/frontend/Amasty/JetTheme/en_US/Magento_ReCaptchaWebapiUi/js/jquery-mixin.min.js
https://www.govets.com/static/version1695119800/frontend/Amasty/JetTheme/en_US/Tapita_Tpbuilder/js/simi-pagebuilder-react@1.4.0.umd.min.js
https://static.zdassets.com/ekr/snippet.js?key=a51665c3-83da-4d85-9319-2953bf16b020

Recaptcha
  • Script loaded: recaptcha/api.js
  • JavaScript Properties: window.grecaptcha, window.recaptcha

Detected on 1 urls:
https://www.google.com/recaptcha/api.js?onload=globalOnRecaptchaOnLoadCallback&render=explicit
22.02.2024 at 8:26 PM EET

Website: ticketmaster.co.uk
Protection tool: Recaptcha, PerimeterX, Imperva
Notes: The following anti-bots got detected:

Recaptcha
  • JavaScript Properties: window.grecaptcha, window.recaptcha

PerimeterX
  • Script loaded: init.js

Detected on 1 urls:
https://epsf.ticketmaster.co.uk/asset/iamNotaRobot.js
22.02.2024 at 7:15 PM EET
(detection tool)

Website: app.impact.com
Protection tool: Cloudflare

Website: lowes.com
Protection tool: ?
Notes: No title
https://www.lowes.com/
⚠ Error ⚠
Network related / Timeout / Bad status code.
23.02.2024 at 12:15 PM

Website: zoominfo.com
Protection tool: Cloudflare, PerimeterX
Notes: The following anti-bots got detected:

Cloudflare
  • Headers: cf-chl-gen, cf-ray, cf-mitigated
  • Server header: cloudflare

PerimeterX
  • Cookies: _px3, _pxhd, _px_vid
23.02.2024 at 12:39 PM
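
The reports above name concrete signatures: Cloudflare response headers, PerimeterX cookies and the Recaptcha script path. As a minimal sketch, assuming you already hold a response's headers, cookie names and HTML, those signatures could be matched like this. The function and constant names are hypothetical, and this is not how the Discord bot itself works, only an illustration of the same signals.

```python
from typing import Dict, Iterable, List

# Signatures copied from the bot reports above.
# All names below are hypothetical and chosen for illustration.
CLOUDFLARE_HEADERS = {"cf-chl-gen", "cf-ray", "cf-mitigated"}
PERIMETERX_COOKIES = {"_px3", "_pxhd", "_px_vid"}
RECAPTCHA_SCRIPT = "recaptcha/api.js"


def detect_antibots(headers: Dict[str, str],
                    cookie_names: Iterable[str] = (),
                    html: str = "") -> List[str]:
    """Return the protections whose signatures appear in one response."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    found = []
    if (CLOUDFLARE_HEADERS.intersection(lowered)
            or lowered.get("server", "").lower() == "cloudflare"):
        found.append("Cloudflare")
    if PERIMETERX_COOKIES.intersection(cookie_names):
        found.append("PerimeterX")
    if RECAPTCHA_SCRIPT in html:
        found.append("Recaptcha")
    return found
```

For example, detect_antibots({"Server": "cloudflare"}) reports Cloudflare. A site that answers with none of these signatures may still be protected, so an empty result is not proof of an unprotected site.
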
Categories
Development

Headless Chrome detection and anti-detection

In this post we summarize how to detect the headless Chrome browser and how to bypass that detection. Headless browser testing is an important part of today’s web 2.0. If we look at the JS of some sites, we find it checking many fields of the browser. Those fields are similar to the ones collected by fingerprintjs2.

So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.
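
To make the idea concrete, here is a minimal sketch of attribute-based detection. It assumes the site has already collected a few browser fields into a dictionary; the key names and the function name are hypothetical. The four checks are commonly cited signals (the default headless Chrome user agent contains "HeadlessChrome", and navigator.webdriver is true under automation) and are not necessarily the exact list covered in the post.

```python
from typing import Any, Dict, List


def headless_signals(fp: Dict[str, Any]) -> List[str]:
    """List the reasons a collected fingerprint looks like headless Chrome."""
    reasons = []
    # The default headless Chrome user agent advertises itself.
    if "HeadlessChrome" in fp.get("userAgent", ""):
        reasons.append("userAgent contains HeadlessChrome")
    # navigator.webdriver is true when the browser is automation-controlled.
    if fp.get("webdriver") is True:
        reasons.append("navigator.webdriver is true")
    # Only judge the fields that were actually collected.
    if fp.get("pluginsLength") == 0:
        reasons.append("navigator.plugins is empty")
    if "languages" in fp and not fp["languages"]:
        reasons.append("navigator.languages is empty")
    return reasons
```

Bypassing such detection amounts to making every one of these fields return an ordinary value, so that the list of reasons comes back empty.
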

See the test results of disguising the browser automation for both Selenium and Puppeteer extra.

Categories
Development

How to find out that a website is Distil protected?

Given: a webpage to scrape.
If you inspect the DOM tree of that page, you will find that quite a few tags contain the keyword dist. As an example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
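
The inspection described above can be sketched with Python's standard html.parser module. This is a minimal sketch: the class and function names are hypothetical, and it only looks at href and src attributes whose path has a "dist" segment, as in the two examples. Since "dist" is also a common name for a build-output folder, treat a match as a hint rather than proof.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class DistPathFinder(HTMLParser):
    """Collect (tag, attribute, value) triples whose URL has a 'dist' segment."""

    def __init__(self) -> None:
        super().__init__()
        self.hits: List[Tuple[str, str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Match 'dist' as a whole path segment, not as a substring.
            if name in ("href", "src") and value and "dist" in value.split("/"):
                self.hits.append((tag, name, value))


def find_dist_tags(html: str) -> List[Tuple[str, str, str]]:
    """Return every href/src attribute that points into a 'dist' folder."""
    finder = DistPathFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Feeding the page source to find_dist_tags returns one triple per matching tag, so a long list on an otherwise ordinary page is the pattern the text describes.
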
Categories
Challenge

How Imperva protects against scraping bots

Imperva (which includes the former Distil anti-bot management) is a service providing many kinds of website protection. The present Imperva services include the following:

  1. Cloud Web Application Firewall (WAF)
  2. Bot Protection service (formerly Distil Networks)
  3. IP Reputation Intelligence
  4. Content Delivery Network (CDN)
  5. Attack Analytics solution (e.g. DDoS)

As for protection against bot scraping, we mention the following.

Categories
Challenge

Most popular web scraping targets and how to scrape them

  1. Online marketplaces
    In marketplaces people offer their products for sale, similar to garage sales but online (e.g. eCrater, www.1188.no).
    They are easy to scrape since they are usually free and do not tend to protect their data.
  2. Business directories
    These are usually huge online directories targeted at a general audience (e.g. Yellow Pages). They do protect their data to avoid duplication and loss of audience. See some posts on this.

Categories
Challenge Development

Scraping a Javascript-dependent website with puppeteer

Support us by purchasing the book (under $5) on this topic.

In today’s web 2.0 many business websites use JavaScript to protect their content from web scraping or other undesired bot visits. In this article we share both the theory and the practice of scraping JS-dependent/JS-protected websites.

Categories
Development

Bypass Distil

Distil scrape protection is prominent among modern anti-scrape techniques, so we want to share some tips on how to bypass it. If you are interested, please send an inquiry to the following email: igor[dot]savinkin[at]gmail[dot]com


Categories
Uncategorized

Distil: Scrape Bot Protection Test

Testing an anti-scrape bot service has been my focus for some time now. How well can the Distil service protect a real website from scraping? The only answer comes from an actual, active scrape. Here I will share the log results and the conclusion of the test. In the previous post we briefly reviewed the service’s features, and now I will do a live test-drive analysis.