Bot protected websites

We share here some bot-protected sites.

WebsiteProtection tool Notes
govets.comRecaptcha , CloudFlareThe following anti-bots got detected:
Headers: cf-chl-gen, cf-ray, cf-mitigated
Server header: cloudflare

Script loaded: recaptcha/api.js
JavaScript Properties: window.grecaptcha, window.recaptcha

Detected on 1 urls:
22.02.2024 at 8:26 PM EET, PerimeterX, ImpervaThe following anti-bots got detected: Recaptcha
JavaScript Properties: window.grecaptcha, window.recaptcha
PerimeterX -- Script loaded: init.js
Detected on 1 urls:
22.02.2024 at 7:15 PM EET
(detectinon tool)

app.impact.comCloudFlare ?No title
⚠ Error ⚠
Network related / Timeout / Bad status code.

23.02.2024 at 12:15 PM
zoominfo.comCloudFlare, PerimeterX
The following anti-bots got detected:

  • Headers: cf-chl-gen, cf-ray, cf-mitigated
  • Server header: cloudflare

  • Cookies: _px3, _pxhd, _px_vid
23.02.2024 at 12:39 PM
Challenge Development

AirTable scrape challenge

I need to get info from AirTable, see a table example.

The problem is that data are loaded highly dynamically. Html contains only the information that you currently see on the browser screen.

enter image description here

If there are a lot of records then it is difficult to collect such a table. One of the possible ways is to calculate the size of the screen and rows in the table. Then using the browser automation and use to make a script that will scroll through it bit by bit and collect data.

Is there any other feasible way to get data of a table? For example there is a HTTP requests coding way to get dynamic data.

JS infinite scroll does not work for AirTable either.

Please comment down here if having some tips , hints.

Challenge Development

Yelp scraping for high quality B2B leads

Recently we’ve performed the Yelp business directory scrape for acquiring high quality B2B leads (company + CEO info). This forced us to apply many techniques like proxying, external company site scrape, email verification and more.

Challenge Development

Bypass GoDaddy Firewall thru VPN & browser automation

Recently we encountered a website that worked as usual, yet when composing and running scraping script/agent it has put up blocking measures.

In this post we’ll take a look at how the scraping process went and the measures we performed to overcome that.

Challenge SaaS

Web Scraper IDE to scrape tough websites

Recently we encountered a new powerful scraping service called Web Scraper IDE [of Bright Data]. The life-test and thorough drill-in are coming soon. Yet now we want to highlight its main features that has badly (in positive sense, strongly) impressed us.

Challenge Data Mining

Finding maximum likelihood estimate for the Bernoulli distribution parameter

“Out of the 15 bank customers to whom the manager offered to connect autopayments, four agreed. Service activation is a binary feature that can be described by the Bernoulli distribution.”.

Let’s find the maximum likelihood estimate for the parameter p out of such a sample.

1) Likelihood function:

L(Xn, p) = ∏ p[Xi=1]*(1−p)[Xi=0] = p^4 * (1-p)^11

2) We find the maximum likelihood estimate for the parameter p.
We logarithm L(Xn, p) and get the following:

ln(p^4 * (1-p)^11) = 4*ln(p) + 11*ln(1-p)

3) Now we take its derivative and equate it to zero to find p.
[4ln(p) + 11ln(1-p)]` = 4 (ln(p))` + 11 (ln(1-p))` = 4/p + 11/(1-p) * (-1) = 0
Following: 4/p = 11/(1-p) => 4(1-p) = 11p => 15p = 4 => p = 4/15 =~ 0.26667.

Challenge Data Mining

Linear regression in example: overfitting and regularization

In the post we will set up a linear model to predict the number of bike rentals depending on the calendar characteristics of the day and weather conditions. We will choose the weights of the features so that to catch all the linear dependencies in the data and at the same time do not take into account extra features. This way the model will not overfit and will make fairly accurate predictions on new data.

We’ll also interpret the found linear dependencies. That means we check whether the discovered pattern corresponds to common sense. The main purpose of the task is to show and explain by example what causes overfitting and how to overcome it.

The code as an IPython notebook

Challenge Development

Human-operated and automated Browser Fingerprints testing and needed parameters

In a previous post we’ve considered the ways to disguise an automated Chrome browser by spoofing some of its parameters – Headless Chrome detection and anti-detection. Here we’ll share the practical results of Fingerprints testing against a benchmark for both human-operated and automated Chrome browsers.


How Imperva protects against scraping bots

Imperva (that includes the former Distil anti-bot management) is a service providing many kinds of website protections. The present Imperva services include the following ones:

  1. Cloud Web Application Firewall (WAF)
  2. Bot Protection service (formerly Distil Networks)
  3. IP Reputation Intelligence
  4. Content Delivery Network (CDN)
  5. Attack Analytics solution (eg. DDoS)

As to the protection of the bot scraping activities we mention the following.

Challenge Development

Business directory simple scraper (python) at pythonanywhere

business directoryMy goal was to retrieve data from a web business directory.

Since the business directories scrape is the most challenging task (beside SERP scrape) there are some basic questions for me to answer:

  1. Is there any scrape protection set at that site?
  2. How much data is in that web business directory?
  3. What kind of queries can I run to find all the directory’s items?