Recently we scraped the Yelp business directory to acquire high-quality B2B leads (company + CEO info). This required us to apply many techniques, such as proxying, scraping external company sites, email verification and more.
Recently we encountered a website that worked as usual in a browser, yet put up blocking measures once we composed and ran a scraping script/agent against it.
In this post we’ll take a look at how the scraping process went and the measures we performed to overcome that.
Recently we encountered a powerful new scraping service called Data Collector [of Bright Data]. A live test and a thorough drill-in are coming soon. For now we want to highlight its main features, which have strongly impressed us.
“Out of the 15 bank customers to whom the manager offered to connect autopayments, four agreed. Service activation is a binary feature that can be described by the Bernoulli distribution.”
Let’s find the maximum likelihood estimate for the parameter p from this sample.
1) Likelihood function:
L(Xn, p) = ∏ p^[Xi=1] * (1-p)^[Xi=0] = p^4 * (1-p)^11
2) We find the maximum likelihood estimate for the parameter p.
We take the logarithm of L(Xn, p) and get the following:
ln(p^4 * (1-p)^11) = 4*ln(p) + 11*ln(1-p)
3) Now we take its derivative and equate it to zero to find p.
[4ln(p) + 11ln(1-p)]' = 4 (ln(p))' + 11 (ln(1-p))' = 4/p + 11/(1-p) * (-1) = 4/p - 11/(1-p) = 0
It follows that: 4/p = 11/(1-p) => 4(1-p) = 11p => 15p = 4 => p = 4/15 ≈ 0.2667.
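The derivation above can be sanity-checked numerically. The following is a minimal sketch, not part of the original post: the function name `bernoulli_log_likelihood` and the grid resolution are assumptions chosen for illustration. It compares the closed-form estimate (successes divided by sample size) with the value of p that maximizes the log-likelihood over a fine grid.

```python
import math


def bernoulli_log_likelihood(p, successes, failures):
    """Log-likelihood of a Bernoulli sample: k*ln(p) + (n-k)*ln(1-p)."""
    return successes * math.log(p) + failures * math.log(1 - p)


# Sample from the example: 15 customers, 4 agreed, 11 declined.
successes, failures = 4, 11
n = successes + failures

# Closed-form maximum likelihood estimate derived above.
p_closed_form = successes / n

# Brute-force check: evaluate the log-likelihood on a grid over (0, 1)
# and pick the grid point where it is largest.
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, successes, failures))

print(f"closed form: {p_closed_form:.5f}")
print(f"grid search: {p_grid:.4f}")
```

The two values should agree up to the grid step, which supports the result p = 4/15.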
We’ll also interpret the linear dependencies we find, that is, check whether the discovered pattern agrees with common sense. The main purpose of the task is to show and explain by example what causes overfitting and how to overcome it.
The code as an IPython notebook
In a previous post we considered ways to disguise an automated Chrome browser by spoofing some of its parameters – Headless Chrome detection and anti-detection. Here we’ll share the practical results of testing fingerprints against a benchmark for both human-operated and automated Chrome browsers.
Imperva (which now includes the former Distil anti-bot management) is a service providing many kinds of website protection. The current Imperva services include the following:
- Cloud Web Application Firewall (WAF)
- Bot Protection service (formerly Distil Networks)
- IP Reputation Intelligence
- Content Delivery Network (CDN)
- Attack Analytics solution (e.g. DDoS)
As for protection against bot scraping activities, we note the following.
My goal was to retrieve data from a web business directory.
Since scraping business directories is the most challenging task (besides SERP scraping), there are some basic questions for me to answer:
- Is there any scrape protection set at that site?
- How much data is in that web business directory?
- What kind of queries can I run to find all the directory’s items?
- Online marketplaces
In online marketplaces people offer their products for sale. They are similar to garage sales, but online (e.g. eCrater, www.1188.no).
They are easy to scrape since they are usually free and do not tend to protect their data.
- Business directories
These are usually huge online directories targeted at a general audience (e.g. Yellow Pages). They do protect their data to avoid duplication and loss of audience. See some posts on this.