Challenge Data Mining

Finding maximum likelihood estimate for the Bernoulli distribution parameter

“Out of the 15 bank customers to whom the manager offered to connect autopayments, four agreed. Service activation is a binary feature that can be described by the Bernoulli distribution.”

Let’s find the maximum likelihood estimate for the parameter p from this sample.

1) Likelihood function:

L(Xn, p) = ∏ p^[Xi=1] * (1−p)^[Xi=0] = p^4 * (1−p)^11

2) We find the maximum likelihood estimate for the parameter p.
Taking the logarithm of L(Xn, p), we get the following:

ln(p^4 * (1-p)^11) = 4*ln(p) + 11*ln(1-p)

3) Now we take its derivative and set it to zero to find p.
[4ln(p) + 11ln(1-p)]' = 4 (ln(p))' + 11 (ln(1-p))' = 4/p + 11/(1-p) * (-1) = 4/p - 11/(1-p) = 0
It follows that 4/p = 11/(1-p) => 4(1-p) = 11p => 15p = 4 => p = 4/15 ≈ 0.26667.
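
As a quick sanity check, the closed-form result can be compared with a brute-force search over candidate values of p. This is only a sketch: the function name `log_likelihood` and the grid step of 0.001 are choices made for illustration, not part of the original post.

```python
import math

def log_likelihood(p, successes=4, failures=11):
    """Bernoulli log-likelihood: successes*ln(p) + failures*ln(1-p)."""
    return successes * math.log(p) + failures * math.log(1 - p)

# Closed-form estimate from the derivation: the share of successes.
p_hat = 4 / 15

# Candidate values strictly inside (0, 1), so that ln() is defined.
grid = [i / 1000 for i in range(1, 1000)]

# The grid point with the highest log-likelihood.
p_grid = max(grid, key=log_likelihood)

print(f"closed form: {p_hat:.5f}, grid search: {p_grid:.3f}")
```

The grid search should land on the grid point next to 4/15, which agrees with the derivative-based answer up to the grid resolution.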

Challenge Data Mining

Linear regression in example: overfitting and regularization

In this post we set up a linear model to predict the number of bike rentals from the calendar characteristics of the day and the weather conditions. We choose the feature weights so that the model captures all the linear dependencies in the data while ignoring superfluous features. This way the model does not overfit and makes fairly accurate predictions on new data.

We’ll also interpret the linear dependencies we find, that is, check whether each discovered pattern agrees with common sense. The main purpose of the task is to show and explain by example what causes overfitting and how to overcome it.
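
To give a feel for the effect, here is a minimal sketch that does not use the bike rental data. It assumes invented toy data with two nearly identical features and uses ridge (L2) regularization, which may differ from the regularizer used in the notebook. The helper `fit_two_features` and the value `lam=1.0` are illustrative choices.

```python
def fit_two_features(x1, x2, y, lam=0.0):
    """Solve (X^T X + lam*I) w = X^T y for two features, no intercept."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * v for u, v in zip(x1, y))
    r2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b
    # Cramer's rule for the 2x2 linear system.
    return ((r1 * d - b * r2) / det, (a * r2 - b * r1) / det)

def norm(w):
    """Euclidean length of the weight vector."""
    return (w[0] ** 2 + w[1] ** 2) ** 0.5

# Invented toy data: x2 nearly duplicates x1, and y is roughly 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.01]
y = [2.1, 3.9, 6.05, 7.95, 10.0]

w_plain = fit_two_features(x1, x2, y)           # no regularization
w_ridge = fit_two_features(x1, x2, y, lam=1.0)  # ridge penalty

print("no regularization:", w_plain, "norm:", norm(w_plain))
print("ridge, lam=1.0:   ", w_ridge, "norm:", norm(w_ridge))
```

Because the two features carry almost the same information, the unregularized fit is free to spread the effect between them in an unstable way, while the penalty pulls the weights toward a smaller, more even split.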

The code as an IPython notebook

Challenge Development

Human-operated and automated Browser Fingerprints testing and needed parameters

In a previous post, Headless Chrome detection and anti-detection, we considered ways to disguise an automated Chrome browser by spoofing some of its parameters. Here we share the practical results of testing fingerprints against a benchmark for both human-operated and automated Chrome browsers.


How Imperva protects against scraping bots

Imperva (which now includes the former Distil anti-bot management) is a service providing many kinds of website protection. Imperva’s current services include the following:

  1. Cloud Web Application Firewall (WAF)
  2. Bot Protection service (formerly Distil Networks)
  3. IP Reputation Intelligence
  4. Content Delivery Network (CDN)
  5. Attack Analytics solution (e.g. DDoS)

As for protection against bot scraping, we note the following.

Challenge Development

Business directory simple scraper (python) at pythonanywhere

My goal was to retrieve data from a web business directory.

Since scraping business directories is the most challenging scraping task (besides SERP scraping), there are some basic questions for me to answer:

  1. Is there any scrape protection set at that site?
  2. How much data is in that web business directory?
  3. What kind of queries can I run to find all the directory’s items?

Most popular web scraping targets and how to scrape them

  1. Online marketplaces
    In the marketplaces people offer their products for sale. Similar to garage sales, but online. (eg. eCrater,
    They are easy to scrape, since they are usually free and tend not to protect their data.
  2. Business directories
    Usually huge online directories aimed at a general audience (e.g. Yellow Pages). They protect their data to avoid duplication and loss of audience. See some posts on this.

Challenge Development

Scraping a Javascript-dependent website with puppeteer

Support us by purchasing the book (under $5) on this topic.

In today’s Web 2.0, many business websites use JavaScript to protect their content from web scraping and other undesired bot visits. In this article we share the theory and practice of scraping JS-dependent/JS-protected websites.

Challenge Development

Scrape with Google App Script

In this post I want to show you how I managed to complete the challenge of scraping a site with Google Apps Script (GAS).


Is there any way to skip CAPTCHA?

JavaScript powered CAPTCHA

Most answers to this question on internet forums come from services that solve CAPTCHAs automatically. They offer to solve the CAPTCHA rather than to skip it entirely.


What are the best online resources to acquire data?

Recently I received this question: What are the best online resources to acquire data from?

The top sites for data scraping are data aggregators. Why are they the top choice for data extraction?
Because they provide the fullest, most comprehensive data sets, and the data in them are highly categorized. You therefore do not need to crawl and fetch other resources and then combine data from multiple sources.

Those sites fall into 2 categories:

  1. Goods and services aggregators, e.g. AliExpress, Amazon, Craigslist.
  2. Personal and company data aggregators, e.g. LinkedIn, Xing, YellowPages. Such aggregators are also called business directories.

The first category of sites and services is quite widespread. These sites and services promote their goods with the goal of being well known online and having as many backlinks as possible.

The second category, the business directories, tends not to reveal its data to the public. These directories prefer to promote their brand and give scraping bots minimal opportunity to acquire data*.

Consider the following picture, where a company data aggregator gives the user only two input fields: what and where.
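
With only these two fields available, one way to cover a directory is to enumerate combinations of "what" and "where" values. The sketch below only builds the query URLs and sends no requests. The endpoint `directory.example.com`, the parameter names `what` and `where`, and the sample values are all hypothetical; a real directory will use its own.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://directory.example.com/search"

def build_queries(whats, wheres):
    """Return one search URL per (what, where) combination."""
    return [
        f"{BASE_URL}?{urlencode({'what': what, 'where': where})}"
        for what, where in product(whats, wheres)
    ]

# Sample category and location lists (made up).
urls = build_queries(["plumber", "dentist"], ["Boston", "New York"])
for url in urls:
    print(url)
```

The number of queries grows as the product of the two lists, so in practice the lists would need to be chosen with the directory's size and request limits in mind.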

You can find more on how to scrape data aggregators in this post.

*You have to adhere to the ToS of each particular website or web service when you scrape its data.