Categories
Data Mining

Предобработка данных и логистическая регрессия для задачи бинарной классификации

В задании вам будет предложено ознакомиться с основными техниками предобработки данных, а также применить их для обучения модели логистической регрессии. Ответ потребуется загрузить в соответствующую форму в виде 6 текстовых файлов.

Задача: по 38 признакам, связанных с заявкой на грант (область исследований учёных, информация по их академическому бэкграунду, размер гранта, область, в которой он выдаётся) предсказать, будет ли заявка принята. Датасет включает в себя информацию по 6000 заявкам на гранты, которые были поданы в университете Мельбурна в период с 2004 по 2008 год.

iPython notebook and data as CSV

Categories
Development

Cheerio.js, get items from html table into object

Suppose there is a table like below (1 info row only):

Blows
Minute (BPM)
Speed (RPM) Power, PSI Flow, PSI
Tool Sys
0-2500 0-250 1.8 HP 2.6-13.2 GPM SDS Max

How to scrape it using cheerio.js as a parser?

Case 1 (1 row only)

Categories
Data Mining

Bike Sharing Demand Problem, part 2 – Sklearn SGD regression model, scaling, transformation chain and Random Forest nonlinear model

The Bike Sharing Demand problem requires using historical data on weather conditions and bicycle rental to predict the number of occupied bicycles (rentals) for a certain hour of a certain day.

In the original problem statement, there are 11 features available. The feature set contains both real, categorical, and binary data. For the demonstration, a training sample bike_sharing_demand.csv is used from the original data.

See the Bike Sharing Demand, part 1 of the task where we performed some initial problem analysis.

Categories
Data Mining

Bike Sharing Demand Problem, part 1 – Sklearn regression model, scaling, transformation chain

The problem is taken from kaggle.com. Based on historical data on bicycle rental and weather conditions, it is necessary to evaluate the demand for bicycle rental.

In the original problem statement, there are 11 features available. The feature set contains both real, categorical, and binary data. For the demonstration, a training sample bike_sharing_demand.csv is used from the original data.

See the Bike Sharing Demand, part 2 of the task where we performed advanced problem analysis.

Categories
Data Mining

Finding Classifier parameters on the grid, Sklearn.grid_search

Let’s answer the question: how do the parameters of the model affect its quality? And how can we select the optimal parameters for the task to be solved? We will look at the grid_search module in the sklearn library and learn how to select model parameters from the grid.

Categories
Challenge Data Mining

Finding maximum likelihood estimate for the Bernoulli distribution parameter

“Out of the 15 bank customers to whom the manager offered to connect autopayments, four agreed. Service activation is a binary feature that can be described by the Bernoulli distribution.”.

Let’s find the maximum likelihood estimate for the parameter p out of such a sample.

1) Likelihood function:

L(Xn, p) = ∏ p[Xi=1]*(1−p)[Xi=0] = p^4 * (1-p)^11

2) We find the maximum likelihood estimate for the parameter p.
We logarithm L(Xn, p) and get the following:

ln(p^4 * (1-p)^11) = 4*ln(p) + 11*ln(1-p)

3) Now we take its derivative and equate it to zero to find p.
[4ln(p) + 11ln(1-p)]` = 4 (ln(p))` + 11 (ln(1-p))` = 4/p + 11/(1-p) * (-1) = 0
Following: 4/p = 11/(1-p) => 4(1-p) = 11p => 15p = 4 => p = 4/15 =~ 0.26667.

Categories
Uncategorized

User-Agents by browsers

We attach here a link to the User-Agents presented/selected by most popular browsers. U-A’s total number is over 1600.

  • Internet Explorer
  • Firefox
  • Chrome
  • Safari
  • Opera
Categories
Development

Redirect Node.js console output into file

node.exe index.js > scrape.log 2>&1

When executing file index.js we redirect all the console.log() output from console into a file scrape.log .

Categories
Development

Remove empty html tags recursively

Sometimes we have the code with html tags that contain nothing but whitespace characters. Often those tags are nested. See a code below:

<div> 
 <div> 
  <div></div> 
 </div> 
</div>

What regex might be used to find and remove those tags?

Obvious solution is <div>\s*?<\/div> .

\s stands for “whitespace character”. It includes [ \t\n\x0B\f\r]. That is: \s matches a space( ) or a tab (\t) or a line(\n) break or a vertical tab (\x0B) sometimes referred as (\v) or a form feed (\f) or a carriage return (\r) .

General case

In general case, we use the following regex:
<(?<tag>[a-z]+?)( [^>]+?|)>\s*?<\/(\k<tag>)>

where <tag> is a named match group: [a-z]+?

JAVA code

When applying it recursively we might use the following code, JAVA:

public static String removeEmptyTags(String html) {
        boolean compareFound = true;
        Pattern pattern = Pattern.compile("<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>", Pattern.MULTILINE | Pattern.DOTALL);
        while (compareFound) {
            compareFound = false;
            Matcher matcher = pattern.matcher(html);
            if(matcher.find()) {
                compareFound = true;
                html = matcher.replaceAll("");
            }
        }
        return html;
    }
Categories
Development

Simple JAVA scraper that handles user-agent, headers and cookies

How to handle cookie, user-agent, headers when scraping with JAVA? We’ll use for this a static class ScrapeHelper that easily handles all of this. The class uses Jsoup library methods to fetch from data from server and parse html into DOM document.