Categories
Data Mining

Random Forest vs Gradient boosting

The objective of the task is to build a model that relates molecular information to an actual biological response as accurately as this data allows.

We have shared the data in comma-separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response: the molecule was seen to elicit this response (1) or not (0). The remaining columns represent molecular descriptors (D1 through D1776). These are calculated properties that capture some characteristics of the molecule, such as size, shape, or elemental constitution. The descriptor matrix has been normalized.
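The comparison named in the title can be sketched roughly as follows. This is a minimal illustration, not the post's own code: the real CSV is replaced by a small synthetic array with the same layout (response in column 0, descriptors after it), the descriptors are assumed to be scaled to [0, 1], and the model settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV (an assumption): column 0 is the 0/1 response,
# the remaining columns play the role of normalized descriptors.
rng = np.random.default_rng(0)
n_molecules, n_descriptors = 400, 20
descriptors = rng.random((n_molecules, n_descriptors))
response = (descriptors[:, 0] + descriptors[:, 1] > 1.0).astype(int)
data = np.column_stack([response, descriptors])

# Split the table the way the data description lays it out.
y = data[:, 0].astype(int)
X = data[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    scores[name] = {
        "accuracy": model.score(X_test, y_test),
        "log_loss": log_loss(y_test, proba),
    }
    print(name, scores[name])
```

On the real data the synthetic block would be replaced by reading the CSV file; which model wins depends on the data and on tuning, so the sketch makes no claim about the outcome.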

Bagging and Random Forest

In this post we work through several tasks with Bagging and Random Forest classifiers.

We gradually develop a bagging classifier on randomized trees that, in its final stage, matches the Random Forest algorithm.

We’ll also build the RandomForestClassifier from sklearn.ensemble and examine how its quality depends on (1) the number of trees, (2) the maximum number of features considered at each tree node, and (3) the maximum tree depth.

Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset, then combines the predictions from all models.

Random forest is an extension of bagging that also randomly selects the subset of features considered at each tree node.
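The step from bagging to a random forest can be sketched as below. This is an illustrative setup under assumptions: the data comes from make_classification rather than the post's dataset, and the tree counts and max_features="sqrt" are arbitrary choices. The base tree is passed positionally because its keyword name differs between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, not the dataset used in the post).
X, y = make_classification(
    n_samples=500, n_features=25, n_informative=8, random_state=0
)

candidates = {
    # Plain bagging: every tree sees a bootstrap sample and all features.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Bagging on randomized trees: each split considers a random feature subset.
    "bagging_randomized_trees": BaggingClassifier(
        DecisionTreeClassifier(max_features="sqrt"), n_estimators=50, random_state=0
    ),
    # The library's random forest, configured the same way.
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0
    ),
}

mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, value in mean_accuracy.items():
    print(f"{name}: {value:.3f}")
```

Rerunning the loop with different n_estimators, max_features, or max_depth values is one way to see how each setting affects quality.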

Data preprocessing and logistic regression for a binary classification task

In this assignment you will get acquainted with the main data preprocessing techniques and apply them to train a logistic regression model. The answers must be uploaded to the corresponding form as 6 text files.

Task: using 38 features related to a grant application (the researchers' field of study, information on their academic background, the grant size, and the area in which it is awarded), predict whether the application will be accepted. The dataset contains information on 6000 grant applications submitted at the University of Melbourne between 2004 and 2008.
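A typical preprocessing pipeline for this kind of task might look like the sketch below. It does not use the grant dataset: the three columns (grant_size, publications, research_area), the labels, and the missing values are all hypothetical and generated on the spot, purely to show imputation, scaling, and one-hot encoding feeding a logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table standing in for the grant applications.
rng = np.random.default_rng(0)
n = 600
frame = pd.DataFrame({
    "grant_size": rng.normal(100.0, 30.0, n),
    "publications": rng.poisson(5, n).astype(float),
    "research_area": rng.choice(["bio", "physics", "math"], n),
})
# Hypothetical label: driven by publications, with 10% of labels flipped.
y = ((frame["publications"] > 4) ^ (rng.random(n) < 0.1)).astype(int)
# Remove some numeric values to imitate gaps in real data.
frame.loc[rng.random(n) < 0.1, "grant_size"] = np.nan

numeric = ["grant_size", "publications"]
categorical = ["research_area"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the mean, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Categorical columns: one-hot encode, ignoring unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    frame, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"ROC AUC on held-out data: {auc:.3f}")
```

Keeping the preprocessing inside the pipeline means the imputation means and scaling statistics are learned from the training split only.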

IPython notebook and data as CSV

Finding Classifier parameters on the grid, Sklearn.grid_search

Let’s answer the question: how do the parameters of the model affect its quality? And how can we select the optimal parameters for the task to be solved? We will look at the grid_search module in the sklearn library and learn how to select model parameters from the grid.
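A grid search over model parameters can be sketched as follows. Note one assumption: the example imports GridSearchCV from sklearn.model_selection, where it lives in current scikit-learn releases, rather than from the older sklearn.grid_search module named in the title. The iris data, the decision tree, and the grid values are illustrative choices, not taken from the post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every combination of these values is evaluated: 4 * 3 = 12 candidates.
param_grid = {
    "max_depth": [1, 2, 3, 5],
    "min_samples_leaf": [1, 5, 10],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

After fitting, search.cv_results_ holds the score of every grid point, which helps show how each parameter affects quality.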

Sklearn, Classification and Regression metrics

In this post we review a number of metrics for evaluating classification and regression models, using functions from the sklearn library. We’ll learn how to generate model data, train linear models, and evaluate their quality.
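The workflow described above (generate data, train a linear model, score it) could look something like this. The dataset sizes, noise level, and the particular metrics shown are assumptions made for the example, not a list of what the notebook covers.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Classification: generated data, logistic regression, three metrics.
Xc, yc = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_train, yc_train)
labels = clf.predict(Xc_test)
probabilities = clf.predict_proba(Xc_test)[:, 1]
classification_metrics = {
    "accuracy": accuracy_score(yc_test, labels),
    "f1": f1_score(yc_test, labels),
    "roc_auc": roc_auc_score(yc_test, probabilities),
}

# Regression: generated data, ordinary least squares, three metrics.
Xr, yr = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
predictions = reg.predict(Xr_test)
regression_metrics = {
    "mae": mean_absolute_error(yr_test, predictions),
    "rmse": mean_squared_error(yr_test, predictions) ** 0.5,
    "r2": r2_score(yr_test, predictions),
}

print(classification_metrics)
print(regression_metrics)
```

Label-based metrics (accuracy, F1) take predicted classes, while ROC AUC takes predicted probabilities or scores.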

The code as an IPython notebook

Linear models, Sklearn.linear_model, Regression

In this post we’ll show how to build linear regression models using the sklearn.linear_model module.

See also the post on linear classification models using the sklearn.linear_model module.
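As a rough sketch of what building such models involves, the example below fits three estimators from sklearn.linear_model on generated data. The choice of LinearRegression, Ridge, and Lasso, the alpha values, and the synthetic dataset are assumptions for illustration and may differ from what the notebook uses.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Generated data where only 3 of the 10 features actually matter.
X, y, true_coef = make_regression(
    n_samples=200,
    n_features=10,
    n_informative=3,
    noise=5.0,
    coef=True,
    random_state=0,
)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # L2 penalty shrinks coefficients
    "lasso": Lasso(alpha=1.0),   # L1 penalty can set coefficients to zero
}

results = {}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    model.fit(X, y)
    results[name] = {
        "r2": r2,
        "nonzero_coefs": int(np.count_nonzero(model.coef_)),
    }
    print(name, results[name])
```

Comparing nonzero_coefs across the three models shows how the L1 penalty tends to discard uninformative features.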

The code as an IPython notebook

Categories
Data Mining, Development

Linear models, Sklearn.linear_model, Classification

In this post we’ll show how to build linear classification models using the sklearn.linear_model module.
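A minimal sketch of linear classification with sklearn.linear_model follows. The two-cluster data from make_blobs, the cluster centers, and the three classifiers chosen here are assumptions made for the example and are not necessarily those in the notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.model_selection import cross_val_score

# Two overlapping Gaussian clusters, one per class (centers chosen arbitrarily).
X, y = make_blobs(
    n_samples=300,
    centers=[[-2.0, -2.0], [2.0, 2.0]],
    cluster_std=1.5,
    random_state=0,
)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "ridge_classifier": RidgeClassifier(),
    "sgd_classifier": SGDClassifier(random_state=0),
}

accuracies = {
    name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}
for name, value in accuracies.items():
    print(f"{name}: {value:.3f}")
```

All three models learn a linear decision boundary; they differ in the loss function and in how the weights are optimized.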

The code as an IPython notebook