
The objective of the task is to build a model that, as optimally as this data allows, relates molecular information to an actual biological response.

We have shared the data in comma-separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response: the molecule was seen to elicit this response (1) or not (0). The remaining columns represent molecular descriptors (D1 through D1776); these are calculated properties that can capture some characteristics of the molecule, for example size, shape, or elemental constitution. The descriptor matrix has been normalized.

Source.

Data

Data description: we use train.csv from the original task as the bioresponse.csv file.

sklearn.ensemble.RandomForestClassifier

%pylab inline
from sklearn import ensemble, model_selection, metrics 
import numpy as np
import pandas as pd
bioresponce = pd.read_csv("bioresponse.csv", header=0, sep=",")
bioresponce.head()
   Activity        D1        D2    D3   D4        D5        D6        D7        D8        D9  ...  D1767  D1768  D1769  D1770  D1771  D1772  D1773  D1774  D1775  D1776
0         1  0.000000  0.497009  0.10  0.0  0.132956  0.678031  0.273166  0.585445  0.743663  ...      0      0      0      0      0      0      0      0      0      0
1         1  0.366667  0.606291  0.05  0.0  0.111209  0.803455  0.106105  0.411754  0.836582  ...      1      1      1      1      0      1      0      0      1      0
2         1  0.033300  0.480124  0.00  0.0  0.209791  0.610350  0.356453  0.517720  0.679051  ...      0      0      0      0      0      0      0      0      0      0
3         1  0.000000  0.538825  0.00  0.5  0.196344  0.724230  0.235606  0.288764  0.805110  ...      0      0      0      0      0      0      0      0      0      0
4         0  0.100000  0.517794  0.00  0.0  0.494734  0.781422  0.154361  0.303809  0.812646  ...      0      0      0      0      0      0      0      0      0      0

5 rows × 1777 columns

bioresponce.shape
(3751, 1777)
bioresponce.columns
Index(["Activity", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9",
       ...
       "D1767", "D1768", "D1769", "D1770", "D1771", "D1772", "D1773", "D1774",
       "D1775", "D1776"],
      dtype="object", length=1777)

Initial data analysis and preparation

We extract the targets and calculate the proportion of each class (0 and 1).

bioresponce_target = bioresponce.Activity.values
print("bioresponse = 1: {:.2f}\nbioresponse = 0: {:.2f}".format(sum(bioresponce_target)/float(len(bioresponce_target)), 
                1.0 - sum(bioresponce_target)/float(len(bioresponce_target))))
bioresponse = 1: 0.54
bioresponse = 0: 0.46

From these proportions we see that the classification task is almost balanced.
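As a small aside (not in the original notebook), the same proportions can be obtained directly with pandas:

bioresponce.Activity.value_counts(normalize=True)  # roughly 0.54 for class 1 and 0.46 for class 0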

Now we separate the features from the targets:

bioresponce_data = bioresponce.iloc[:, 1:]
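Since the descriptors are stated to be normalized, a quick optional sanity check (not part of the original analysis) is to look at the overall value range of the feature matrix:

# if the descriptor matrix is normalized, all values should lie in [0, 1]
print(bioresponce_data.values.min(), bioresponce_data.values.max())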

Building a Random Forest model with RandomForestClassifier

We use RandomForestClassifier with the required parameters. The model can be trained and applied with the fit() and predict() methods. Hyperparameter selection could be done with an exhaustive grid search or a randomized grid search; we do not do that here. You can find an implementation of it here.
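As a minimal sketch of that workflow (the parameter grid below is purely illustrative and is not tuned or run in this post):

# training and applying the forest with fit() / predict()
rf = ensemble.RandomForestClassifier(n_estimators=50, random_state=1)
rf.fit(bioresponce_data, bioresponce_target)
predictions = rf.predict(bioresponce_data)

# an exhaustive grid search could look like this (illustrative grid only)
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 5, 10]}
grid = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=1),
    param_grid, cv=3, scoring="accuracy")
# grid.fit(bioresponce_data, bioresponce_target); grid.best_params_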

Here we analyse how the model's quality (accuracy) depends on the number of training objects.

Learning curves for trees of small depth

We build a Random Forest model with 50 trees, each with max depth 2.

rf_classifier_low_depth = ensemble.RandomForestClassifier(
    n_estimators=50, max_depth=2, random_state=1)

The function learning_curve() lets us show how accuracy depends on the number of training objects. It takes the estimator, the data, the targets, and the proportions of the data set we want to train on.

With learning_curve(), several models are built and a quality metric (scoring) is computed for each one. The function returns the training set sizes and the scores on the training and test (cross-validation) folds.

Having these data, we can analyse how the model quality depends on the training set size.

train_sizes, train_scores, test_scores = model_selection.learning_curve(
    rf_classifier_low_depth, bioresponce_data, bioresponce_target,
    train_sizes=np.arange(0.1, 1., 0.2),
    cv=3, scoring="accuracy")
np.arange(0.1, 1., 0.2) # [0.1, 0.3, 0.5, 0.7, 0.9]
print(train_sizes)
# we take mean for all CV folds.
print(train_scores.mean(axis = 1))
print(test_scores.mean(axis = 1))
[ 250 750 1250 1750 2250 ]
[0.74933333 0.71333333 0.68453333 0.69104762 0.69022222]
[0.62356685 0.64195598 0.65369955 0.66248974 0.66728527]
Now we plot the learning curves:
pylab.grid(True)
pylab.plot(train_sizes, train_scores.mean(axis = 1), "g-", marker="o", label="train")
pylab.plot(train_sizes, test_scores.mean(axis = 1), "r-", marker="o", label="test")
pylab.ylim((0.0, 1.05))
pylab.legend(loc="lower right")
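Optionally (not in the original), the spread of the scores across the CV folds can be added to the same plot to judge how stable the curves are:

# shaded bands of +/- one standard deviation across the CV folds
pylab.fill_between(train_sizes,
                   train_scores.mean(axis=1) - train_scores.std(axis=1),
                   train_scores.mean(axis=1) + train_scores.std(axis=1),
                   color="g", alpha=0.15)
pylab.fill_between(train_sizes,
                   test_scores.mean(axis=1) - test_scores.std(axis=1),
                   test_scores.mean(axis=1) + test_scores.std(axis=1),
                   color="r", alpha=0.15)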

Conclusion from these data

The curves converge, so further growth of the training set size (beyond 2250 items) would hardly improve the quality of this shallow model.

Learning curves for trees of greater depth

We'll try to increase the model complexity, which might improve its quality metrics. Let's set max_depth = 10.

rf_classifier = ensemble.RandomForestClassifier(
    n_estimators=50, max_depth=10, random_state=1)
train_sizes, train_scores, test_scores = model_selection.learning_curve(
    rf_classifier, bioresponce_data, bioresponce_target,
    train_sizes=np.arange(0.1, 1., 0.2),
    cv=3, scoring="accuracy")
pylab.grid(True)
pylab.plot(train_sizes, train_scores.mean(axis = 1), "g-", marker="o", label="train")
pylab.plot(train_sizes, test_scores.mean(axis = 1), "r-", marker="o", label="test")
pylab.ylim((0.0, 1.05))
pylab.legend(loc="lower right")

Conclusion for the model of higher complexity

For this more complex model, the growth of the training set indeed improves the model quality.
