
Random Forest vs Gradient Boosting

The objective of the task is to build a model that relates molecular information to an actual biological response as well as this data allows.

We have shared the data in comma-separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response: the molecule either elicited the response (1) or did not (0). The remaining columns are molecular descriptors (D1 through D1776), calculated properties that capture some characteristics of the molecule, for example its size, shape, or elemental constitution. The descriptor matrix has been normalized.

Data

Data description: we use train.csv from the original task, saved as the bioresponse.csv file.

We compare two classifiers:

(1) sklearn.ensemble.RandomForestClassifier
and
(2) xgboost.XGBClassifier
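
Both libraries implement the scikit-learn estimator interface (fit / predict_proba), which is why cross_val_score can be applied to either model later on. A minimal sketch, with purely illustrative parameter values and placeholder data:

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import numpy as np

X = np.random.rand(100, 5)          # placeholder feature matrix
y = np.random.randint(0, 2, 100)    # placeholder binary labels

for clf in (RandomForestClassifier(n_estimators=50, random_state=1),
            XGBClassifier(n_estimators=50, learning_rate=0.1, max_depth=5)):
    clf.fit(X, y)                   # same training call for both models
    proba = clf.predict_proba(X)    # same prediction call for both models
    print(type(clf).__name__, proba.shape)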

from sklearn import ensemble, model_selection, metrics 

import numpy as np
import pandas as pd
import xgboost as xgb
%pylab inline
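
Note that %pylab inline is discouraged in recent IPython releases; an explicit equivalent (assuming a Jupyter notebook) that keeps the pylab.* calls below working would be:

%matplotlib inline
from matplotlib import pylab   # provides pylab.plot, pylab.grid, ... used below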

Loading the data

bioresponce = pd.read_csv("bioresponse.csv", header=0, sep=",")
bioresponce.head()
   Activity        D1        D2    D3   D4        D5        D6        D7        D8        D9  ...  D1767  D1768  D1769  D1770  D1771  D1772  D1773  D1774  D1775  D1776
0         1  0.000000  0.497009  0.10  0.0  0.132956  0.678031  0.273166  0.585445  0.743663  ...      0      0      0      0      0      0      0      0      0      0
1         1  0.366667  0.606291  0.05  0.0  0.111209  0.803455  0.106105  0.411754  0.836582  ...      1      1      1      1      0      1      0      0      1      0
2         1  0.033300  0.480124  0.00  0.0  0.209791  0.610350  0.356453  0.517720  0.679051  ...      0      0      0      0      0      0      0      0      0      0
3         1  0.000000  0.538825  0.00  0.5  0.196344  0.724230  0.235606  0.288764  0.805110  ...      0      0      0      0      0      0      0      0      0      0
4         0  0.100000  0.517794  0.00  0.0  0.494734  0.781422  0.154361  0.303809  0.812646  ...      0      0      0      0      0      0      0      0      0      0

5 rows × 1777 columns

bioresponce_target = bioresponce.Activity.values
bioresponce_data = bioresponce.iloc[:, 1:]
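
Before fitting anything, a quick sanity check (a small optional sketch; it assumes Activity is stored as integers, and the printed values depend on the file) confirms the shape of the descriptor matrix and the class balance:

print(bioresponce_data.shape)            # (number of molecules, 1776 descriptors)
print(np.bincount(bioresponce_target))   # molecules in class 0 and class 1
print(bioresponce_target.mean())         # fraction of positive responses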

Now we run both algorithms and measure their quality as a function of the number of trees.

Model RandomForestClassifier

We want to know how the quality (accuracy) changes with the number of trees.

n_trees = [1] + list(range(10, 55, 5))
n_trees
[1, 10, 15, 20, 25, 30, 35, 40, 45, 50]
%%time
scoring = []
for n_tree in n_trees:
    # 3-fold cross-validated accuracy for a forest with n_tree trees
    estimator = ensemble.RandomForestClassifier(n_estimators=n_tree, min_samples_split=5, random_state=1)
    score = model_selection.cross_val_score(estimator, bioresponce_data, bioresponce_target,
                                            scoring="accuracy", cv=3)
    scoring.append(score)
scoring = np.asmatrix(scoring)   # rows: tree counts, columns: CV folds
Wall time: 20.6 s

The result is a matrix in which each row corresponds to one value from the n_trees array and each column to one of the three cross-validation folds.

scoring
matrix([[0.66906475, 0.668     , 0.6704    ],
        [0.75859313, 0.7592    , 0.7504    ],
        [0.78097522, 0.7616    , 0.7592    ],
        [0.78417266, 0.7736    , 0.7648    ],
        [0.78257394, 0.7736    , 0.7704    ],
        [0.78816946, 0.7752    , 0.7736    ],
        [0.78816946, 0.7856    , 0.7776    ],
        [0.78896882, 0.78      , 0.7832    ],
        [0.78976819, 0.7856    , 0.7896    ],
        [0.79536371, 0.7816    , 0.7928    ]])

We plot the mean cross-validation accuracy against the number of trees (n_trees).

pylab.plot(n_trees, scoring.mean(axis = 1), marker=".", label="RandomForest")
pylab.grid(True) # adding the grid to the graph
pylab.xlabel("n_trees")
pylab.ylabel("score")
pylab.title("Accuracy score")
pylab.legend(loc="lower right")
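
Since each row of scoring holds the three fold accuracies, we can also show the fold-to-fold spread as error bars; a small optional sketch using the same pylab interface:

rf_mean = np.asarray(scoring).mean(axis=1)   # mean accuracy for each tree count
rf_std = np.asarray(scoring).std(axis=1)     # spread across the 3 folds
pylab.errorbar(n_trees, rf_mean, yerr=rf_std, marker=".", capsize=3, label="RandomForest")
pylab.grid(True)
pylab.xlabel("n_trees")
pylab.ylabel("score")
pylab.title("Accuracy score with fold spread")
pylab.legend(loc="lower right")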

Model xgboost.XGBClassifier

Now we build the gradient boosting model, xgb.XGBClassifier, and use the same set of tree counts.

We use trees of depth 5 and a learning rate of 0.1. The key parameters are:

  • max_depth – maximum depth of each tree
  • n_estimators – maximum number of trees
%%time
xgb_scoring = []
for n_tree in n_trees:
    estimator = xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=n_tree, min_child_weight=3)
    # xgb.XGBClassifier implements the sklearn estimator interface, so it is compliant with cross_val_score()
    score = model_selection.cross_val_score(estimator, bioresponce_data, bioresponce_target, 
                                             scoring = "accuracy", cv = 3)    
    xgb_scoring.append(score)
xgb_scoring = np.asmatrix(xgb_scoring)
Wall time: 1 min 48 s

xgb_scoring
matrix([[0.76498801, 0.756     , 0.756     ],
        [0.77617906, 0.7752    , 0.7688    ],
        [0.77857714, 0.7744    , 0.7768    ],
        [0.7873701 , 0.7784    , 0.7768    ],
        [0.79216627, 0.7736    , 0.7832    ],
        [0.79776179, 0.7776    , 0.7824    ],
        [0.79616307, 0.7816    , 0.78      ],
        [0.79296563, 0.7848    , 0.7792    ],
        [0.79856115, 0.7832    , 0.7808    ],
        [0.79936051, 0.7832    , 0.7832    ]])

We draw a graph with both Random Forest and gradient boosting for comparison.

pylab.plot(n_trees, scoring.mean(axis = 1), marker=".", label="RandomForest")
pylab.plot(n_trees, xgb_scoring.mean(axis = 1), marker=".", label="XGBoost")
pylab.grid(True)
pylab.xlabel("n_trees")
pylab.ylabel("score")
pylab.title("Accuracy score")
pylab.legend(loc="lower right")

Results comparison

Classifier algorithm   Timing     Best accuracy
Random Forest          20 sec.    0.8
Gradient boosting      108 sec.   0.8
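
The accuracy column above can be read directly from the two scoring matrices (the timings come from the %%time outputs); a short sketch:

rf_best = np.asarray(scoring).mean(axis=1).max()       # best mean CV accuracy, Random Forest
xgb_best = np.asarray(xgb_scoring).mean(axis=1).max()  # best mean CV accuracy, XGBoost
print("Random Forest best mean CV accuracy:     %.3f" % rf_best)
print("Gradient boosting best mean CV accuracy: %.3f" % xgb_best)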

Conclusion

  1. Both algorithms reach a comparable accuracy of about 0.8.
  2. Gradient boosting reaches relatively high accuracy with fewer trees than Random Forest.
  3. The Random Forest run is much faster than gradient boosting (XGBoost); see the results comparison table above.

