The objective of this task is to build a model that relates molecular information to an actual biological response, as optimally as the data allow.
We have shared the data in comma-separated values (CSV) format. Each row of the data set represents a molecule. The first column contains experimental data describing the actual biological response: the molecule either elicited the response (1) or did not (0). The remaining columns are molecular descriptors (D1 through D1776): calculated properties that capture some characteristics of the molecule, for example size, shape, or elemental constitution. The descriptor matrix has been normalized.
Data
Data description: we use train.csv from the original task as the bioresponse.csv file.
We compare two ensemble classifiers: (1) sklearn.ensemble.RandomForestClassifier and (2) xgboost.XGBClassifier.
```python
from sklearn import ensemble, model_selection
import numpy as np
import pandas as pd
import xgboost as xgb

%matplotlib inline
import matplotlib.pyplot as plt
```
Loading the data
```python
bioresponse = pd.read_csv("bioresponse.csv", header=0, sep=",")
bioresponse.head()
```
| | Activity | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | … | D1767 | D1768 | D1769 | D1770 | D1771 | D1772 | D1773 | D1774 | D1775 | D1776 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.000000 | 0.497009 | 0.10 | 0.0 | 0.132956 | 0.678031 | 0.273166 | 0.585445 | 0.743663 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.366667 | 0.606291 | 0.05 | 0.0 | 0.111209 | 0.803455 | 0.106105 | 0.411754 | 0.836582 | … | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0.033300 | 0.480124 | 0.00 | 0.0 | 0.209791 | 0.610350 | 0.356453 | 0.517720 | 0.679051 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.000000 | 0.538825 | 0.00 | 0.5 | 0.196344 | 0.724230 | 0.235606 | 0.288764 | 0.805110 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.100000 | 0.517794 | 0.00 | 0.0 | 0.494734 | 0.781422 | 0.154361 | 0.303809 | 0.812646 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 1777 columns
```python
bioresponse_target = bioresponse.Activity.values
bioresponse_data = bioresponse.iloc[:, 1:]
```
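Before modelling, it is worth sanity-checking the split. A minimal sketch, assuming the `bioresponse_data` and `bioresponse_target` variables defined above:

```python
# Class balance: fraction of molecules that did / did not elicit the response.
print(pd.Series(bioresponse_target).value_counts(normalize=True))

# The task states the descriptor matrix is normalized; check the value range.
print(bioresponse_data.values.min(), bioresponse_data.values.max())

# Expected shape: one row per molecule, 1776 descriptor columns.
print(bioresponse_data.shape)
```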
Now we will run both algorithms and measure their quality as a function of the number of trees.
Model RandomForestClassifier
We want to know how the quality (accuracy) changes with the number of trees.
```python
n_trees = [1] + list(range(10, 55, 5))
n_trees
```

```
[1, 10, 15, 20, 25, 30, 35, 40, 45, 50]
```
```python
%%time
scoring = []
for n_tree in n_trees:
    estimator = ensemble.RandomForestClassifier(n_estimators=n_tree,
                                                min_samples_split=5,
                                                random_state=1)
    score = model_selection.cross_val_score(estimator, bioresponse_data,
                                            bioresponse_target,
                                            scoring="accuracy", cv=3)
    scoring.append(score)
scoring = np.asmatrix(scoring)
```

```
Wall time: 20.6 s
```
The result is a matrix in which each row corresponds to an entry of the n_trees array and each column to one of the three cross-validation folds.
```python
scoring
```

```
matrix([[0.66906475, 0.668     , 0.6704    ],
        [0.75859313, 0.7592    , 0.7504    ],
        [0.78097522, 0.7616    , 0.7592    ],
        [0.78417266, 0.7736    , 0.7648    ],
        [0.78257394, 0.7736    , 0.7704    ],
        [0.78816946, 0.7752    , 0.7736    ],
        [0.78816946, 0.7856    , 0.7776    ],
        [0.78896882, 0.78      , 0.7832    ],
        [0.78976819, 0.7856    , 0.7896    ],
        [0.79536371, 0.7816    , 0.7928    ]])
```
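An equivalent way to obtain this matrix is sklearn's validation_curve, which varies a single hyperparameter over a grid and cross-validates at each value. A sketch under the same settings, assuming the imports and variables defined above:

```python
# validation_curve sweeps n_estimators over n_trees and returns
# per-fold train and test scores for each grid value.
train_scores, test_scores = model_selection.validation_curve(
    ensemble.RandomForestClassifier(min_samples_split=5, random_state=1),
    bioresponse_data, bioresponse_target,
    param_name="n_estimators", param_range=n_trees,
    scoring="accuracy", cv=3)

# test_scores matches the scoring matrix above:
# one row per n_trees value, one column per fold.
print(test_scores.mean(axis=1))
```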
We plot the mean score against the number of trees (n_trees).
```python
plt.plot(n_trees, scoring.mean(axis=1), marker=".", label="RandomForest")
plt.grid(True)  # add a grid to the plot
plt.xlabel("n_trees")
plt.ylabel("score")
plt.title("Accuracy score")
plt.legend(loc="lower right")
```
Model xgboost.XGBClassifier
Now we build the gradient boosting model, xgb.XGBClassifier, using the same grid for the number of trees.
Learning curves can also be built for trees of greater depth (a depth sweep is sketched after the results below). The relevant parameters:

- max_depth – maximum tree depth
- n_estimators – maximum number of trees
```python
%%time
xgb_scoring = []
for n_tree in n_trees:
    estimator = xgb.XGBClassifier(learning_rate=0.1, max_depth=5,
                                  n_estimators=n_tree, min_child_weight=3)
    # xgb.XGBClassifier follows the sklearn estimator API,
    # so it can be passed to cross_val_score() directly
    score = model_selection.cross_val_score(estimator, bioresponse_data,
                                            bioresponse_target,
                                            scoring="accuracy", cv=3)
    xgb_scoring.append(score)
xgb_scoring = np.asmatrix(xgb_scoring)
```

```
Wall time: 1min 48s
```
```
matrix([[0.76498801, 0.756     , 0.756     ],
        [0.77617906, 0.7752    , 0.7688    ],
        [0.77857714, 0.7744    , 0.7768    ],
        [0.7873701 , 0.7784    , 0.7768    ],
        [0.79216627, 0.7736    , 0.7832    ],
        [0.79776179, 0.7776    , 0.7824    ],
        [0.79616307, 0.7816    , 0.78      ],
        [0.79296563, 0.7848    , 0.7792    ],
        [0.79856115, 0.7832    , 0.7808    ],
        [0.79936051, 0.7832    , 0.7832    ]])
```
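The loop above fixes max_depth=5. To see how depth itself affects the learning curve, the same loop can be wrapped in a sweep over max_depth. A sketch with an illustrative depth grid (the values [2, 5, 8] are an assumption, and the triple cross-validation makes this cell noticeably slower):

```python
# One learning curve per depth value (illustrative grid, not from the task).
depth_curves = {}
for depth in [2, 5, 8]:
    means = []
    for n_tree in n_trees:
        estimator = xgb.XGBClassifier(learning_rate=0.1, max_depth=depth,
                                      n_estimators=n_tree, min_child_weight=3)
        score = model_selection.cross_val_score(estimator, bioresponse_data,
                                                bioresponse_target,
                                                scoring="accuracy", cv=3)
        means.append(score.mean())
    depth_curves[depth] = means

# Overlay the curves to compare depths.
for depth, means in depth_curves.items():
    plt.plot(n_trees, means, marker=".", label=f"max_depth={depth}")
plt.legend(loc="lower right")
```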
We draw a graph with both Random Forest and gradient boosting for comparison.
```python
plt.plot(n_trees, scoring.mean(axis=1), marker=".", label="RandomForest")
plt.plot(n_trees, xgb_scoring.mean(axis=1), marker=".", label="XGBoost")
plt.grid(True)
plt.xlabel("n_trees")
plt.ylabel("score")
plt.title("Accuracy score")
plt.legend(loc="lower right")
```
Results comparison
Classifier algorithm | Wall time | Best CV accuracy
---|---|---
Random Forest | 20.6 s | ≈0.79
Gradient boosting (XGBoost) | 108 s | ≈0.79
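The wall times in the table come from the %%time magics above; outside a notebook they can be reproduced with time.perf_counter. A sketch timing one 3-fold cross-validation at 50 trees per model (absolute numbers will vary with hardware):

```python
import time

models = [
    ("Random Forest", ensemble.RandomForestClassifier(
        n_estimators=50, min_samples_split=5, random_state=1)),
    ("XGBoost", xgb.XGBClassifier(
        learning_rate=0.1, max_depth=5, n_estimators=50, min_child_weight=3)),
]

for name, est in models:
    start = time.perf_counter()
    model_selection.cross_val_score(est, bioresponse_data, bioresponse_target,
                                    scoring="accuracy", cv=3)
    print(f"{name}: {time.perf_counter() - start:.1f} s")
```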
Conclusion
- Both algorithms reach comparable quality: a cross-validated accuracy of roughly 0.79.
- Gradient boosting reaches relatively high accuracy with a small number of trees, compared to Random Forest.
- Random Forest trains much faster than gradient boosting (XGBoost); see the results comparison table above.