{ "cells": [ { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "from sklearn import datasets, model_selection, metrics, tree, ensemble\n", "digits = datasets.load_digits() " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _digits_dataset:\n", "\n", "Optical recognition of handwritten digits dataset\n", "--------------------------------------------------\n", "\n", "**Data Set Characteristics:**\n", "\n", " :Number of Instances: 5620\n", " :Number of Attributes: 64\n", " :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n", " :Missing Attribute Values: None\n", " :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n", " :Date: July; 1998\n", "\n", "This is a copy of the test set of the UCI ML hand-written digits datasets\n", "https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n", "\n", "The data set contains images of hand-written digits: 10 classes where\n", "each class refers to a digit.\n", "\n", "Preprocessing programs made available by NIST were used to extract\n", "normalized bitmaps of handwritten digits from a preprinted form. From a\n", "total of 43 people, 30 contributed to the training set and different 13\n", "to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n", "4x4 and the number of on pixels are counted in each block. This generates\n", "an input matrix of 8x8 where each element is an integer in the range\n", "0..16. This reduces dimensionality and gives invariance to small\n", "distortions.\n", "\n", "For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\n", "T. Candela, D. 
L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\n", "L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n", "1994.\n", "\n", ".. topic:: References\n", "\n", " - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n", " Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n", " Graduate Studies in Science and Engineering, Bogazici University.\n", " - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n", " - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n", " Linear dimensionalityreduction using relevance weighted LDA. School of\n", " Electrical and Electronic Engineering Nanyang Technological University.\n", " 2005.\n", " - Claudio Gentile. A New Approximate Maximal Margin Classification\n", " Algorithm. NIPS. 2000.\n" ] } ], "source": [ "print(digits.DESCR)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1797, 64)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digits.data.shape" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,\n", " 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,\n", " 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,\n", " 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,\n", " 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.],\n", " [ 0., 0., 0., 12., 13., 5., 0., 0., 0., 0., 0., 11., 16.,\n", " 9., 0., 0., 0., 0., 3., 15., 16., 6., 0., 0., 0., 7.,\n", " 15., 16., 16., 2., 0., 0., 0., 0., 1., 16., 16., 3., 0.,\n", " 0., 0., 0., 1., 16., 16., 6., 0., 0., 0., 0., 1., 16.,\n", " 16., 6., 0., 0., 0., 0., 0., 11., 16., 10., 0., 0.]])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digits.data[:2]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": 
{}, "outputs": [ { "data": { "text/plain": [ "(1797,)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digits.target.shape" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, ..., 8, 9, 8])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digits.target" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1797, 8, 8)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digits.images.shape" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature names:\n", " ['pixel_0_0' 'pixel_0_1' 'pixel_0_2' 'pixel_0_3' 'pixel_0_4' 'pixel_0_5'\n", " 'pixel_0_6' 'pixel_0_7' 'pixel_1_0' 'pixel_1_1' 'pixel_1_2' 'pixel_1_3'\n", " 'pixel_1_4' 'pixel_1_5' 'pixel_1_6' 'pixel_1_7' 'pixel_2_0' 'pixel_2_1'\n", " 'pixel_2_2' 'pixel_2_3' 'pixel_2_4' 'pixel_2_5' 'pixel_2_6' 'pixel_2_7'\n", " 'pixel_3_0' 'pixel_3_1' 'pixel_3_2' 'pixel_3_3' 'pixel_3_4' 'pixel_3_5'\n", " 'pixel_3_6' 'pixel_3_7' 'pixel_4_0' 'pixel_4_1' 'pixel_4_2' 'pixel_4_3'\n", " 'pixel_4_4' 'pixel_4_5' 'pixel_4_6' 'pixel_4_7' 'pixel_5_0' 'pixel_5_1'\n", " 'pixel_5_2' 'pixel_5_3' 'pixel_5_4' 'pixel_5_5' 'pixel_5_6' 'pixel_5_7'\n", " 'pixel_6_0' 'pixel_6_1' 'pixel_6_2' 'pixel_6_3' 'pixel_6_4' 'pixel_6_5'\n", " 'pixel_6_6' 'pixel_6_7' 'pixel_7_0' 'pixel_7_1' 'pixel_7_2' 'pixel_7_3'\n", " 'pixel_7_4' 'pixel_7_5' 'pixel_7_6' 'pixel_7_7']\n" ] } ], "source": [ "print(\"Feature names:\\n\", np.transpose(digits.feature_names))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "X = digits.data\n", "y = digits.target" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "64" ] }, 
"execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(digits.feature_names)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def write_answer(a, file_name): \n", " with open(file_name, \"w\") as fout:\n", " fout.write(str(a))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1: DecisionTreeClassifier and its score\n", "Make **DecisionTreeClassifier** with default settings and measure the quality of its operation using **cross_val_score**. \n", "\n", "We create a classifier with default parameters, and the data sample does not need to be divided. Both **X** and **y** are fit for classifier and CV scoring." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(random_state=0)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT_clf = tree.DecisionTreeClassifier(random_state=0)\n", "DT_clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "DT_scoring = model_selection.cross_val_score(DT_clf, X, y, cv = 10) # scoring = \"accuracy\"," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **cross_val_score()** function returns a numpy ndarray, which will have **k** (*k=cv*) quality numbers in each of the k-fold cross validation experiments." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.8 0.86111111 0.83333333 0.77222222 0.78888889 0.88333333\n", " 0.87777778 0.82681564 0.79329609 0.80446927]\n", "\n", "The mean of the CV scoring: 0.8241247672253259\n" ] } ], "source": [ "print(DT_scoring)\n", "\n", "print('\\nThe mean of the CV scoring:', DT_scoring.mean())\n", "\n", "write_answer(DT_scoring.mean(), 'dt_scoring.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2: Fit [bagging, bootstrap aggregating](https://en.wikipedia.org/wiki/Bootstrap_aggregating), over DecisionTreeClassifier\n", "We use **[BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)** from sklearn.ensemble\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "bagging_clf = ensemble.BaggingClassifier(DT_clf, n_estimators=100) " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "bagging_scoring = model_selection.cross_val_score(bagging_clf, X, y, cv = 10)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.85555556 0.95 0.92777778 0.92777778 0.91666667 0.98888889\n", " 0.96111111 0.91620112 0.87709497 0.90502793]\n", "\n", "The mean of the CV scoring on bagging: 0.9226101800124148\n" ] } ], "source": [ "print(bagging_scoring)\n", "\n", "print('\\nThe mean of the CV scoring on bagging:', bagging_scoring.mean())\n", "\n", "write_answer(bagging_scoring.mean(), 'bagging_scoring.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3: Fit bagging with sqrt(d) features over DecisionTreeClassifier\n", "\n", "`max_features` parameter defines the max number of features to draw from X to train each base (DecisionTree) estimator. 
\n", "\n", "**Max features** defines the random subsets of features to consider when splitting a node. The lower **the greater the reduction of variance**, but also the greater **the increase in bias**. Read more [here](https://scikit-learn.org/stable/modules/ensemble.html#parameters)." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max features for bagging: 8\n" ] } ], "source": [ "max_features = np.int(np.sqrt(X.shape[1]))\n", "print('Max features for bagging:', max_features)\n", "bagging_featues_limit_clf = ensemble.BaggingClassifier(\n", " DT_clf, n_estimators=100, max_features = max_features) " ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "bagging_featues_limit_scoring = model_selection.cross_val_score(bagging_featues_limit_clf, X, y, cv = 10)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.91666667 0.95555556 0.93333333 0.87222222 0.95555556 0.94444444\n", " 0.97222222 0.97206704 0.89944134 0.90502793]\n", "\n", "The mean of the CV scoring on bagging with 8 featues: 0.9326536312849164\n" ] } ], "source": [ "print(bagging_featues_limit_scoring)\n", "\n", "print('\\nThe mean of the CV scoring on bagging with ', max_features\n", " ,' featues:', bagging_featues_limit_scoring.mean())\n", "\n", "write_answer(bagging_featues_limit_scoring.mean(), 'bagging_featues_limit_scoring.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4: Decision Tree with random features\n", "We choose random features not once for the entire tree at the Bagging stage, but when building each node of a Decision Tree.\n", "\n", "#### This is actually the Random Forest algorithm" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(max_features=8, random_state=0)" 
] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT_max_features_clf = tree.DecisionTreeClassifier(random_state=0, max_features= max_features)\n", "DT_max_features_clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "bagging_dt_featues_limit_clf = ensemble.BaggingClassifier(\n", "    DT_max_features_clf, n_estimators=100)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "bagging_dt_featues_limit_scoring = model_selection.cross_val_score(bagging_dt_featues_limit_clf, X, y, cv = 10)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.90555556 0.97777778 0.94444444 0.93333333 0.93888889 0.96666667\n", " 0.97777778 0.96648045 0.90502793 0.93854749]\n", "\n", "The mean of the CV scoring on bagging with DT with 8 features: 0.9454500310366232\n" ] } ], "source": [ "print(bagging_dt_featues_limit_scoring)\n", "\n", "print('\\nThe mean of the CV scoring on bagging with DT with', max_features,\n", "      'features:', bagging_dt_featues_limit_scoring.mean())\n", "\n", "write_answer(bagging_dt_featues_limit_scoring.mean(), 'bagging_dt_featues_limit_scoring.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classifier obtained in this task is **bagging over randomized trees**: when each node is constructed, a random subset of features is selected and the best split is searched only among them. This is essentially the **Random Forest** algorithm." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5: RandomForestClassifier\n", "Let's build the **RandomForestClassifier** from *sklearn.ensemble* and study how its quality depends on the number of trees (N estimators), the number of features considered at each tree node (Max features), and the maximum tree depth (Max depth). 
" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(max_features=8, random_state=0)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RF_clf = ensemble.RandomForestClassifier(random_state=0, n_estimators = 100, max_features= max_features)\n", "RF_clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "rf_scoring = model_selection.cross_val_score(RF_clf, X, y, cv = 10)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.9 0.96111111 0.93888889 0.92222222 0.97777778 0.96111111\n", " 0.96666667 0.96648045 0.94972067 0.93296089]\n", "\n", "The mean of the CV scoring on RF 8 features: 0.9476939788950961\n" ] } ], "source": [ "print(rf_scoring)\n", "\n", "print('\\nThe mean of the CV scoring on RF', max_features\n", " ,'features:', rf_scoring.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We want to quality the RF classification on a given dataset depends on the following parameters:\n", " 1. The number of trees\n", " 2. The number of features selected when constructing each node of a tree\n", " 3. 
The limit on the depth of a tree" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RF_clf.get_params().keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we search for the best parameters of *RandomForestClassifier* using GridSearchCV." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1/1 [05:15<00:00, 315.59s/it]\n" ] } ], "source": [ "from tqdm import tqdm \n", "params ={ 'n_estimators' : range(2, 20), \n", " 'max_depth' : range(2, 10),\n", " 'max_features' : range(8, 64, 10)\n", " }\n", "# tqdm wraps a single iteration only to report the total search time\n", "for n in tqdm(range(1)): \n", " search_rf = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid= params)\n", " search_rf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(estimator=RandomForestClassifier(),\n", " param_grid={'max_depth': range(2, 10),\n", " 'max_features': range(8, 64, 10),\n", " 'n_estimators': range(2, 20)})" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_rf" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(max_depth=9, max_features=18, n_estimators=16)" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_rf.best_estimator_" ] }, { "cell_type": "code", "execution_count": 93, 
"metadata": {}, "outputs": [], "source": [ "grid_scores = search_rf.cv_results_\n", "#grid_scores\n", "#grid_scores[\"params\"]" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total parameters variations: 864\n", "Best params: {'max_depth': 9, 'max_features': 18, 'n_estimators': 16}\n" ] } ], "source": [ "# arrays with data for scoring depending on one of each parameter \n", "xx=[]\n", "yy=[]\n", "zz=[]\n", "print('Total parameters variations:', len(grid_scores['params']))\n", "print('Best params:', search_rf.best_params_)\n", "best_max_depth = search_rf.best_params_['max_depth']\n", "best_n_estimators = search_rf.best_params_['n_estimators']\n", "best_max_features = search_rf.best_params_['max_features']\n", "\n", "for mean_score, parameters in zip(grid_scores[\"mean_test_score\"], grid_scores[\"params\"]):\n", " #print(mean_score, parameters)\n", " if parameters[\"n_estimators\"] == best_n_estimators and \\\n", " parameters[\"max_features\"] == best_max_features:\n", " yy.append([np.sqrt(mean_score), parameters['max_depth']])\n", " if parameters[\"max_depth\"] == best_max_depth and \\\n", " parameters[\"max_features\"] == best_max_features:\n", " zz.append([np.sqrt(mean_score), parameters['n_estimators']])\n", " if parameters[\"max_depth\"] == best_max_depth and \\\n", " parameters[\"n_estimators\"] == best_n_estimators:\n", " xx.append([np.sqrt(mean_score), parameters['max_features']])" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8 \n", " [[0.828331320324343, 2], [0.8551200320811582, 3], [0.9041592054341305, 4], [0.9299464346950542, 5], [0.9459544249401011, 6], [0.9450788009243831, 7], [0.9547491612998596, 8], [0.9599807204295354, 9]] \n", "\n", "18 \n", " [[0.8779266393916481, 2], [0.9069447317880542, 3], [0.9290532249389274, 4], [0.9293546647273637, 5], [0.9403536214941346, 6], 
[0.9441999064414529, 7], [0.9506533940297671, 8], [0.9483153872322252, 9], [0.9550400601850904, 10], [0.9509536824793225, 11], [0.9553333002881406, 12], [0.9564956543872287, 13], [0.954456552405717, 14], [0.9588161352971158, 15], [0.9599807204295354, 16], [0.9585255751632453, 17], [0.9599766903700595, 18], [0.9591074141506128, 19]]\n", "6 \n", " [[0.9515385251478821, 8], [0.9599807204295354, 18], [0.9521213832506896, 28], [0.9544573630805719, 38], [0.9477261062755679, 48], [0.9403593813094615, 58]]\n" ] } ], "source": [ "#print(len(yy))\n", "print(len(yy), '\\n', yy, '\\n')\n", "print(len(zz), '\\n', zz)\n", "print(len(xx), '\\n', xx)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pylab.plot( list(x[1] for x in yy), list(x[0] for x in yy), label = \"Max depth\")\n", "pylab.xlabel(\"Max depth\")\n", "pylab.ylabel(\"Accuracy score\")\n", "pylab.title(\"RF classifier scoring on 'Max depth' parameter, \\nfixed 'N estimators' (best) = %2d and fixed 'Max features' (best) = %2d\" % (best_n_estimators, best_max_features))\n", "pylab.legend(loc = \"best\") \n", "pylab.rcParams[\"figure.figsize\"] = [10, 6]" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pylab.plot( list(x[1] for x in zz), list(x[0] for x in zz), label = \"N estimators\")\n", "pylab.xlabel(\"N estimators\")\n", "pylab.ylabel(\"Accuracy score\")\n", "pylab.title(\"RF classifier scoring on 'N estimators' parameter, \\nfixed 'Max depth' (best) = %2d and fixed 'Max features' (best) = %2d\" % ( best_max_depth, best_max_features) )\n", "pylab.legend(loc = \"best\") \n", "pylab.rcParams[\"figure.figsize\"] = [10, 6]" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pylab.plot( list(x[1] for x in xx), list(x[0] for x in xx), label = \"Max features\")\n", "pylab.xlabel(\"Max features\")\n", "pylab.ylabel(\"Accuracy score\")\n", "pylab.title(\"RF classifier scoring on 'Max features' parameter, \\nfixed 'Max depth' (best) = %2d and and 'N estimators' (best) = %2d\" % ( best_max_depth , best_n_estimators) ) \n", "pylab.legend(loc = \"best\") \n", "pylab.rcParams[\"figure.figsize\"] = [10, 6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Some statements related to the RF classification\n", "1) The random forest is highly over-trained with the growth of the number of trees - **False**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2) With a very small number of trees (5, 10, 15), a random forest performs worse than with a larger number of trees - **True**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3) As the number of trees in a random forest grows, at some point there are enough trees for a high quality classification, and then the quality does not change significantly. - **True**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4) With a large number of features (for a given dataset - 40, 50), the quality of classification becomes worse than with a small number of features (5, 10). This is due to the fact that the fewer features are selected at each node, the more different the trees are (because trees are highly unstable to changes in the training sample), and the better their composition works. - **True**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5) With a large number of features (40, 50, 60), the quality of classification is better than with a small number of features (5, 10). This is due to the fact that the more features - the more information about the objects, which means that the algorithm can make predictions more accurately. 
- **False**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6) With a small maximum tree depth (5-6), the quality of the random forest is much better than without a depth limit, because the trees do not overfit. As the depth of the trees increases, the quality deteriorates. - **False**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7) With a small maximum tree depth (5-6), the quality of the random forest is noticeably worse than without a depth limit, because the trees underfit. As the depth increases, the quality first improves and then levels off: because the predictions of different trees are averaged, the overfitting of individual trees in bagging does not affect the final quality (each tree overfits in its own way, and averaging compensates for it). - **True**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }