{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [sklearn.model_selection](http://scikit-learn.org/stable/modules/cross_validation.html)\n", "First we load the *Iris* dataset. Now we split the dataset into the *Train* and *Test* sets with the function *train_test_split()*. Test set proportion [to all data amount] should be defined as *test_size* parameter. In our case # 30% of data objects are in the Test set (0.3), the rest are in Train set." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection, datasets\n", "import numpy as np\n", "\n", "iris = datasets.load_iris()\n", "train_data, test_data, train_labels, test_labels = \\\n", " model_selection.train_test_split(iris.data, iris.target, test_size = 0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us check the proportion of Test data:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(test_data)/len(iris.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train set size and Test set size:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train set size: 105 objects \n", "Test set size: 45 objects\n" ] } ], "source": [ "print('Train set size: {} objects \\nTest set size: {} objects'\\\n", " .format(len(train_data), len(test_data)))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set head:\n", " [[6.5 3. 5.2 2. ]\n", " [5.8 2.8 5.1 2.4]\n", " [5.5 2.4 3.7 1. ]\n", " [6.7 3. 5. 1.7]\n", " [7. 3.2 4.7 1.4]]\n", "\n", "\n", "Class labels for the test set:\n", " [2 2 1 1 1 0 1 2 1 2 2 2 2 1 0 2 2 1 2 1 1 1 0 0 1 0 0 2 2 1 0 2 0 1 0 0 2\n", " 0 1 1 2 0 1 1 2]\n" ] } ], "source": [ "print('Test set, head:\\n', test_data[:5])\n", "print('\\n')\n", "print('Class labels for the test set:\\n', test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validation stratages\n", "Now we consider different cross-validation stratages. \n", "First we generate a small similarity of a dataset. Its elements values match the sequence number." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "range(0, 10)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = range(0,10)\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)\n", "This stratagy splits a dataset into **k** folds in a random way. In our case we take k=5. In each line the first set (8 elements) are for training and the last (2 eleements) are for testing. Each time there are different (yet consecutive) values in the test set: (0,1), (2,3)... 
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KFold(n_splits=5, random_state=None, shuffle=False) \n", "\n", "[2 3 4 5 6 7 8 9] [0 1]\n", "[0 1 4 5 6 7 8 9] [2 3]\n", "[0 1 2 3 6 7 8 9] [4 5]\n", "[0 1 2 3 4 5 8 9] [6 7]\n", "[0 1 2 3 4 5 6 7] [8 9]\n" ] } ], "source": [ "kf = model_selection.KFold(n_splits = 5)\n", "print(kf, '\\n')\n", "for train_indices, test_indices in kf.split(X):\n", " print(train_indices, test_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We add **shuffling** so that silmilar object (one class label) would not be in a test set." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 2 3 4 7 8 9] [5 6]\n", "[0 1 3 5 6 7 8 9] [2 4]\n", "[0 1 2 3 4 5 6 7] [8 9]\n", "[0 2 4 5 6 7 8 9] [1 3]\n", "[1 2 3 4 5 6 8 9] [0 7]\n" ] } ], "source": [ "kf = model_selection.KFold(n_splits = 5, shuffle = True)\n", "for train_indices, test_indices in kf.split(X):\n", " print(train_indices, test_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reproducible output across multiple function calls we add **random_state**." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 3 4 5 6 7 8] [2 9]\n", "[0 1 2 3 5 7 8 9] [4 6]\n", "[1 2 4 5 6 7 8 9] [0 3]\n", "[0 2 3 4 5 6 8 9] [1 7]\n", "[0 1 2 3 4 6 7 9] [5 8]\n" ] } ], "source": [ "kf = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)\n", "for train_indices, test_indices in kf.split(X):\n", " print(train_indices, test_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)\n", "The folds are made by preserving the percentage of samples for each class in training and test sets. \n", "Stratified's synonim is *class-conscious*." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset: [0 0 0 0 0 1 1 1 1 1] \n", "Folds\n", "[3 4 6 8 9] [0 1 2 5 7]\n", "[0 1 2 5 7] [3 4 6 8 9]\n" ] } ], "source": [ "y = np.array([0] * 5 + [1] * 5)\n", "print('Dataset:', y, '\\nFolds')\n", "\n", "skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)\n", "for train_indices, test_indices in skf.split(X, y):\n", " print(train_indices, test_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even if the indices values (class labels) are not consecutive that method splits the set preserving the percentage of samples for each class, **0** or **1**." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 0 1 0 1 0 1 0 1]\n", "[0 1 2 3 5] [4 6 7 8 9]\n", "[4 6 7 8 9] [0 1 2 3 5]\n" ] } ], "source": [ "target = np.array([0, 1] * 5)\n", "print(target)\n", "skf = model_selection.StratifiedKFold(n_splits = 2,shuffle = True)\n", "for train_indices, test_indices in skf.split(X, target):\n", " print(train_indices, test_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. [ShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)\n", "The method yields random permutation cross-validator. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. [ShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)\n", "This method yields a random permutation cross-validator. Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets." ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[5 1 8 2 4 9 7 0] [6 3]\n", "[1 4 2 7 5 0 3 9] [8 6]\n", "[4 7 2 5 1 0 3 6] [9 8]\n", "[8 9 4 2 3 1 5 7] [6 0]\n", "[7 8 9 2 0 4 1 6] [5 3]\n", "[0 5 6 7 9 4 3 1] [2 8]\n", "[0 2 6 9 1 4 7 3] [5 8]\n", "[3 7 8 4 9 0 5 6] [1 2]\n", "[0 1 4 3 8 5 9 2] [6 7]\n", "[0 2 5 4 1 9 3 7] [6 8]\n" ] } ], "source": [ "ss = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2)\n", "for train_indices, test_indices in ss.split(X):\n", " print(train_indices, test_indices)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. [StratifiedShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)\n", "This method combines the characteristics of the two methods above: the splits are random permutations that preserve the percentage of samples of each class." ] },
{ "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 1 1 1 1 1]\n", "[0 9 5 3 7 1 6 4] [8 2]\n", "[4 2 1 8 6 7 0 9] [5 3]\n", "[7 0 5 4 9 6 3 1] [2 8]\n", "[2 5 4 3 6 9 7 1] [8 0]\n" ] } ], "source": [ "target = np.array([0] * 5 + [1] * 5)\n", "print(target)\n", "\n", "sss = model_selection.StratifiedShuffleSplit(n_splits = 4, test_size = 0.2)\n", "for train_indices, test_indices in sss.split(X, target):\n", " print(train_indices, test_indices)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. [Leave-One-Out](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html)\n", "In this method **each sample is used once as a test set** (a singleton) while the remaining samples form the training set. This strategy is best suited to a *small dataset*." ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 2 3 4 5 6 7 8 9] [0]\n", "[0 2 3 4 5 6 7 8 9] [1]\n", "[0 1 3 4 5 6 7 8 9] [2]\n", "[0 1 2 4 5 6 7 8 9] [3]\n", "[0 1 2 3 5 6 7 8 9] [4]\n", "[0 1 2 3 4 6 7 8 9] [5]\n", "[0 1 2 3 4 5 7 8 9] [6]\n", "[0 1 2 3 4 5 6 8 9] [7]\n", "[0 1 2 3 4 5 6 7 9] [8]\n", "[0 1 2 3 4 5 6 7 8] [9]\n" ] } ], "source": [ "loo = model_selection.LeaveOneOut()\n", "\n", "for train_indices, test_index in loo.split(X):\n", " print(train_indices, test_index)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "Due to the high number of test sets (equal to the number of samples), the *Leave-One-Out* cross-validation method can be very costly. For large datasets one should favor *KFold*, *ShuffleSplit* or *StratifiedKFold*.\n", "\n", "More cross-validation methods can be found [here](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }