In this post we’ll get to know the cross-validation strategies from the Sklearn module and show how to perform k-fold cross-validation. All the iPython notebook code works with Python 3.6.
The iPython notebook code
Read more about how to work with the built-in datasets of Sklearn.
sklearn.model_selection
First we load the Iris dataset. Then we split it into Train and Test sets with the function train_test_split(). The Test set proportion (relative to the whole dataset) is set by the test_size parameter. In our case 30% of the data objects go into the Test set (test_size = 0.3), and the rest stay in the Train set.
from sklearn import model_selection, datasets
import numpy as np

iris = datasets.load_iris()

train_data, test_data, train_labels, test_labels = \
    model_selection.train_test_split(iris.data, iris.target, test_size = 0.3)
Let us check the proportion of Test data:
len(test_data)/len(iris.data)
0.3
Train set size and Test set size:
print("Train set size: {} objects \nTest set size: {} objects"\ .format(len(train_data), len(test_data)))
Train set size: 105 objects
Test set size: 45 objects
print("Test set, head:\n", test_data[:5]) print("\n") print("Class labels for the test set:\n", test_labels)
Test set, head:
 [[6.5 3.  5.2 2. ]
 [5.8 2.8 5.1 2.4]
 [5.5 2.4 3.7 1. ]
 [6.7 3.  5.  1.7]
 [7.  3.2 4.7 1.4]]


Class labels for the test set:
 [2 2 1 1 1 0 1 2 1 2 2 2 2 1 0 2 2 1 2 1 1 1 0 0 1 0 0 2 2 1 0 2 0 1 0 0 2 0 1 1 2 0 1 1 2]
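As a quick side check (a small sketch, not part of the original notebook), we can count how many objects of each class ended up in the randomly drawn test set. A plain random split does not guarantee balanced class proportions, which is exactly what the stratified strategies below address.

# Hypothetical check: counts of classes 0, 1 and 2 in the random test split.
print(np.bincount(test_labels))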
Cross-validation strategies
Now we consider different cross-validation strategies. First we generate a small mock dataset whose element values match their index numbers.
X = range(0, 10)
X
range(0, 10)
1. KFold
This strategy splits the dataset into k consecutive folds (no shuffling by default). In our case we take k=5. In each line the first set (8 elements) is for training and the last one (2 elements) is for testing. Each split has different (yet consecutive) values in the test set: (0,1), (2,3)…
kf = model_selection.KFold(n_splits = 5)
print(kf, "\n")

for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
KFold(n_splits=5, random_state=None, shuffle=False) 

[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]
We add shuffling so that similar objects (for example, those sharing one class label) do not all end up in the same test set.
kf = model_selection.KFold(n_splits = 5, shuffle = True)

for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 2 3 4 7 8 9] [5 6]
[0 1 3 5 6 7 8 9] [2 4]
[0 1 2 3 4 5 6 7] [8 9]
[0 2 4 5 6 7 8 9] [1 3]
[1 2 3 4 5 6 8 9] [0 7]
For reproducible output across multiple function calls we also set the random_state parameter.
kf = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)

for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 3 4 5 6 7 8] [2 9]
[0 1 2 3 5 7 8 9] [4 6]
[1 2 4 5 6 7 8 9] [0 3]
[0 2 3 4 5 6 8 9] [1 7]
[0 1 2 3 4 6 7 9] [5 8]
2. StratifiedKFold
The folds are made by preserving the percentage of samples for each class in the training and test sets. A synonym for “stratified” here is “class-conscious”.
y = np.array([0] * 5 + [1] * 5)
print("Dataset:", y, "\nFolds")

skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)
for train_indices, test_indices in skf.split(X, y):
    print(train_indices, test_indices)
Dataset: [0 0 0 0 0 1 1 1 1 1]
Folds
[3 4 6 8 9] [0 1 2 5 7]
[0 1 2 5 7] [3 4 6 8 9]
Even when the class labels are not grouped consecutively, the method still splits the set while preserving the percentage of samples for each class (0 or 1).
target = np.array([0, 1] * 5)
print(target)

skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True)
for train_indices, test_indices in skf.split(X, target):
    print(train_indices, test_indices)
[0 1 0 1 0 1 0 1 0 1]
[0 1 2 3 5] [4 6 7 8 9]
[4 6 7 8 9] [0 1 2 3 5]
3. ShuffleSplit
This method yields a random permutation cross-validator. Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
ss = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2)

for train_indices, test_indices in ss.split(X):
    print(train_indices, test_indices)
[5 1 8 2 4 9 7 0] [6 3]
[1 4 2 7 5 0 3 9] [8 6]
[4 7 2 5 1 0 3 6] [9 8]
[8 9 4 2 3 1 5 7] [6 0]
[7 8 9 2 0 4 1 6] [5 3]
[0 5 6 7 9 4 3 1] [2 8]
[0 2 6 9 1 4 7 3] [5 8]
[3 7 8 4 9 0 5 6] [1 2]
[0 1 4 3 8 5 9 2] [6 7]
[0 2 5 4 1 9 3 7] [6 8]
4. StratifiedShuffleSplit
This method combines the characteristics of the two strategies above: it makes random splits like ShuffleSplit while preserving the class proportions like StratifiedKFold.
target = np.array([0] * 5 + [1] * 5)
print(target)

sss = model_selection.StratifiedShuffleSplit(n_splits = 4, test_size = 0.2)
for train_indices, test_indices in sss.split(X, target):
    print(train_indices, test_indices)
[0 0 0 0 0 1 1 1 1 1]
[0 9 5 3 7 1 6 4] [8 2]
[4 2 1 8 6 7 0 9] [5 3]
[7 0 5 4 9 6 3 1] [2 8]
[2 5 4 3 6 9 7 1] [8 0]
5. Leave-One-Out
In this method each sample is used once as a test set (a singleton), while the remaining samples form the training set. This strategy works best on small datasets.
loo = model_selection.LeaveOneOut()

for train_indices, test_index in loo.split(X):
    print(train_indices, test_index)
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
Conclusion
Due to the high number of test sets (which equals the number of samples), the Leave-One-Out cross-validation method can be very costly. For large datasets one should favor KFold, ShuffleSplit or StratifiedKFold.
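As a closing illustration (a minimal sketch, not part of the original notebook), any of the splitters shown above can be passed to model_selection.cross_val_score through its cv parameter. Here we assume a LogisticRegression classifier on the Iris data; any estimator would do.

from sklearn import datasets, model_selection
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
clf = LogisticRegression(max_iter = 200)  # assumed estimator for the sketch

# Any splitter (KFold, ShuffleSplit, StratifiedKFold, ...) can serve as the cv argument.
kf = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)
scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv = kf)
print(scores, scores.mean())

The mean of the returned scores gives a single cross-validated estimate of model quality.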
You may find more cross-validation methods here.