Cross-validation strategies and their application

In this post we'll get to know the cross-validation strategies available in the Sklearn module and show how to perform k-fold cross-validation with them. All the IPython notebook code was written for Python 3.6.

The iPython notebook code

Read more about how to work with the built-in datasets of Sklearn.


First we load the Iris dataset. Then we split it into Train and Test sets with the function train_test_split(). The test_size parameter sets the proportion of the Test set relative to the whole dataset. In our case 30% of the data objects go to the Test set (0.3) and the rest go to the Train set.

from sklearn import model_selection, datasets
import numpy as np
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = \
   model_selection.train_test_split(iris.data, iris.target, test_size = 0.3)

Let us check the proportion of Test data:
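
The notebook cell for this check is not shown in the post. A minimal sketch, assuming the split is made from iris.data and iris.target (the variable name test_share is ours, not the notebook's), could be:

```python
from sklearn import model_selection, datasets

iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = \
    model_selection.train_test_split(iris.data, iris.target, test_size=0.3)

# Share of all objects that ended up in the Test set
test_share = len(test_data) / len(iris.data)
print(test_share)
```

With 150 objects and test_size = 0.3 the Test set holds 45 objects, so the printed share is 0.3.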


Train set size and Test set size:

print("Train set size: {} objects \nTest set size: {} objects"\
      .format(len(train_data),  len(test_data)))
Train set size: 105 objects 
Test set size: 45 objects
Print the first Test set objects and the Test set class labels:
print("Test set, head:\n", test_data[:5])
print("Class labels for the test set:\n", test_labels)
Test set, head:
 [[6.5 3.  5.2 2. ]
 [5.8 2.8 5.1 2.4]
 [5.5 2.4 3.7 1. ]
 [6.7 3.  5.  1.7]
 [7.  3.2 4.7 1.4]]
Class labels for the test set:
 [2 2 1 1 1 0 1 2 1 2 2 2 2 1 0 2 2 1 2 1 1 1 0 0 1 0 0 2 2 1 0 2 0 1 0 0 2
 0 1 1 2 0 1 1 2]

Cross-validation strategies

Now we consider different cross-validation strategies. First we generate a small toy dataset in which the value of each element equals its index.

X = range(0,10)
range(0, 10)

1. KFold

This strategy splits a dataset into k folds. By default (shuffle=False) the folds are consecutive blocks of indices, not random ones. In our case we take k=5. In each output line the first set (8 elements) is for training and the second (2 elements) is for testing. Each time the test set holds different, yet consecutive, values: (0,1), (2,3)…

kf = model_selection.KFold(n_splits = 5)
print(kf, "\n")
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
KFold(n_splits=5, random_state=None, shuffle=False) 
[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]
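
To see where the consecutive test sets in the output above come from, here is a rough sketch that rebuilds the unshuffled folds by hand with NumPy. It only mirrors the behaviour for this toy example and is not meant as the library's actual implementation; the names manual_folds and sklearn_folds are ours.

```python
import numpy as np
from sklearn import model_selection

n_samples, k = 10, 5
indices = np.arange(n_samples)

# Cut the index range into k consecutive blocks; each block is one test fold
manual_folds = []
for test_indices in np.array_split(indices, k):
    # Everything outside the current block is used for training
    train_indices = np.setdiff1d(indices, test_indices)
    manual_folds.append((train_indices, test_indices))
    print(train_indices, test_indices)

# The folds KFold itself produces for the same data, for comparison
kf = model_selection.KFold(n_splits=k)
sklearn_folds = list(kf.split(indices))
```

For this dataset the hand-made folds coincide with the ones KFold returns.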

We add shuffling so that similar objects (for example, objects with the same class label in a dataset sorted by class) do not all end up in the same test set.

kf = model_selection.KFold(n_splits = 5, shuffle = True)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 2 3 4 7 8 9] [5 6]
[0 1 3 5 6 7 8 9] [2 4]
[0 1 2 3 4 5 6 7] [8 9]
[0 2 4 5 6 7 8 9] [1 3]
[1 2 3 4 5 6 8 9] [0 7]

For reproducible output across multiple function calls we add random_state.

kf = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 3 4 5 6 7 8] [2 9]
[0 1 2 3 5 7 8 9] [4 6]
[1 2 4 5 6 7 8 9] [0 3]
[0 2 3 4 5 6 8 9] [1 7]
[0 1 2 3 4 6 7 9] [5 8]
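
The effect of random_state can be checked directly. In the sketch below the helper collect_test_folds is our own illustrative function, not part of Sklearn. It builds the shuffled folds twice with the same seed and compares them.

```python
import numpy as np
from sklearn import model_selection

X = np.arange(10)

def collect_test_folds(random_state):
    """Return the test indices of every fold as a list of lists."""
    kf = model_selection.KFold(n_splits=5, shuffle=True,
                               random_state=random_state)
    return [test.tolist() for _, test in kf.split(X)]

first_run = collect_test_folds(random_state=1)
second_run = collect_test_folds(random_state=1)

# Same seed, same folds
print(first_run == second_run)
```

Without random_state the two runs would most likely give different folds.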

2. StratifiedKFold

The folds are made by preserving the percentage of samples of each class in the training and test sets. A synonym for stratified is class-conscious.

y = np.array([0] * 5 + [1] * 5)
print("Dataset:", y, "\nFolds")
skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)
for train_indices, test_indices in skf.split(X, y):
    print(train_indices, test_indices)
Dataset: [0 0 0 0 0 1 1 1 1 1]
[3 4 6 8 9] [0 1 2 5 7]
[0 1 2 5 7] [3 4 6 8 9]

Even if the samples of one class are not stored consecutively (here the class labels alternate), the method still splits the set while preserving the percentage of samples of each class, 0 or 1.

target = np.array([0, 1] * 5)
print(target)
skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True)
for train_indices, test_indices in skf.split(X, target):
    print(train_indices, test_indices)
[0 1 0 1 0 1 0 1 0 1]
[0 1 2 3 5] [4 6 7 8 9]
[4 6 7 8 9] [0 1 2 3 5]
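
The benefit of stratification is easier to see on imbalanced labels. The sketch below uses a hypothetical label vector of our own (8 objects of class 0 and 4 of class 1, not taken from the notebook) and counts the classes in every test fold for StratifiedKFold and for plain KFold.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical imbalanced labels: 8 objects of class 0, 4 objects of class 1
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y))

# Class counts [n_class_0, n_class_1] in each test fold
skf = model_selection.StratifiedKFold(n_splits=4)
skf_counts = [np.bincount(y[test], minlength=2).tolist()
              for _, test in skf.split(X, y)]

kf = model_selection.KFold(n_splits=4)
kf_counts = [np.bincount(y[test], minlength=2).tolist()
             for _, test in kf.split(X)]

print("StratifiedKFold:", skf_counts)
print("KFold:          ", kf_counts)
```

Every stratified test fold keeps the 2:1 class ratio of the whole dataset, while unshuffled KFold yields folds made up of a single class.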

3. ShuffleSplit

ShuffleSplit is a random permutation cross-validator: every split is an independent random partition of the data into training and test sets. Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

ss = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2)
for train_indices, test_indices in ss.split(X):
    print(train_indices, test_indices)
[5 1 8 2 4 9 7 0] [6 3]
[1 4 2 7 5 0 3 9] [8 6]
[4 7 2 5 1 0 3 6] [9 8]
[8 9 4 2 3 1 5 7] [6 0]
[7 8 9 2 0 4 1 6] [5 3]
[0 5 6 7 9 4 3 1] [2 8]
[0 2 6 9 1 4 7 3] [5 8]
[3 7 8 4 9 0 5 6] [1 2]
[0 1 4 3 8 5 9 2] [6 7]
[0 2 5 4 1 9 3 7] [6 8]
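
Because the splits are drawn independently, an object may land in several test sets or in none. A small sketch to illustrate this (the random_state value and the name test_counts are our own choices) counts how often each index is used for testing.

```python
import numpy as np
from sklearn import model_selection

X = np.arange(10)
ss = model_selection.ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# How many times each index lands in a test set over all splits
test_counts = np.zeros(len(X), dtype=int)
for train_indices, test_indices in ss.split(X):
    test_counts[test_indices] += 1

print(test_counts)
```

With KFold every index would be counted exactly once. Here the counts generally differ from index to index.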

4. StratifiedShuffleSplit

The method combines the characteristics of StratifiedKFold and ShuffleSplit: the splits are random, yet each of them preserves the percentage of samples of each class.

target = np.array([0] * 5 + [1] * 5)
print(target)
sss = model_selection.StratifiedShuffleSplit(n_splits = 4, test_size = 0.2)
for train_indices, test_indices in sss.split(X, target):
    print(train_indices, test_indices)
[0 0 0 0 0 1 1 1 1 1]
[0 9 5 3 7 1 6 4] [8 2]
[4 2 1 8 6 7 0 9] [5 3]
[7 0 5 4 9 6 3 1] [2 8]
[2 5 4 3 6 9 7 1] [8 0]
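
To confirm that every random test set stays balanced, one can look at the class labels behind the test indices. This is only a sketch; the random_state value and the name sss_test_labels are our assumptions.

```python
import numpy as np
from sklearn import model_selection

target = np.array([0] * 5 + [1] * 5)
X = np.arange(len(target))

sss = model_selection.StratifiedShuffleSplit(n_splits=4, test_size=0.2,
                                             random_state=0)

# Class labels of the two test objects in every split
sss_test_labels = [sorted(target[test].tolist())
                   for _, test in sss.split(X, target)]
print(sss_test_labels)
```

Each test set holds one object of class 0 and one of class 1, whichever indices happen to be drawn.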

5. Leave-One-Out

In this method each sample is used once as a test set (a singleton) while the remaining samples form the training set. This strategy works best for small datasets.

loo = model_selection.LeaveOneOut()
for train_indices, test_index in loo.split(X):
    print(train_indices, test_index)
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]


Due to the high number of test sets (which is the same as the number of samples) the Leave-One-Out cross-validation method can be very costly. For large datasets one should favor KFold, ShuffleSplit or StratifiedKFold.
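
The cost difference shows up as soon as a model is evaluated. The sketch below is an illustration under our own assumptions: the KNeighborsClassifier model and the 5-fold setting are arbitrary choices, not something the notebook prescribes. It scores the same classifier on Iris with Leave-One-Out and with KFold and compares the number of fits.

```python
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
clf = KNeighborsClassifier()  # assumed model, any estimator would do

loo_scores = model_selection.cross_val_score(
    clf, iris.data, iris.target, cv=model_selection.LeaveOneOut())
kf_scores = model_selection.cross_val_score(
    clf, iris.data, iris.target,
    cv=model_selection.KFold(n_splits=5, shuffle=True, random_state=0))

# One score per split: Leave-One-Out needs as many fits as there are samples
print(len(loo_scores), len(kf_scores))
```

On Iris this means 150 model fits for Leave-One-Out against 5 for KFold.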

More cross-validation methods are described in the scikit-learn documentation.

