
Cross-validation strategies and their application

In this post we’ll get to know the cross-validation strategies from the Sklearn module. We’ll show how to perform k-fold cross-validation and its variants. All the IPython notebook code works with Python 3.6.

The IPython notebook code

Read more on how to work with the built-in datasets of Sklearn.

sklearn.model_selection

First we load the Iris dataset. Then we split it into the Train and Test sets with the function train_test_split(). The Test set proportion (relative to the whole dataset) is defined by the test_size parameter. In our case 30% of the data objects go into the Test set (test_size=0.3) and the rest stay in the Train set.

from sklearn import model_selection, datasets
import numpy as np
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = \
   model_selection.train_test_split(iris.data, iris.target, test_size = 0.3)

Let us check the proportion of Test data:

len(test_data)/len(iris.data)
0.3

Train set size and Test set size:

print("Train set size: {} objects \nTest set size: {} objects"\
      .format(len(train_data),  len(test_data)))
Train set size: 105 objects 
Test set size: 45 objects

Print them out:

print("Test set, head:\n", test_data[:5])
print("\n")
print("Class labels for the test set:\n", test_labels)
Test set, head:
 [[6.5 3.  5.2 2. ]
 [5.8 2.8 5.1 2.4]
 [5.5 2.4 3.7 1. ]
 [6.7 3.  5.  1.7]
 [7.  3.2 4.7 1.4]]
Class labels for the test set:
 [2 2 1 1 1 0 1 2 1 2 2 2 2 1 0 2 2 1 2 1 1 1 0 0 1 0 0 2 2 1 0 2 0 1 0 0 2
 0 1 1 2 0 1 1 2]
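
Note that train_test_split() shuffles the data before splitting, so repeated calls produce different partitions. As a small added sketch (not part of the original notebook), passing random_state makes the split reproducible:

# Fixing random_state makes the random Train/Test split reproducible
train_data, test_data, train_labels, test_labels = \
   model_selection.train_test_split(iris.data, iris.target,
                                    test_size = 0.3, random_state = 0)
print(len(train_data), len(test_data))  # 105 45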

Cross-validation strategies

Now we consider different cross-validation strategies. First we generate a small mock dataset; its element values simply match their sequence numbers.

X = range(0,10)
X
range(0, 10)

1. KFold

This strategy splits a dataset into k consecutive folds (without shuffling by default). In our case we take k=5. In each output line the first array (8 elements) contains the training indices and the second one (2 elements) contains the test indices. Each time there are different (and, without shuffling, consecutive) values in the test set: (0, 1), (2, 3)…

kf = model_selection.KFold(n_splits = 5)
print(kf, "\n")
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
KFold(n_splits=5, random_state=None, shuffle=False) 
[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]

We add shuffling so that similar objects (for example, objects with the same class label) do not all end up in the same test fold.

kf = model_selection.KFold(n_splits = 5, shuffle = True)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 2 3 4 7 8 9] [5 6]
[0 1 3 5 6 7 8 9] [2 4]
[0 1 2 3 4 5 6 7] [8 9]
[0 2 4 5 6 7 8 9] [1 3]
[1 2 3 4 5 6 8 9] [0 7]

For reproducible output across multiple function calls we set the random_state parameter.

kf = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 3 4 5 6 7 8] [2 9]
[0 1 2 3 5 7 8 9] [4 6]
[1 2 4 5 6 7 8 9] [0 3]
[0 2 3 4 5 6 8 9] [1 7]
[0 1 2 3 4 6 7 9] [5 8]
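
A quick way to see that the folds are indeed reproducible (this check is an added sketch, not part of the original notebook) is to iterate over the same splitter twice and compare the test folds:

# Two passes over a splitter with a fixed random_state give identical folds
kf = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)
first_pass  = [list(test) for _, test in kf.split(X)]
second_pass = [list(test) for _, test in kf.split(X)]
print(first_pass == second_pass)  # True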

2. StratifiedKFold

The folds are made by preserving the percentage of samples of each class in the training and test sets. A synonym for “stratified” is “class-conscious”.

y = np.array([0] * 5 + [1] * 5)
print("Dataset:", y, "\nFolds")
skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)
for train_indices, test_indices in skf.split(X, y):
    print(train_indices, test_indices)
Dataset: [0 0 0 0 0 1 1 1 1 1]
Folds
[3 4 6 8 9] [0 1 2 5 7]
[0 1 2 5 7] [3 4 6 8 9]

Even when the class labels do not come in consecutive blocks, the method splits the set while preserving the percentage of samples of each class, 0 or 1.

target = np.array([0, 1] * 5)
print(target)
skf = model_selection.StratifiedKFold(n_splits = 2,shuffle = True)
for train_indices, test_indices in skf.split(X, target):
    print(train_indices, test_indices)
[0 1 0 1 0 1 0 1 0 1]
[0 1 2 3 5] [4 6 7 8 9]
[4 6 7 8 9] [0 1 2 3 5]
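
To make “preserving the percentage of samples for each class” concrete, we can count the labels that land in every test fold. This small check (an added sketch, not part of the original notebook) applies np.bincount to the labels selected by the test indices:

# Count how many samples of class 0 and class 1 fall into each test fold
for train_indices, test_indices in skf.split(X, target):
    print("test fold class counts:", np.bincount(target[test_indices]))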

3. ShuffleSplit

This method yields a random permutation cross-validator. Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

ss = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2)
for train_indices, test_indices in ss.split(X):
    print(train_indices, test_indices)
[5 1 8 2 4 9 7 0] [6 3]
[1 4 2 7 5 0 3 9] [8 6]
[4 7 2 5 1 0 3 6] [9 8]
[8 9 4 2 3 1 5 7] [6 0]
[7 8 9 2 0 4 1 6] [5 3]
[0 5 6 7 9 4 3 1] [2 8]
[0 2 6 9 1 4 7 3] [5 8]
[3 7 8 4 9 0 5 6] [1 2]
[0 1 4 3 8 5 9 2] [6 7]
[0 2 5 4 1 9 3 7] [6 8]
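
To illustrate that the random splits may overlap, we can count how often each object appears in a test fold across the ten splits (this check is an added sketch, not part of the original notebook; some indices turn up several times while others may never be tested at all):

from collections import Counter

# Count how many times each index ends up in a test fold across all splits
test_counts = Counter()
for train_indices, test_indices in ss.split(X):
    test_counts.update(test_indices.tolist())
print(test_counts)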

4. StratifiedShuffleSplit

This method combines the characteristics of the two methods above: it makes random shuffled splits while preserving the class proportions in each of them.

target = np.array([0] * 5 + [1] * 5)
print(target)
sss = model_selection.StratifiedShuffleSplit(n_splits = 4, test_size = 0.2)
for train_indices, test_indices in sss.split(X, target):
    print(train_indices, test_indices)
[0 0 0 0 0 1 1 1 1 1]
[0 9 5 3 7 1 6 4] [8 2]
[4 2 1 8 6 7 0 9] [5 3]
[7 0 5 4 9 6 3 1] [2 8]
[2 5 4 3 6 9 7 1] [8 0]

5. Leave-One-Out

In this method each sample is used once as a test set (a singleton) while the remaining samples form the training set. This strategy is best suited to small datasets.

loo = model_selection.LeaveOneOut()
for train_indices, test_index in loo.split(X):
    print(train_indices, test_index)
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]

Conclusion

Due to the high number of test sets (which is the same as the number of samples) the Leave-One-Out cross-validation method can be very costly. For large datasets one should favor KFold, ShuffleSplit or StratifiedKFold.
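
In practice these splitters are rarely iterated by hand; they are usually passed to an evaluation helper such as model_selection.cross_val_score. The sketch below (the choice of a k-nearest-neighbors classifier is an illustrative assumption, not from the original post) scores a model on the Iris dataset with a shuffled, reproducible 5-fold split:

from sklearn.neighbors import KNeighborsClassifier

# Evaluate a classifier on Iris using a shuffled 5-fold cross-validation
cv = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 1)
scores = model_selection.cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv = cv)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())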

You may find more cross-validation methods here.
