Categories
Data Mining Development

Linear models, Sklearn.linear_model, Classification

In this post we’ll show how to build classification linear models using the sklearn.linear.model module.

The code as an IPython notebook

Linear models

sklearn.liner_model – documentation on sklearn.linear_model

linear_model examples:

  • RidgeClassifier
  • SGDClassifier
  • SGDRegressor
  • LinearRegression
  • LogisticRegression
  • Lasso

All the examples of sklearn.linear_model

Module metrics is needed to evaluate the quality of aquired models.

import warnings
warnings.filterwarnings("ignore")

from matplotlib.colors import ListedColormap
from sklearn import model_selection, datasets, linear_model, metrics
import numpy as np
%pylab inline

1. Data generation

We choose a dataset of 2 features, labels forming 2 clouds.

  • centers – number of classes
  • cluster_std – standart deviation
blobs = datasets.make_blobs(centers = 2, cluster_std = 5.5, random_state=1)

Let’s visualize datasets in colors.

colors = ListedColormap(["red", "blue"])
pylab.figure(figsize(10, 10))
pylab.scatter([x[0] for x in blobs[0]], [x[1] for x in blobs[0]], c=blobs[1], cmap=colors)

Now we split the blobs dataset into the train and test sets.

train_data, test_data, train_labels, test_labels \
   = model_selection.train_test_split(blobs[0], blobs[1], \
   test_size = 0.3, random_state = 1)

2. Linear classification models

2.1 RidgeClassifier

We’ll create/build the linear classifier named RidgeClassifier.

ridge_classifier = linear_model.RidgeClassifier(random_state = 1)

We train the classifier object: pass to it the train data and the train labels generated earlier.

ridge_classifier.fit(train_data, train_labels)
RidgeClassifier(random_state=1)

Now we check how well the classsifier has been trained.

We apply the test data to get predictions.

ridge_predictions = ridge_classifier.predict(test_data)

Now we output the test labels (response values) and the predicted labels, ones generated by the classifier.

print("Actual labels:\n", test_labels)
print("\nPredicted labels:\n", ridge_predictions)
Actual labels:
 [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1]
Predicted labels:
 [0 0 0 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1]

Rating model

We now rate the classification model quality.

round(metrics.accuracy_score(test_labels, ridge_predictions), 2)
0.87

We output the features weights and intercept for the following approximation function:

y(x1, x2) = w0 + w1 * x1 + w2 * x2
print("[w1, w2]:", ridge_classifier.coef_ )
[w1, w2]: [[-0.0854443 -0.07273219]]
print("[w0]:",ridge_classifier.intercept_ )
[w0]: [-0.31250723]

2.2 Logistic Regression

The Logistic Regression is another model that we’ll train with the labeled data.

Create a classifier with the default parameters. Some of them are shown below:

  • penalty = L2. Taken as a parameter of regularization.
  • tolerance = 0.0001
log_regressor = linear_model.LogisticRegression(random_state = 1)

We train the set with the method fit().

log_regressor.fit(train_data, train_labels)
LogisticRegression(random_state=1)

We get predictions of the trained model. Besides prediction labels, the logistic regression build the probability model. That means that it may return a probability with which each object belongs to a certain class. Method predict_proba() returns the probability array.

lr_predictions = log_regressor.predict(test_data)
lr_proba_predictions = log_regressor.predict_proba(test_data)
print("Actual labels:\n", test_labels)
print("\nPredicted labels:\n", lr_predictions)
Actual labels:
 [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1]
Predicted labels:
 [0 1 1 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1]

We output the probability predictions. For each object we get 2 probability values. These define probability if it pertains to label “0” or “1”.

print(lr_proba_predictions[:10])
[[9.99254667e-01 7.45333217e-04]
 [4.08279523e-01 5.91720477e-01]
 [4.90541791e-01 5.09458209e-01]
 [3.78296027e-03 9.96217040e-01]
 [7.32221196e-01 2.67778804e-01]
 [2.44262899e-01 7.55737101e-01]
 [9.93761113e-01 6.23888724e-03]
 [9.78405579e-01 2.15944205e-02]
 [9.55344987e-01 4.46550128e-02]
 [1.68318566e-01 8.31681434e-01]]

We now rate the Logistic regression and RidgeClassifier models; compare quality.

print("Logistic regression", round(metrics.accuracy_score(test_labels, lr_predictions), 3))
print("\nRidgeClassifier:", round(metrics.accuracy_score(test_labels, ridge_predictions), 3))
Logistic regression 0.8
RidgeClassifier: 0.867

3. Quality assessment of cross-validation

3.1 Evaluate a score by cross-validation

The library provides a useful function cross_val_score() to evaluate a score of method by cross-validation.

  • [trained] method: ridge_classifier
  • data: blobs[0]
  • labels: blobs[1]
  • metric, scoring parameter. ‘accuracy’ – 1/N*∑[a(xi)==yi]
  • cv denotes a cross-validation splitting strategy. cv is the K parameter in the KFold strategy.
ridge_scoring = model_selection.cross_val_score(ridge_classifier, blobs[0], blobs[1], scoring = "accuracy", cv = 10)
lr_scoring = model_selection.cross_val_score(log_regressor, blobs[0], blobs[1], scoring = "accuracy", cv = 10)
print("RidgeClassifier:\n", ridge_scoring)
print("\nLogistic regression scoring:\n", lr_scoring)
RidgeClassifier:
 [0.8 0.9 0.9 0.9 1.  1.  0.7 0.9 0.9 0.8]
Logistic regression scoring:
 [0.8 0.9 0.9 0.9 1.  1.  0.7 0.9 0.9 0.8]

Let’s view the scoring statistics:

print("Ridge mean:{}, max:{}, min:{}, std:{}"\
    .format(round(ridge_scoring.mean(), 3), ridge_scoring.max(), \
    ridge_scoring.min(), round( ridge_scoring.std(), 3)))
Ridge mean:0.88, max:1.0, min:0.7, std:0.087
print("Log mean:{}, max:{}, min:{}, std:{}"\
    .format(round(lr_scoring.mean(), 3), lr_scoring.max(), \
    lr_scoring.min(),  round( lr_scoring.std(), 3)))
Log mean:0.88, max:1.0, min:0.7, std:0.087

3.2 Evaluate a score by cross-validation with specified scorer and cv_strategy

Suppose we want to score a non-standart metric and we want to specify the cross-validation strategy.

We first define a scorer and specify cross-validation stratagy:

scorer = metrics.make_scorer(metrics.accuracy_score)
cv_strategy = model_selection.StratifiedShuffleSplit(n_splits=20, test_size = 0.3, random_state = 2)
cv_strategy.get_n_splits(blobs[1])
20
ridge_scoring = model_selection.cross_val_score(ridge_classifier, \
 blobs[0], blobs[1], scoring = scorer, cv = cv_strategy)
lr_scoring = model_selection.cross_val_score(log_regressor, \
 blobs[0], blobs[1], scoring = scorer, cv = cv_strategy)
print("Ridge mean:{}, max:{}, min:{}, std:{}"\
   .format(round(ridge_scoring.mean(), 3), ridge_scoring.max(), \
    ridge_scoring.min(), round( ridge_scoring.std(), 3)))
Ridge mean:0.87, max:1.0, min:0.7666666666666667, std:0.06
print("Log mean:{}, max:{}, min:{}, std:{}"\
    .format(round(lr_scoring.mean(), 3), lr_scoring.max(), \
    lr_scoring.min(),  round( lr_scoring.std(), 3)))
Log mean:0.87, max:1.0, min:0.7666666666666667, std:0.061

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.