{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear models\n", "sklearn.linear_model - [documentation on sklearn.linear_model](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)\n", "\n", "**linear_model examples:**\n", "* RidgeClassifier\n", "* SGDClassifier\n", "* SGDRegressor\n", "* LinearRegression\n", "* LogisticRegression\n", "* Lasso\n", "\n", "[All the examples of sklearn.linear_model](http://scikit-learn.org/stable/modules/linear_model.html#linear-model)\n", "\n", "The *metrics* module is needed to evaluate the quality of the trained models." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "from matplotlib.colors import ListedColormap\n", "from sklearn import model_selection, datasets, linear_model, metrics\n", "\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Data generation\n", "We generate a dataset with 2 features whose points form 2 clouds, one per class.\n", " - **centers** - number of cluster centers (classes)\n", " - **cluster_std** - standard deviation of the clusters" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "blobs = datasets.make_blobs(centers = 2, cluster_std = 5.5, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the dataset, coloring the points by class." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "colors = ListedColormap(['red', 'blue'])\n", "\n", "pylab.figure(figsize=(10, 10))\n", "pylab.scatter([x[0] for x in blobs[0]], [x[1] for x in blobs[0]], c=blobs[1], cmap=colors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we split the **blobs** dataset into the **train** and **test** sets." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "train_data, test_data, train_labels, test_labels = model_selection.train_test_split(blobs[0], blobs[1], \n", " test_size = 0.3,\n", " random_state = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Linear classification models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 RidgeClassifier\n", "We create a linear classifier, **[RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier)**." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "ridge_classifier = linear_model.RidgeClassifier(random_state = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We train the classifier by passing it the training data and training labels generated earlier." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RidgeClassifier(random_state=1)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge_classifier.fit(train_data, train_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we check how well the classifier has been trained. \n", "\n", "We apply the classifier to the test data to get predictions." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "ridge_predictions = ridge_classifier.predict(test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we print the test labels (the true response values) next to the labels predicted by the classifier." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual labels:\n", " [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1]\n", "\n", "Predicted labels:\n", " [0 0 0 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1]\n" ] } ], "source": [ "print('Actual labels:\\n', test_labels)\n", "print('\\nPredicted labels:\\n', ridge_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now **evaluate the quality of the classification model**." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.87" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(metrics.accuracy_score(test_labels, ridge_predictions), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We output the feature weights and the intercept. 
$y(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2$" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[w1, w2]: [[-0.0854443 -0.07273219]]\n" ] } ], "source": [ "print('[w1, w2]:', ridge_classifier.coef_ )" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[w0]: [-0.31250723]\n" ] } ], "source": [ "print('[w0]:',ridge_classifier.intercept_ )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 LogisticRegression\n", "**[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)** is another model that we'll train on the labeled data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a classifier with the default parameters. Some of them are shown below: \n", " - penalty = **L2**: the type of [regularization](/add-regularization-in-linear-regression-model/). \n", " - tol = 0.0001: the tolerance for the stopping criterion. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "log_regressor = linear_model.LogisticRegression(random_state = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We train the model on the training set with the method *fit()*." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(random_state=1)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "log_regressor.fit(train_data, train_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get predictions of the trained model. \n", "Besides predicted labels, logistic regression builds a **probability model**. 
\n", "That means it can return the probability that each object belongs to each class.\n", "Method *predict_proba()* returns the probability array." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "lr_predictions = log_regressor.predict(test_data)\n", "lr_proba_predictions = log_regressor.predict_proba(test_data)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual labels:\n", " [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1]\n", "\n", "Predicted labels:\n", " [0 1 1 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1]\n" ] } ], "source": [ "print('Actual labels:\\n', test_labels)\n", "print('\\nPredicted labels:\\n', lr_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We output the probability predictions.\n", "For each object we get 2 probability values. \n", "These are the probabilities that the object belongs to class \"0\" and to class \"1\", respectively." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[9.99254667e-01 7.45333217e-04]\n", " [4.08279523e-01 5.91720477e-01]\n", " [4.90541791e-01 5.09458209e-01]\n", " [3.78296027e-03 9.96217040e-01]\n", " [7.32221196e-01 2.67778804e-01]\n", " [2.44262899e-01 7.55737101e-01]\n", " [9.93761113e-01 6.23888724e-03]\n", " [9.78405579e-01 2.15944205e-02]\n", " [9.55344987e-01 4.46550128e-02]\n", " [1.68318566e-01 8.31681434e-01]]\n" ] } ], "source": [ "print(lr_proba_predictions[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now **evaluate the Logistic regression** and **RidgeClassifier models** and compare their quality." 
] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic regression 0.8\n", "\n", "RidgeClassifier: 0.867\n" ] } ], "source": [ "print('Logistic regression', round(metrics.accuracy_score(test_labels, lr_predictions), 3))\n", "print('\\nRidgeClassifier:', round(metrics.accuracy_score(test_labels, ridge_predictions), 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Quality assessment with cross-validation\n", "#### 3.1 Evaluate a score by cross-validation\n", "The library provides a useful function **[cross_val_score()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)** to evaluate an estimator's score by cross-validation.\n", " - estimator: **ridge_classifier** (it is cloned and refit on each split)\n", " - data: blobs[0]\n", " - labels: blobs[1]\n", " - metric, the **scoring** parameter: 'accuracy' - $\\frac{1}{N}\\sum_{i}[a(x_i) = y_i]$ \n", " - cv: the cross-validation splitting strategy. An integer value sets the number of folds **K**, as in the [KFold](/cross-validation-strategies-application/#1.-KFold) strategy (for classifiers, scikit-learn applies the stratified variant)." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "ridge_scoring = model_selection.cross_val_score(ridge_classifier, blobs[0], blobs[1], scoring = 'accuracy', cv = 10)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "lr_scoring = model_selection.cross_val_score(log_regressor, blobs[0], blobs[1], scoring = 'accuracy', cv = 10)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RidgeClassifier:\n", " [0.8 0.9 0.9 0.9 1. 1. 0.7 0.9 0.9 0.8]\n", "\n", "Logistic regression scoring:\n", " [0.8 0.9 0.9 0.9 1. 1. 
0.7 0.9 0.9 0.8]\n" ] } ], "source": [ "print('RidgeClassifier:\\n', ridge_scoring)\n", "print('\\nLogistic regression scoring:\\n', lr_scoring)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view the scoring statistics:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ridge mean:0.88, max:1.0, min:0.7, std:0.087\n" ] } ], "source": [ "print('Ridge mean:{}, max:{}, min:{}, std:{}'.format(round(ridge_scoring.mean(), 3), ridge_scoring.max(), \n", " ridge_scoring.min(), round( ridge_scoring.std(), 3)))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Log mean:0.88, max:1.0, min:0.7, std:0.087\n" ] } ], "source": [ "print('Log mean:{}, max:{}, min:{}, std:{}'.format(round(lr_scoring.mean(), 3), lr_scoring.max(), \n", " lr_scoring.min(), round( lr_scoring.std(), 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Evaluate a score by cross-validation with specified *scorer* and *cv_strategy*\n", "Suppose we want to **score with a non-standard metric** and **specify the cross-validation strategy**.\n", "\n", "We first define a *scorer* and specify the cross-validation strategy:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scorer = metrics.make_scorer(metrics.accuracy_score)\n", "\n", "cv_strategy = model_selection.StratifiedShuffleSplit(n_splits=20, test_size = 0.3, random_state = 2)\n", "cv_strategy.get_n_splits(blobs[1])" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "ridge_scoring = model_selection.cross_val_score(ridge_classifier, blobs[0], blobs[1], scoring = scorer, cv = cv_strategy)" ] }, { "cell_type": "code", "execution_count": 61, 
"metadata": {}, "outputs": [], "source": [ "lr_scoring = model_selection.cross_val_score(log_regressor, blobs[0], blobs[1], scoring = scorer, cv = cv_strategy)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ridge mean:0.87, max:1.0, min:0.7666666666666667, std:0.06\n" ] } ], "source": [ "print('Ridge mean:{}, max:{}, min:{}, std:{}'.format(round(ridge_scoring.mean(), 3), ridge_scoring.max(), \n", " ridge_scoring.min(), round( ridge_scoring.std(), 3)))" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Log mean:0.87, max:1.0, min:0.7666666666666667, std:0.061\n" ] } ], "source": [ "print('Log mean:{}, max:{}, min:{}, std:{}'.format(round(lr_scoring.mean(), 3), lr_scoring.max(), \n", " lr_scoring.min(), round( lr_scoring.std(), 3)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }