{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## sklearn.linear_model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from matplotlib.colors import ListedColormap\n", "from sklearn import model_selection, datasets, linear_model, metrics\n", "\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear regression models\n", "\n", "#### Data generation\n", "\n", "We build a dataset with 2 features: one is informative and the other is non-informative (redundant). Passing **coef=True** additionally makes **make_regression()** return the coefficients of the underlying linear function in the **coef** array, not only the data." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "data, target, coef = datasets.make_regression(n_features = 2, n_informative = 1, n_targets = 1, \n", " noise = 5., coef = True, random_state = 2)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coefficients: [38.07925837 0. ] intercept: -0.13052467965349365\n" ] } ], "source": [ "# Note: linear_regressor is fitted further below; run this cell after that one\n", "print('Coefficients: ', coef, 'intercept: ', linear_regressor.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We plot the target label against each feature.\n", " - Red points: y=f(feature1)\n", " - Blue points: y=f(feature2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pylab.scatter(data[:,0], target, color = 'r')\n", "pylab.scatter(data[:,1], target, color = 'b')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot above shows which of the 2 features is informative: the target grows linearly with the red feature, while the blue one shows no trend." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's build a model and view its coefficients.\n", "We split the data into train and test sets." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "train_data, test_data, train_labels, test_labels = model_selection.train_test_split(data, target, \n", " test_size = 0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LinearRegression on the train data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "linear_regressor = linear_model.LinearRegression() # regressor\n", "linear_regressor.fit(train_data, train_labels)\n", "predictions = linear_regressor.predict(test_data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ -76.75213382 34.35183007 -11.18242389 -61.47026695 44.66274342\n", " -13.26392817 18.17188553 25.7124082 -19.16792315 101.14760598\n", " -105.77758163 23.87701013 12.42286854 -18.57607726 21.20540389\n", " 38.36241814 -45.38589148 29.8208999 -84.32102748 0.34799656\n", " 11.96165156 39.70663436 41.95683853 -10.06708677 -63.4056294\n", " -16.30914909 -45.27502383 -57.46293828 28.15553021 -17.27897399]\n" ] } ], "source": [ "# original labels of the test objects\n", "print(test_labels)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ -68.31690488 38.87063362 -12.62644748 -55.97354269 50.53828408\n", " -15.92863777 18.48923956 28.00480498 -10.77148765 95.936183\n", " -100.85245811 31.5152017 
6.88671402 -24.57940803 16.71365889\n", " 41.26798296 -43.31634025 31.48688292 -80.45697523 -1.51872908\n", " 13.95215963 37.61637807 43.5105848 -9.47060759 -59.39279793\n", " -11.74032551 -47.34651564 -53.9339432 22.64507627 -13.03856029]\n" ] } ], "source": [ "# predicted labels on the test objects\n", "print(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compute the MAE (mean absolute error) between the original labels and the predicted labels." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.859435388011848" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metrics.mean_absolute_error(test_labels, predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use **cross_val_score()** to evaluate our linear regressor; averaging over several folds gives a more reliable estimate than a single train/test split. \n", "\n", "The scoring is the negated MAE (neg_mean_absolute_error), so the reported values are negative. \n", "\n", "We cross-validate the data using 10 folds." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mean: -4.070071498779695, std: 1.0737104492890204\n" ] } ], "source": [ "linear_scoring = model_selection.cross_val_score(linear_regressor, data, target, scoring = 'neg_mean_absolute_error', \n", " cv = 10)\n", "print('mean: {}, std: {}'.format(linear_scoring.mean(), linear_scoring.std()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create our own scorer with the parameter **greater_is_better** set to True, so that the MAE is reported with a positive sign." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "scorer = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better = True)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mean: 4.070071498779695, std: 1.0737104492890204\n" ] } ], "source": [ "linear_scoring = model_selection.cross_val_score(linear_regressor, data, target, scoring=scorer, \n", " cv = 10)\n", "print('mean: {}, std: {}'.format(linear_scoring.mean(), linear_scoring.std()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the true coefficients returned by the built-in **make_regression()** function." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([38.07925837, 0. ])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficients of the model built with **LinearRegression()**:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([37.86162519, 0.33738658]), -0.13052467965349365)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linear_regressor.coef_, linear_regressor.intercept_" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.13052467965349365" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The fitted regression model also has an intercept term.\n", "linear_regressor.intercept_" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original function equation\n", "y = -0.13 + 38.08*x1 + 0.00*x2\n" ] } ], "source": [ "print(\"Original function 
equation\\n\\\n", "y = {:.2f} + {:.2f}*x1 + {:.2f}*x2\".format(linear_regressor.intercept_, coef[0], coef[1]))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trained function equation\n", "y = 37.86*x1 + 0.34*x2 \n" ] } ], "source": [ "print(\"Trained function equation\\n\\\n", "y = {:.2f}*x1 + {:.2f}*x2 \".\\\n", " format(linear_regressor.coef_[0], \n", " linear_regressor.coef_[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lasso (L1 regularization) for the Linear Regression" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "lasso_regressor = linear_model.Lasso(random_state = 3) # build model\n", "lasso_regressor.fit(train_data, train_labels) # train it with the train set\n", "lasso_predictions = lasso_regressor.predict(test_data) # get predictions using test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We evaluate the model quality by cross-validation. We'll use the same scorer." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mean: 4.154478246666398, std: 1.0170354384993354\n" ] } ], "source": [ "lasso_scoring = model_selection.cross_val_score(lasso_regressor, data, target, scoring = scorer, cv = 10)\n", "print('mean: {}, std: {}'.format(lasso_scoring.mean(), lasso_scoring.std()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the std has decreased compared to the unregularized LinearRegression (std: 1.0737104492890204), while the mean MAE is slightly higher (4.154 vs. 4.070)." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[37.0580843 0. 
]\n" ] } ], "source": [ "print(lasso_regressor.coef_)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original function equation\n", "y = -0.13 + 38.08*x1 + 0.000*x2\n" ] } ], "source": [ "print(\"Original function equation\\n\\\n", "y = {:.2f} + {:.2f}*x1 + {:.3f}*x2\".format(linear_regressor.intercept_, coef[0], coef[1]))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trained Lasso function equation\n", "y = 37.06*x1 + 0.0000*x2\n" ] } ], "source": [ "print(\"Trained Lasso function equation\\n\\\n", "y = {:.2f}*x1 + {:.4f}*x2\".format(lasso_regressor.coef_[0], lasso_regressor.coef_[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of L1 (Lasso) regularization is that it set the weight of the non-informative (redundant) feature to exactly 0.0000, whereas plain LinearRegression gave it a small non-zero weight (0.34)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }