In this post we’ll share a vivid yet simple application of linear regression methods, using the example of **predicting a person’s height from their weight**. You’ll see what kind of math is behind this, and we’ll also introduce the basic Python libraries needed for data analysis.



## 1. Linear algebra and data processing libraries

*Seaborn* is not included in the default Anaconda distribution, but the library provides convenient high-level functionality for data visualization. To install it, run the **conda install seaborn** command in the terminal.

## 2. Initial Data Analysis

We’ll be using the SOCR data set with the weights and heights of 25,000 teenagers.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```

### 2.1 Loading the data

Load the data from the *weights_heights.csv* file into a Pandas DataFrame called **data**:

```python
data = pd.read_csv('weights_heights.csv', index_col='Index')
data.head(5)
```

Index | Height | Weight |
---|---|---|
1 | 65.78331 | 112.9925 |
2 | 71.51521 | 136.4873 |
3 | 69.39874 | 153.0269 |
4 | 68.21660 | 142.3354 |
5 | 67.78781 | 144.2971 |

### 2.2 Primary Data Analysis

Let’s build histograms of the feature distributions. This helps us understand the nature of a feature (its distribution in particular). A histogram also lets us spot values that stand out from the bulk of the data, the “**outliers**“. It is convenient to plot histograms with the *plot* method of a Pandas DataFrame, passing the `kind='hist'` argument.

```python
data.plot(y='Height', kind='hist', color='g', title='Height (inch.) distribution')
data.plot(y='Weight', kind='hist', color='b', title='Weight (pounds) distribution')
```

### 2.3 Adding the *Body Mass Index*, BMI

We add a third feature, the *Body Mass Index* (BMI), computed by the `make_bmi` function shown below; the DataFrame now has one more column. Calling

`data.head(5)`

we get the following result:

Index | Height | Weight | BMI |
---|---|---|---|
1 | 65.78331 | 112.9925 | 18.357573 |
2 | 71.51521 | 136.4873 | 18.762577 |
3 | 69.39874 | 153.0269 | 22.338895 |
4 | 68.21660 | 142.3354 | 21.504526 |
5 | 67.78781 | 144.2971 | 22.077581 |
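As a quick sanity check (our addition, using the same conversion constants as the `make_bmi` function below), the BMI of the first row can be reproduced by hand:

```python
# row 1: convert pounds -> kg and inches -> m, then BMI = kg / m^2
weight_kg = 112.9925 / 2.20462    # ≈ 51.25 kg
height_m = 65.78331 / 39.37       # ≈ 1.671 m
print(weight_kg / height_m ** 2)  # ≈ 18.3576, matching the BMI column
```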

Now we display the pairwise dependencies between the features. This creates an **m×m** grid of plots (**m** is the number of features), where the diagonal cells depict the histogram of a feature’s distribution and the rest of the cells show scatter plots for each pair of features.

Let’s use the *pairplot()* method of the **Seaborn** library.

```python
def make_bmi(height_inch, weight_pound):
    METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462
    return (weight_pound / KILO_TO_POUND) \
           / (height_inch / METER_TO_INCH) ** 2

data['BMI'] = data.apply(lambda row: make_bmi(row['Height'], row['Weight']), axis=1)
sns.set(font_scale=1.3)
sns.pairplot(data)
```

### 2.4 Adding a categorical feature

Let’s add a categorical feature called *weight_category*. It takes only three values, `[1, 2, 3]`, and is calculated as shown below.

```python
def weight_category(weight):
    if weight < 120:
        return 1
    elif weight >= 150:
        return 3
    return 2
```
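A quick check of the thresholds (our addition), confirming the three categories:

```python
# boundary behaviour: < 120 -> 1, 120..149.99 -> 2, >= 150 -> 3
print(weight_category(110), weight_category(120), weight_category(150))  # 1 2 3
```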

A box plot is a compact way to show the statistics of a real-valued feature (median and quartiles) for different values of a categorical feature. It also helps to track “**outliers**“, observations whose value of the real-valued feature differs strongly from the others.

```python
data['weight_cat'] = data['Weight'].apply(weight_category)
dataBoxplot = sns.boxplot(x='weight_cat', y='Height', data=data)
dataBoxplot.set(xlabel='Weight Category', ylabel='Height')
plt.show()
```

## 3. Linear regression

The problem of predicting the value of a real-valued feature from other features (the regression problem) is solved by minimizing a quadratic error function.

### 3.1 Linear approximation function

We try to approximate the dependence of *height* on *weight* by the following linear function:

**y(x) = w₀ + w₁·x**

where:

- **y** is the height,
- **x** is the weight,
- **w₀** and **w₁** are the scalar parameters subject to optimization.

### 3.2 Mean squared error (MSE) of the linear approximation function

The following function calculates the mean squared error of the linear approximation **y(x)**:

**Error(w₀, w₁) = 1/n · ∑ᵢ (yᵢ − (w₀ + w₁·xᵢ))²**

where:

- **n** is the number of observations in the data set,
- **yᵢ** and **xᵢ** are the height and weight of the **i**-th item in the data set.

Calculating and minimizing the MSE, we’ll find the optimal **w₀** and **w₁** for the approximation. Below is the Python **error** function that calculates the MSE on **n** sampled items of the data set.

```python
def error(w0, w1, n):
    # estimate the MSE on a random sample of n observations
    s = 0
    for v in data.sample(n).values:
        s += (v[0] - (w0 + w1 * v[1])) ** 2  # v[0] is Height, v[1] is Weight
    return np.round(s / n)
```
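For example, a single evaluation of a hand-picked line (our addition; the exact value varies between runs, because `data.sample` draws a random subset):

```python
# MSE of the line y = 50 + 0.14*x, estimated on 100 random rows
print(error(50, 0.14, 100))
```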

Now we plot both:

1. The cloud of points corresponding to the data set items in the *height* and *weight* feature space.
2. Lines **y(x)** with approximate **w₀, w₁**. We’ll see whether they convey the dependence of *height* on *weight* correctly.

```python
# original data distribution
data.plot(x='Weight', y='Height', kind='scatter', color='b',
          title='Height on Weight distribution and Linear regression approximation')
plt.rcParams["font.size"] = "12"
plt.rcParams["figure.figsize"] = [8, 5]
# 100 samples (values) from 80 to 200
x = np.linspace(80, 200, 100)
# linear approximation parameters: sets of w0 and w1
sets = ([60, 0.05], [50, 0.16])
for s in sets:
    y = s[0] + s[1] * x
    plt.plot(x, y, label="Linear regression with w0={0}, w1={1}".format(*s))
plt.legend()
```

### 3.3 MSE graph

The MSE optimization is a relatively simple task, since the MSE is a convex, quadratic (parabola-like, **≈ x²**) function of the parameters. Later we leverage some optimization methods. Let’s see how the error function depends on *one parameter only* (the slope of the straight line, **w₁**) when the second parameter (the intercept, **w₀**) is fixed.

```python
x = np.linspace(-1, 1.1, 110)
plt.rcParams["figure.figsize"] = [8, 5]
plt.rcParams["font.size"] = "15"
plt.plot(x, error(50, x, 10))
plt.title('MSE(y - w0 - w1*x)^2 with fixed w0=50')
plt.ylabel('MSE(w1)')
plt.xlabel('w1')
```
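A brief justification of the convexity claim (our addition, not part of the original derivation): the Hessian of **Error(w₀, w₁)** is positive semi-definite, since for any vector **v** = (v₀, v₁)

```latex
\nabla^2\,\mathrm{Error}(w_0, w_1)
  = \frac{2}{n} \sum_{i=1}^{n} \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix},
\qquad
v^{\top} \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix} v
  = (v_0 + v_1 x_i)^2 \;\ge\; 0.
```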

### 3.4 MSE optimization with one fixed parameter

Now, we optimize to find the optimal line slope (**w₁**), approximating the dependence of *height* on *weight* with fixed **w₀ = 50**.

We use the *minimize_scalar* method from *scipy.optimize* to minimize the MSE over **w₁** ∈ [-5, 5].

```python
from scipy.optimize import minimize_scalar

def err_f(x):
    # MSE as a function of the slope only, with fixed w0 = 50
    return error(50, x, 100)

w1 = minimize_scalar(err_f, bounds=(-5, 5), method='bounded')
print('Min MSE with w1 =', w1.x)
```

`Min MSE with w1 = 0.13747647556112128`

Let’s plot the *height on weight* distribution along with the linear approximation using the optimized parameter **w₁**.

data.plot(x="Weight", y='Height', kind='scatter', color='b', title='Height on Weight distribution' ) x = np.linspace(80, 200, 100) y = 50 + w1.x*x plt.plot(x, y, label="Linear regression withw=50,_{0}w=0.14", color='r') plt.legend()_{1}

### 3.5 Plotting the MSE as a 3D surface

We now plot the MSE as a 3D surface over the parameters **w₀** and **w₁**:

- Axis **X** is the «Intercept»
- Axis **Y** is the «Slope»
- Axis **Z** is the «MSE»

We use *numpy.meshgrid*, which returns coordinate matrices from coordinate vectors, as illustrated below.
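To make the shapes concrete, here is a tiny example of what *meshgrid* produces (our addition):

```python
# meshgrid turns two 1-D vectors into two 2-D coordinate matrices
a, b = np.meshgrid(np.array([1, 2, 3]), np.array([10, 20]))
print(a)  # [[1 2 3]
          #  [1 2 3]]
print(b)  # [[10 10 10]
          #  [20 20 20]]
```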

We create objects of type *matplotlib.figure.Figure* (figure) and *matplotlib.axes._subplots.Axes3DSubplot* (axis).

```python
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
fig.set_size_inches(20, 10)
ax = fig.gca(projection='3d')  # get current axis
w0 = np.arange(-100, 100, 1)
w1 = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(w0, w1)
Z = error(X, Y, 10)
# method `plot_surface` of Axes3DSubplot
surf = ax.plot_surface(X, Y, Z)
ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')
ax.set_zlabel('MSE')
plt.show()
```

### 3.6 MSE optimization using the *L-BFGS-B* method

We use *minimize* from the *scipy.optimize* library to minimize the MSE from section **3.2**, with:

- **w₀** ∈ [-100, 100] and **w₁** ∈ [-5, 5];
- initial guess point (**w₀**, **w₁**) = (0, 0).

We use the *L-BFGS-B* method, and later we plot the *height on weight* distribution along with the linear approximation for the optimized **w₀** and **w₁**.

```python
import scipy.optimize as optimize

def error1(w):
    # summed squared error over the first ~1% of the rows
    s = 0
    x = data['Weight']
    y = data['Height']
    for i in range(1, int(len(data.index) / 100)):
        s = s + (y[i] - w[0] - w[1] * x[i]) ** 2
    return s

optimum = optimize.minimize(error1, np.array([0, 0]),
                            method='L-BFGS-B',
                            bounds=((-100, 100), (-5, 5)))
print(optimum)
```

```
      fun: 667.5124003407069
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([0.00010232, 0.02563638])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 54
      nit: 7
     njev: 18
   status: 0
  success: True
        x: array([57.26076187,  0.08431683])
```

data.plot(x="Weight", y='Height', kind='scatter', color='b', title='The Height on Weight distribution and the Linear regression after `L-BFGS-B` optimization' ) x = np.linspace(80, 170, 100) y = optimum.x[0] + optimum.x[1]*x plt.plot(x, y, label="Linear regression optimized by `L-BFGS-B`" \ +"\n w_{0}= {0}\n w_{1}= {1}".format(np.round(optimum.x[0], 2), np.round(optimum.x[1], 2)), color="r") plt.legend()

## 4. Conclusion

In this post we’ve applied the linear regression approximation to the given data set, using the MSE as the objective for optimizing the linear approximation function. With numerical optimization methods we successfully found optimal parameters for the linear regression approximation function.