Linear Regression application for data analysis and scientific computing

In this post we’ll share with you the vivid yet simple application of the Linear regression methods. We’ll be using the example of predicting a person’s height based on their weight. There you’ll see what kind of math is behind this. We will also introduce you to the basic Python libraries needed to work in the Data Analysis.

The iPython notebook code

Linreg_height_weight-ENG Download

1. The linear algebra & data processing libraries references
2. Initial Data Analysis
3. Mean squared error (MSE) optimization
4. Conclusion

1. The linear algebra and data processing libraries references

Seaborn is not included in the Anaconda build by default, but the library provides convenient high-level functionality for data visualization. To install it run the conda install seaborn command in the terminal.

2. Initial Data Analysis

We’ll be using the SOCR data on the weights and heights of 25’000 teenagers.

2.1 Init imports and data reading

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the data from weights_heights.csv file into a Pandas DataFrame, called data.

data = pd.read_csv('weights_heights.csv', index_col='Index')
data.head(5)

Index	Height	Weight
1	65.78331	112.9925
2	71.51521	136.4873
3	69.39874	153.0269
4	68.21660	142.3354
5	67.78781	144.2971

2.2 Primary Data Analysis

Let’s build histograms of the distribution of the features. This makes us to understand the nature of a feature (its distribution in particular). The histogram lets us to find such data values that stand out from the main data values, “outliers“. It is convenient to plot histograms using the plot Pandas DataFrame method with the kind argument=’hist’.

data.plot(y='Height', kind='hist',     
color='g',  title='Height (inch.) distribution')
data.plot(y='Weight', kind='hist',       
color='b',  title='Weight (pounds) distribution')

2.3 Adding the Body Mass Index, BMI

We add a third attribute, Body Mass Index BMI. Now the data are of more columns.

So, if we hit data.head(5) we get the following result:

Index	Height	Weight	BMI
1	65.78331	112.9925	18.357573
2	71.51521	136.4873	18.762577
3	69.39874	153.0269	22.338895
4	68.21660	142.3354	21.504526
5	67.78781	144.2971	22.077581

Now we display of pairwise dependencies of the features. Created 𝑚×𝑚 graphs (m is the number of features), where the diagonal cells depict histogram of the distribution of that feature and the rest of the cells show scatter plots based on two paired characteristics.
Let’s use the pairplot() method of the Seaborn library.

def make_bmi(height_inch, weight_pound):
    METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462
    return (weight_pound / KILO_TO_POUND) \       / (height_inch / METER_TO_INCH) ** 2

data['BMI'] = data.apply(lambda row: make_bmi(row['Height'],                                          row['Weight']), axis=1)
sns.set(font_scale=1.3)
sns.pairplot(data)

2.4 Adding a categorical feature

Let’s add a categorical feature, called weight_category. It takes only 3 values, [1,2,3], and is calculated as shown below.

def weight_category(weight):
    if weight < 120:    
	    return 1
    elif weight >= 150:   
	    return 3
    return 2

Box plot is a compact way to show statistics of a real feature (mean and quartiles) for different values of a categorical feature. It also helps to track “outliers“, the observations in which the value of this real feature is very different from others.

data['weight_cat'] = data['Weight'].apply(weight_category)
dataBoxplot = sns.boxplot(x='weight_cat', y='Height', data=data)
dataBoxplot.set(xlabel = 'Weight Category', ylabel = 'Height')
plt.show()

3. Mean squared error optimization

The problem of predicting the value of a real feature based on other features (the regression recovery problem) is solved by minimizing the quadratic error function.

3.1 Linear approximation function

We try to approximate dependence of the height on the weight by the following linear function: y(x) = w₀ + w₁*x

y – height
x – weight
w₀ и w₁ are the scalar parameters, subject to the optimization.

3.2 Mean squared error (MSE) of the linear approximation function

The following function calculates squared error of the linear approximation y(x):
Error(w₀, w₁) = 1/n * ∑(y_i – (w₀ + w₁ * x_i))²

n – number of observations in the data set.
y_i and x_i – height and weight of i-th item in the data set.

Calculating and optimizing MSE, we’ll find the optimal w₀ and w₁ for the approximation. Below is the Python error function that calculates MSE based on the n samples of the data set.

def error(w0, w1, n):
    sum = 0
    for v in data.sample(n).values:   
	    sum += (v[0] - (w0 + w1*v[1]))**2
    return np.round(sum/n)

Now we plot both:

(1) The cloud of points corresponding to the data set items as the space of the features height and weight.
(2) Lines y(x) with approximate w₀, w₁. We’ll see if they convey the dependence of height on weight correctly.

# original data distribution
data.plot(x="Weight", y='Height', kind='scatter',      
     color='b',  title='Height on Weight distribution and Linear regression approximation' )
plt.rcParams["font.size"] = "12" 
# 100 samples (values) from 80 till 200
x = np.linspace(80, 200, 100) 
plt.rcParams["figure.figsize"] = [8,5]
# Linear approximation parameters: sets of w0 and w1 
sets = ([60, 0.05], [50, 0.16])
for s in sets:
    y = s[0] + s[1]*x
    plt.plot(x, y, label="Linear regression with w0={0}, w1={1}".format(*s))
plt.legend()

3.3 MSE graph

The MSE optimization is a relatively simple task, since the MSE is convex, namely y(x) ≈ x². Later we leverage some optimization methods. Let’s see how the error function depends on one parameter only (the slope of the straight line, w₁), if the second parameter (the free term, w₀) is fixed.

x = np.linspace(-1, 1.1, 110) 
plt.rcParams["figure.figsize"] = [8,5]
plt.rcParams["font.size"] = "15" 
plt.plot(x, error(50, x, 10))
plt.title('MSE (y - w₀ - w₁*x)^2 with fixed w₀=50')
plt.ylabel('MSE(w₁)')
plt.xlabel('w₁')

3.4 MSE optimization with one fixed parameter

Now, we optimize to find an optimal line slope (w₁), approximating height on weight dependency with fixed w₀ = 50.

We use minimize_scalar method from scipy.optimize for MSE where w₁ ∈ [-5,5].

def err_f(x):
    return error(50, x, 100)
from scipy.optimize import minimize_scalar
w1 = minimize_scalar(err_f, bounds = (-5, 5), method='bounded')
print('Min MSE with w₁ =', w1.x)

Min MSE with w₁ = 0.13747647556112128

Let’s plot the height on weight distribution along with the linear approximation using the optimized parameter w₁.

data.plot(x="Weight", y='Height', kind='scatter',       color='b',  title='Height on Weight distribution' )
x = np.linspace(80, 200, 100) 
y = 50 + w1.x*x
plt.plot(x, y, label="Linear regression with w₀=50, w₁=0.14", color='r')
plt.legend()

3.5 Plot MSE as 3-D grid

We now plot the MSE 3D-grids with parameters w₀ and w₁.

Axis X is «Intercept»
Axis Y is «Slope»
Axis Z is «MSE»

We use numpy.meshgrid that return coordinate matrices from coordinate vectors.

We create objects of type matplotlib.figure.Figure (figure) and matplotlib.axes._subplots.Axes3DSubplot (axis).

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure() 
fig.set_size_inches(20,10)
ax = fig.gca(projection='3d') # get current axis
w0 = np.arange(-100, 100, 1)
w1 = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(w0, w1)
Z = error(X, Y, 10)

# Method `plot_surface` of Axes3DSubplot 
surf = ax.plot_surface(X, Y, Z)
ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')
ax.set_zlabel('MSE')
plt.show()

3.6 MSE optimization using the L-BFGS-B method

We use minimize of the scipy.optimize library to minimize MSE from point 3.2.

w₀ ∈ [-100,100] and w₁ ∈ [-5, 5].
initial guess point (w₀, w₁) = (0, 0).

We use the L-BFGS-B method and later we plot the height on weight distribution along with the Linear approximation for optimized w₀ and w₁.

import scipy.optimize as optimize
def error1(w):
    s = 0
    x = data['Weight']
    y = data['Height']
    for i in range(1, int(len(data.index)/100)):   
	    s = s + (y[i] - w[0] - w[1] * x[i]) ** 2
    return s
optimum = optimize.minimize(error1, np.array([0,0]), method = 'L-BFGS-B', bounds = ((-100, 100), (-5, 5)))
print(optimum)

with the result:

fun: 667.5124003407069
   

  hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> 
 

  jac: array([0.00010232, 0.02563638])
  

  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  

  nfev: 54 
  

  nit: 7

 
njev: 18
   

  status: 0
   

  success: True   
  

  x: array([57.26076187,  0.08431683])

data.plot(x="Weight", y='Height', kind='scatter',   
        color='b',  title='The Height on Weight distribution and the Linear regression after `L-BFGS-B` optimization' )
x = np.linspace(80, 170, 100)  
y = optimum.x[0] + optimum.x[1]*x
plt.plot(x, y, label="Linear regression optimized by  `L-BFGS-B`" \    +"\n w₀ = {0}\n w₁ = {1}".format(np.round(optimum.x[0], 2), np.round(optimum.x[1], 2)), color="r")
plt.legend()

4. Conclusion

In this post we’ve applied the Linear regression approximation to the given data set. Then we’ve used the MSE as the means for optimizing the linear regression approximation function. Using the linear algebra methods we could successfully find optimal parameters for the linear regression approximation function.