Data Mining

Linear Regression application for data analysis and scientific computing

In this post we’ll share with you the vivid yet simple application of the Linear regression methods. We’ll be using the example of predicting a person’s height based on their weight. There you’ll see what kind of math is behind this. We will also introduce you to the basic Python libraries needed to work in the Data Analysis.


1. The linear algebra and data processing libraries references

Seaborn is not included in the Anaconda build by default, but the library provides convenient high-level functionality for data visualization. To install it run the conda install seaborn command in the terminal.

2. Initial Data Analysis

We’ll be using the SOCR data on the weights and heights of 25’000 teenagers.

2.1 Init imports and data reading

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the data from weights_heights.csv file into a Pandas DataFrame, called data.

data = pd.read_csv('weights_heights.csv', index_col='Index')
Index Height Weight
1 65.78331 112.9925
2 71.51521 136.4873
3 69.39874 153.0269
4 68.21660 142.3354
5 67.78781 144.2971

2.2 Primary Data Analysis

Let’s build histograms of the distribution of the features. This makes us to understand the nature of a feature (its distribution in particular). The histogram lets us to find such data values that stand out from the main data values, “outliers“. It is convenient to plot histograms using the plot Pandas DataFrame method with the kind argument=’hist’.

data.plot(y='Height', kind='hist',     
color='g',  title='Height (inch.) distribution')
data.plot(y='Weight', kind='hist',       
color='b',  title='Weight (pounds) distribution')

2.3 Adding the Body Mass Index, BMI

We add a third attribute, Body Mass Index BMI. Now the data are of more columns.

So, if we hit data.head(5) we get the following result:
Index Height Weight BMI
1 65.78331 112.9925 18.357573
2 71.51521 136.4873 18.762577
3 69.39874 153.0269 22.338895
4 68.21660 142.3354 21.504526
5 67.78781 144.2971 22.077581

Now we display of pairwise dependencies of the features. Created 𝑚×𝑚 graphs (m is the number of features), where the diagonal cells depict histogram of the distribution of that feature and the rest of the cells show scatter plots based on two paired characteristics.
Let’s use the pairplot() method of the Seaborn library.

def make_bmi(height_inch, weight_pound):
    METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462
    return (weight_pound / KILO_TO_POUND) \       / (height_inch / METER_TO_INCH) ** 2

data['BMI'] = data.apply(lambda row: make_bmi(row['Height'],                                          row['Weight']), axis=1)

2.4 Adding a categorical feature

Let’s add a categorical feature, called weight_category. It takes only 3 values, [1,2,3], and is calculated as shown below.

def weight_category(weight):
    if weight < 120:    
	    return 1
    elif weight >= 150:   
	    return 3
    return 2

Box plot is a compact way to show statistics of a real feature (mean and quartiles) for different values of a categorical feature. It also helps to track “outliers“, the observations in which the value of this real feature is very different from others.

data['weight_cat'] = data['Weight'].apply(weight_category)
dataBoxplot = sns.boxplot(x='weight_cat', y='Height', data=data)
dataBoxplot.set(xlabel = 'Weight Category', ylabel = 'Height')

3. Mean squared error optimization

The problem of predicting the value of a real feature based on other features (the regression recovery problem) is solved by minimizing the quadratic error function.

3.1 Linear approximation function

We try to approximate dependence of the height on the weight by the following linear function: y(x) = w0 + w1*x

  • y – height
  • x – weight
  • w0 и w1 are the scalar parameters, subject to the optimization.

3.2 Mean squared error (MSE) of the linear approximation function

The following function calculates squared error of the linear approximation y(x):
Error(w0, w1) = 1/n * ∑(yi – (w0 + w1 * xi))2

  • n – number of observations in the data set.
  • yi and xi – height and weight of i-th item in the data set.

Calculating and optimizing MSE, we’ll find the optimal w0 and w1 for the approximation. Below is the Python error function that calculates MSE based on the n samples of the data set.

def error(w0, w1, n):
    sum = 0
    for v in data.sample(n).values:   
	    sum += (v[0] - (w0 + w1*v[1]))**2
    return np.round(sum/n)

Now we plot both:

  • (1) The cloud of points corresponding to the data set items as the space of the features height and weight.
  • (2) Lines y(x) with approximate w0, w1. We’ll see if they convey the dependence of height on weight correctly.
# original data distribution
data.plot(x="Weight", y='Height', kind='scatter',      
     color='b',  title='Height on Weight distribution and Linear regression approximation' )
plt.rcParams["font.size"] = "12" 
# 100 samples (values) from 80 till 200
x = np.linspace(80, 200, 100) 
plt.rcParams["figure.figsize"] = [8,5]
# Linear approximation parameters: sets of w0 and w1 
sets = ([60, 0.05], [50, 0.16])
for s in sets:
    y = s[0] + s[1]*x
    plt.plot(x, y, label="Linear regression with w0={0}, w1={1}".format(*s))

3.3 MSE graph

The MSE optimization is a relatively simple task, since the MSE is convex, namely y(x) ≈ x2. Later we leverage some optimization methods. Let’s see how the error function depends on one parameter only (the slope of the straight line, w1), if the second parameter (the free term, w0) is fixed.

x = np.linspace(-1, 1.1, 110) 
plt.rcParams["figure.figsize"] = [8,5]
plt.rcParams["font.size"] = "15" 
plt.plot(x, error(50, x, 10))
plt.title('MSE (y - w0 - w1*x)^2 with fixed w0=50')

3.4 MSE optimization with one fixed parameter

Now, we optimize to find an optimal line slope (w1), approximating height on weight dependency with fixed w0 = 50.

We use minimize_scalar method from scipy.optimize for MSE where w1 ∈ [-5,5].

def err_f(x):
    return error(50, x, 100)
from scipy.optimize import minimize_scalar
w1 = minimize_scalar(err_f, bounds = (-5, 5), method='bounded')
print('Min MSE with w1 =', w1.x)
Min MSE with w1 = 0.13747647556112128

Let’s plot the height on weight distribution along with the linear approximation using the optimized parameter w1.

data.plot(x="Weight", y='Height', kind='scatter',       color='b',  title='Height on Weight distribution' )
x = np.linspace(80, 200, 100) 
y = 50 + w1.x*x
plt.plot(x, y, label="Linear regression with w0=50, w1=0.14", color='r')

3.5 Plot MSE as 3-D grid

We now plot the MSE 3D-grids with parameters w0 and w1.

  • Axis X is «Intercept»
  • Axis Y is «Slope»
  • Axis Z is «MSE»

We use numpy.meshgrid that return coordinate matrices from coordinate vectors.

We create objects of type matplotlib.figure.Figure (figure) and matplotlib.axes._subplots.Axes3DSubplot (axis).

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure() 
ax = fig.gca(projection='3d') # get current axis
w0 = np.arange(-100, 100, 1)
w1 = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(w0, w1)
Z = error(X, Y, 10)

# Method `plot_surface` of Axes3DSubplot 
surf = ax.plot_surface(X, Y, Z)

3.6 MSE optimization using the L-BFGS-B method

We use minimize of the scipy.optimize library to minimize MSE from point 3.2.

  • w0 ∈ [-100,100] and w1 ∈ [-5, 5].
  • initial guess point (w0, w1) = (0, 0).

We use the L-BFGS-B method and later we plot the height on weight distribution along with the Linear approximation for optimized w0 and w1.

import scipy.optimize as optimize
def error1(w):
    s = 0
    x = data['Weight']
    y = data['Height']
    for i in range(1, int(len(data.index)/100)):   
	    s = s + (y[i] - w[0] - w[1] * x[i]) ** 2
    return s
optimum = optimize.minimize(error1, np.array([0,0]), method = 'L-BFGS-B', bounds = ((-100, 100), (-5, 5)))
with the result:
fun: 667.5124003407069
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([0.00010232, 0.02563638])
nfev: 54
nit: 7
njev: 18
status: 0
success: True
x: array([57.26076187, 0.08431683])
data.plot(x="Weight", y='Height', kind='scatter',   
        color='b',  title='The Height on Weight distribution and the Linear regression after `L-BFGS-B` optimization' )
x = np.linspace(80, 170, 100)  
y = optimum.x[0] + optimum.x[1]*x
plt.plot(x, y, label="Linear regression optimized by  `L-BFGS-B`" \    +"\n w0 = {0}\n w1 = {1}".format(np.round(optimum.x[0], 2), np.round(optimum.x[1], 2)), color="r")

4. Conclusion

In this post we’ve applied the Linear regression approximation to the given data set. Then we’ve used the MSE as the means for optimizing the linear regression approximation function. Using the linear algebra methods we could successfully find optimal parameters for the linear regression approximation function.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.