Linear regression - OLS Model

In this section, we will cover:

  • fitting an OLS linear regression model

  • score analysis: MSE and variance explained: \(R^2\)

  • residual analysis

[1]:
import pandas as pd
import numpy as np
import pickle
import time
[2]:
df = pd.read_csv('data/df_resample.csv')
df.head()
[2]:
symboling make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base length ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 0 nissan gas std four sedan fwd front 97.2 173.4 ... 120 2bbl 3.33 3.47 8.5 97.0 5200.0 27 34 9549.0
1 3 volkswagen gas std two hatchback fwd front 94.5 165.7 ... 109 mpfi 3.19 3.40 8.5 90.0 5500.0 24 29 9980.0
2 1 bmw gas std four sedan rwd front 103.5 189.0 ... 164 mpfi 3.31 3.19 9.0 121.0 4250.0 20 25 24565.0
3 2 subaru gas std two hatchback fwd front 93.7 156.9 ... 97 2bbl 3.62 2.36 9.0 69.0 4900.0 31 36 5118.0
4 0 mazda diesel std four sedan rwd front 104.9 175.0 ... 134 idi 3.43 3.64 22.0 72.0 4200.0 31 39 18344.0

5 rows × 25 columns

[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          10000 non-null  int64
 1   make               10000 non-null  object
 2   fuel_type          10000 non-null  object
 3   aspiration         10000 non-null  object
 4   num_of_doors       10000 non-null  object
 5   body_style         10000 non-null  object
 6   drive_wheels       10000 non-null  object
 7   engine_location    10000 non-null  object
 8   wheel_base         10000 non-null  float64
 9   length             10000 non-null  float64
 10  width              10000 non-null  float64
 11  height             10000 non-null  float64
 12  curb_weight        10000 non-null  int64
 13  engine_type        10000 non-null  object
 14  num_of_cylinders   10000 non-null  object
 15  engine_size        10000 non-null  int64
 16  fuel_system        10000 non-null  object
 17  bore               10000 non-null  float64
 18  stroke             10000 non-null  float64
 19  compression_ratio  10000 non-null  float64
 20  horsepower         10000 non-null  float64
 21  peak_rpm           10000 non-null  float64
 22  city_mpg           10000 non-null  int64
 23  highway_mpg        10000 non-null  int64
 24  price              10000 non-null  float64
dtypes: float64(10), int64(5), object(10)
memory usage: 1.9+ MB

Train-test split

To avoid overfitting, we will split our data on train and test sets.

The model will be trained with the train dataset and later we will evaluate it with the test set.

[4]:
from sklearn.model_selection import train_test_split

X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transformation here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=95276
)

Feature Selection

It is very important to carefully chose which features will be included into the model, because:

  • including unnecessary features increases the standard error of the coefficients

  • excluding necessary features results in bias

Selection through correlation

As we want start trying out linear regression models, we will select features that have linear correlation with the dependent variable price.

[5]:
# categorical features
with open('data/category_list', 'rb') as file:
    cat_cols = pickle.load(file)

# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]
num_cols
[5]:
['wheel_base',
 'length',
 'width',
 'height',
 'curb_weight',
 'engine_size',
 'bore',
 'stroke',
 'compression_ratio',
 'horsepower',
 'peak_rpm',
 'city_mpg',
 'highway_mpg']
[6]:
X_train['price'] = y_train
cols = num_cols + ['price']
cor = X_train[cols].corr().abs()
cor.loc[cor.price > .5, 'price'].sort_values()
C:\Users\BJ571WQ\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
[6]:
bore           0.590618
wheel_base     0.637270
length         0.780432
city_mpg       0.785230
highway_mpg    0.788294
width          0.803110
horsepower     0.827188
engine_size    0.844736
curb_weight    0.895085
price          1.000000
Name: price, dtype: float64
[7]:
mask = (cor.price > .5 ) & (cor.price < 1)
correlated = cor[mask].sort_values(by='price').index
not_correlated = [col for col in num_cols if col not in correlated]

print('correlated: ', correlated)
print('not correlated: ', not_correlated)
correlated:  Index(['bore', 'wheel_base', 'length', 'city_mpg', 'highway_mpg', 'width',
       'horsepower', 'engine_size', 'curb_weight'],
      dtype='object')
not correlated:  ['height', 'stroke', 'compression_ratio', 'peak_rpm']

Fitting the model

Formula for OLS

[8]:
cat_list = [f'C({cat})' for cat in cat_cols]
cat_formula = ' + '.join(cat_list)
num_formula = ' + '.join(correlated)

formula = f'price ~ {cat_formula} + {num_formula}'
formula
[8]:
'price ~ C(make) + C(fuel_type) + C(aspiration) + C(num_of_doors) + C(body_style) + C(drive_wheels) + C(engine_location) + C(engine_type) + C(num_of_cylinders) + C(fuel_system) + C(symboling) + bore + wheel_base + length + city_mpg + highway_mpg + width + horsepower + engine_size + curb_weight'
[9]:
import statsmodels.formula.api as smf

model = smf.ols(formula=formula, data=X_train)
res = model.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.969
Method:                 Least Squares   F-statistic:                     3976.
Date:                Tue, 08 Sep 2020   Prob (F-statistic):               0.00
Time:                        16:41:56   Log-Likelihood:                 7096.2
No. Observations:                7000   AIC:                        -1.408e+04
Df Residuals:                    6944   BIC:                        -1.370e+04
Df Model:                          55
Covariance Type:            nonrobust
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                         4.7194      0.071     66.565      0.000       4.580       4.858
C(make)[T.audi]                   0.3223      0.019     16.807      0.000       0.285       0.360
C(make)[T.bmw]                    0.3656      0.018     20.808      0.000       0.331       0.400
C(make)[T.chevrolet]             -0.0854      0.018     -4.718      0.000      -0.121      -0.050
C(make)[T.dodge]                 -0.1320      0.014     -9.546      0.000      -0.159      -0.105
C(make)[T.honda]                  0.1660      0.017      9.787      0.000       0.133       0.199
C(make)[T.isuzu]                 -0.4436      0.020    -22.238      0.000      -0.483      -0.405
C(make)[T.jaguar]                -0.1599      0.021     -7.591      0.000      -0.201      -0.119
C(make)[T.mazda]                 -0.0011      0.014     -0.084      0.933      -0.028       0.025
C(make)[T.mercedes-benz]         -0.0189      0.020     -0.942      0.346      -0.058       0.020
C(make)[T.mercury]               -0.0778      0.023     -3.320      0.001      -0.124      -0.032
C(make)[T.mitsubishi]            -0.1376      0.014     -9.711      0.000      -0.165      -0.110
C(make)[T.nissan]                 0.0138      0.013      1.062      0.288      -0.012       0.039
C(make)[T.peugot]                -0.2821      0.016    -18.150      0.000      -0.313      -0.252
C(make)[T.plymouth]              -0.1657      0.014    -12.015      0.000      -0.193      -0.139
C(make)[T.porsche]                0.3364      0.022     14.997      0.000       0.292       0.380
C(make)[T.saab]                   0.1741      0.016     10.705      0.000       0.142       0.206
C(make)[T.subaru]                -0.1974      0.013    -14.770      0.000      -0.224      -0.171
C(make)[T.toyota]                -0.1405      0.013    -11.237      0.000      -0.165      -0.116
C(make)[T.volkswagen]            -0.0154      0.014     -1.066      0.286      -0.044       0.013
C(make)[T.volvo]                 -0.0357      0.016     -2.168      0.030      -0.068      -0.003
C(fuel_type)[T.gas]               2.2407      0.035     63.510      0.000       2.172       2.310
C(aspiration)[T.turbo]            0.1349      0.006     21.058      0.000       0.122       0.147
C(num_of_doors)[T.two]           -0.0474      0.004    -10.734      0.000      -0.056      -0.039
C(body_style)[T.hardtop]         -0.1464      0.010    -14.358      0.000      -0.166      -0.126
C(body_style)[T.hatchback]       -0.1858      0.009    -20.640      0.000      -0.203      -0.168
C(body_style)[T.sedan]           -0.1872      0.010    -19.010      0.000      -0.207      -0.168
C(body_style)[T.wagon]           -0.2630      0.010    -25.327      0.000      -0.283      -0.243
C(drive_wheels)[T.rwd]            0.1927      0.007     29.121      0.000       0.180       0.206
C(engine_location)[T.rear]        0.2984      0.016     18.586      0.000       0.267       0.330
C(engine_type)[T.l]              -0.1144      0.014     -8.401      0.000      -0.141      -0.088
C(engine_type)[T.ohc]            -0.0648      0.009     -7.108      0.000      -0.083      -0.047
C(engine_type)[T.ohcf]            0.1011      0.008     12.244      0.000       0.085       0.117
C(engine_type)[T.ohcv]           -0.1056      0.011    -10.014      0.000      -0.126      -0.085
C(num_of_cylinders)[T.five]      -0.1452      0.024     -6.137      0.000      -0.192      -0.099
C(num_of_cylinders)[T.four]      -0.0754      0.029     -2.586      0.010      -0.133      -0.018
C(num_of_cylinders)[T.six]       -0.1888      0.022     -8.633      0.000      -0.232      -0.146
C(num_of_cylinders)[T.three]      0.1677      0.024      6.972      0.000       0.121       0.215
C(num_of_cylinders)[T.twelve]    -0.0545      0.030     -1.803      0.071      -0.114       0.005
C(fuel_system)[T.2bbl]            0.2087      0.012     17.444      0.000       0.185       0.232
C(fuel_system)[T.idi]             2.4787      0.037     67.266      0.000       2.406       2.551
C(fuel_system)[T.mfi]             0.2890      0.022     13.203      0.000       0.246       0.332
C(fuel_system)[T.mpfi]            0.2880      0.012     23.062      0.000       0.263       0.312
C(fuel_system)[T.spdi]            0.2471      0.015     16.480      0.000       0.218       0.276
C(fuel_system)[T.spfi]            0.5898      0.025     23.329      0.000       0.540       0.639
C(symboling)[T.-1]                0.1008      0.011      9.326      0.000       0.080       0.122
C(symboling)[T.0]                 0.1112      0.012      9.220      0.000       0.088       0.135
C(symboling)[T.1]                 0.0327      0.013      2.581      0.010       0.008       0.058
C(symboling)[T.2]              2.929e-05      0.013      0.002      0.998      -0.026       0.026
C(symboling)[T.3]                 0.0828      0.014      5.826      0.000       0.055       0.111
bore                             -0.1834      0.015    -12.535      0.000      -0.212      -0.155
wheel_base                        0.0084      0.001     11.295      0.000       0.007       0.010
length                           -0.0045      0.000    -10.618      0.000      -0.005      -0.004
city_mpg                         -0.0178      0.001    -15.892      0.000      -0.020      -0.016
highway_mpg                       0.0131      0.001     13.579      0.000       0.011       0.015
width                             0.0231      0.002     11.809      0.000       0.019       0.027
horsepower                       -0.0002      0.000     -0.887      0.375      -0.001       0.000
engine_size                       0.0022      0.000     10.676      0.000       0.002       0.003
curb_weight                       0.0005   1.25e-05     37.256      0.000       0.000       0.000
==============================================================================
Omnibus:                        5.085   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.079   Jarque-Bera (JB):                5.197
Skew:                          -0.044   Prob(JB):                       0.0744
Kurtosis:                       3.100   Cond. No.                     1.02e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.65e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

More feature selection: p-values

When we fit a linear regression model, we have to test the hypothesis that the evaluated coefficient for a given regressor is actually zero, meaning we don’t need that regressor at all.

In other words, we have:

  • \(H_0\): the coefficient is zero

  • \(H_1\): the coefficient is not zero

Therefore, let’s look into the p-values for each coefficient for each regressor to understand if they should be zero:

if a p-value of a given coefficient is larger than 5%, then we fail to reject the null hypothesis that the coefficient actually is zero

[10]:
res.pvalues[res.pvalues > .05]
# we are not taking out make just because a few rotten apples
[10]:
C(make)[T.mazda]                 0.932911
C(make)[T.mercedes-benz]         0.346465
C(make)[T.nissan]                0.288217
C(make)[T.volkswagen]            0.286270
C(num_of_cylinders)[T.twelve]    0.071427
C(symboling)[T.2]                0.998247
horsepower                       0.375116
dtype: float64

As we can see, the p-values for some makers are larger than 5%. We are not removing this feature.

On the other hand, hypothesis testing for variable horsepower coefficient shows it could be zero.

Therefore, lets remove this feature and fit the model again:

[11]:
correlated = correlated.tolist()
correlated.remove('horsepower')
num_formula = ' + '.join(correlated)

formula = f'price ~ {cat_formula} + {num_formula}'

start = time.time()
model = smf.ols(formula=formula, data=X_train)
res = model.fit()
stop = time.time()
elapsed = stop - start

print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.969
Method:                 Least Squares   F-statistic:                     4050.
Date:                Tue, 08 Sep 2020   Prob (F-statistic):               0.00
Time:                        16:41:57   Log-Likelihood:                 7095.8
No. Observations:                7000   AIC:                        -1.408e+04
Df Residuals:                    6945   BIC:                        -1.370e+04
Df Model:                          54
Covariance Type:            nonrobust
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                         4.7214      0.071     66.632      0.000       4.583       4.860
C(make)[T.audi]                   0.3190      0.019     16.970      0.000       0.282       0.356
C(make)[T.bmw]                    0.3628      0.017     20.991      0.000       0.329       0.397
C(make)[T.chevrolet]             -0.0893      0.018     -5.095      0.000      -0.124      -0.055
C(make)[T.dodge]                 -0.1344      0.014     -9.895      0.000      -0.161      -0.108
C(make)[T.honda]                  0.1620      0.016      9.903      0.000       0.130       0.194
C(make)[T.isuzu]                 -0.4441      0.020    -22.269      0.000      -0.483      -0.405
C(make)[T.jaguar]                -0.1556      0.021     -7.591      0.000      -0.196      -0.115
C(make)[T.mazda]                 -0.0030      0.013     -0.227      0.821      -0.029       0.023
C(make)[T.mercedes-benz]         -0.0196      0.020     -0.980      0.327      -0.059       0.020
C(make)[T.mercury]               -0.0852      0.022     -3.889      0.000      -0.128      -0.042
C(make)[T.mitsubishi]            -0.1401      0.014    -10.097      0.000      -0.167      -0.113
C(make)[T.nissan]                 0.0110      0.013      0.869      0.385      -0.014       0.036
C(make)[T.peugot]                -0.2786      0.015    -18.557      0.000      -0.308      -0.249
C(make)[T.plymouth]              -0.1678      0.014    -12.348      0.000      -0.194      -0.141
C(make)[T.porsche]                0.3321      0.022     15.155      0.000       0.289       0.375
C(make)[T.saab]                   0.1705      0.016     10.833      0.000       0.140       0.201
C(make)[T.subaru]                -0.1956      0.013    -14.798      0.000      -0.222      -0.170
C(make)[T.toyota]                -0.1422      0.012    -11.502      0.000      -0.166      -0.118
C(make)[T.volkswagen]            -0.0176      0.014     -1.236      0.217      -0.046       0.010
C(make)[T.volvo]                 -0.0372      0.016     -2.277      0.023      -0.069      -0.005
C(fuel_type)[T.gas]               2.2405      0.035     63.507      0.000       2.171       2.310
C(aspiration)[T.turbo]            0.1312      0.005     26.932      0.000       0.122       0.141
C(num_of_doors)[T.two]           -0.0475      0.004    -10.755      0.000      -0.056      -0.039
C(body_style)[T.hardtop]         -0.1471      0.010    -14.469      0.000      -0.167      -0.127
C(body_style)[T.hatchback]       -0.1869      0.009    -20.948      0.000      -0.204      -0.169
C(body_style)[T.sedan]           -0.1883      0.010    -19.255      0.000      -0.207      -0.169
C(body_style)[T.wagon]           -0.2639      0.010    -25.553      0.000      -0.284      -0.244
C(drive_wheels)[T.rwd]            0.1912      0.006     29.961      0.000       0.179       0.204
C(engine_location)[T.rear]        0.2957      0.016     18.758      0.000       0.265       0.327
C(engine_type)[T.l]              -0.1142      0.014     -8.387      0.000      -0.141      -0.088
C(engine_type)[T.ohc]            -0.0610      0.008     -7.567      0.000      -0.077      -0.045
C(engine_type)[T.ohcf]            0.1001      0.008     12.235      0.000       0.084       0.116
C(engine_type)[T.ohcv]           -0.1038      0.010    -10.035      0.000      -0.124      -0.084
C(num_of_cylinders)[T.five]      -0.1495      0.023     -6.460      0.000      -0.195      -0.104
C(num_of_cylinders)[T.four]      -0.0816      0.028     -2.880      0.004      -0.137      -0.026
C(num_of_cylinders)[T.six]       -0.1940      0.021     -9.207      0.000      -0.235      -0.153
C(num_of_cylinders)[T.three]      0.1644      0.024      6.919      0.000       0.118       0.211
C(num_of_cylinders)[T.twelve]    -0.0667      0.027     -2.471      0.013      -0.120      -0.014
C(fuel_system)[T.2bbl]            0.2077      0.012     17.438      0.000       0.184       0.231
C(fuel_system)[T.idi]             2.4809      0.037     67.489      0.000       2.409       2.553
C(fuel_system)[T.mfi]             0.2865      0.022     13.199      0.000       0.244       0.329
C(fuel_system)[T.mpfi]            0.2854      0.012     23.486      0.000       0.262       0.309
C(fuel_system)[T.spdi]            0.2449      0.015     16.551      0.000       0.216       0.274
C(fuel_system)[T.spfi]            0.5884      0.025     23.320      0.000       0.539       0.638
C(symboling)[T.-1]                0.1024      0.011      9.598      0.000       0.081       0.123
C(symboling)[T.0]                 0.1131      0.012      9.544      0.000       0.090       0.136
C(symboling)[T.1]                 0.0352      0.012      2.839      0.005       0.011       0.059
C(symboling)[T.2]                 0.0023      0.013      0.179      0.858      -0.023       0.028
C(symboling)[T.3]                 0.0857      0.014      6.180      0.000       0.058       0.113
bore                             -0.1812      0.014    -12.565      0.000      -0.210      -0.153
wheel_base                        0.0085      0.001     11.698      0.000       0.007       0.010
length                           -0.0045      0.000    -10.611      0.000      -0.005      -0.004
city_mpg                         -0.0176      0.001    -16.130      0.000      -0.020      -0.015
highway_mpg                       0.0130      0.001     13.570      0.000       0.011       0.015
width                             0.0228      0.002     11.846      0.000       0.019       0.027
engine_size                       0.0021      0.000     11.974      0.000       0.002       0.002
curb_weight                       0.0005   1.25e-05     37.323      0.000       0.000       0.000
==============================================================================
Omnibus:                        4.426   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.109   Jarque-Bera (JB):                4.512
Skew:                          -0.040   Prob(JB):                        0.105
Kurtosis:                       3.095   Cond. No.                     1.02e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.64e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
[12]:
res.pvalues[res.pvalues > .05]
[12]:
C(make)[T.mazda]            0.820602
C(make)[T.mercedes-benz]    0.327119
C(make)[T.nissan]           0.385010
C(make)[T.volkswagen]       0.216545
C(symboling)[T.2]           0.857865
dtype: float64

Now we are done with feature selection.

Model Scoring & Evaluation

[13]:
r2 = res.rsquared_adj

predictions = np.exp(res.predict(X_test)) # back transformation
mse = np.mean(
    (predictions - np.exp(y_test))**2
)

print(f'R^2: {r2:.4}')
print(f'MSE: {mse:.4}')
R^2: 0.969
MSE: 2.175e+06
[14]:
data = {
    'model name': ['ols'],
    'r2': [r2],
    'MSE': [mse],
    'time': elapsed
       }

score_df = pd.DataFrame(data)
score_df.to_csv('data/scores.csv', index=False)
score_df
[14]:
model name r2 MSE time
0 ols 0.968982 2.175057e+06 0.363137

Residual analysis

In a good model, we should expect that the residuals:

  • have mean zero

  • to be homoscedastic

  • to be normal

  • to have no correlation with the fitted values

  • to have no correlation with any of the features

Residual mean

[15]:
res.resid.mean() # good!
[15]:
-2.431832513138943e-15

Residuals vs fitted values

[16]:
from numpy.polynomial.polynomial import polyfit
import matplotlib.pyplot as plt

%matplotlib inline

plt.figure(figsize=(12, 5))
plt.title('fitted price vs resids')
plt.scatter(res.fittedvalues, res.resid, c="g", alpha=0.5, label="OLS model")
b, m = polyfit(res.fittedvalues, res.resid, 1)
plt.plot(res.fittedvalues, b + m * res.fittedvalues, '-')
plt.xlabel("")
plt.ylabel("")
plt.legend(loc=2)
[16]:
<matplotlib.legend.Legend at 0x27afa2eeac8>
_images/03-ols_26_1.png

Correlation between residuals and \(\hat y\)

In a good model, residuals cannot be correlated to the fitted values.

[17]:
from scipy.stats.stats import pearsonr
pearsonr(res.predict(X_train), res.resid)
[17]:
(8.134968393358477e-13, 0.9999999999497379)

Test for normality

[18]:
import scipy.stats as st
st.shapiro(res.resid)
C:\Users\BJ571WQ\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1676: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
[18]:
(0.9971299767494202, 2.398374832068839e-10)
[19]:
from statsmodels.graphics.gofplots import qqplot

qqplot(res.resid, line='s');
_images/03-ols_31_0.png