Linear regression - sklearn

In this section, we will cover:

  • fitting different linear regression models with sklearn

  • normalization and encoding of features

  • score analysis: MSE and explained variance (\(R^2\))

  • residual analysis

  • most important features

  • comparing the models: conclusions

Feature selection

This time we will not be manually selecting features. Some of sklearn's linear regression models use regularization, which should compensate for the lack of explicit feature selection.

This has an important consequence, however: we will be using all features, including the ones that are not linearly correlated with the outcome. This means we will be fitting a linear model to a problem that may not be strictly linear.

[1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/df_resample.csv')
df.head()
[1]:
symboling make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base length ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 0 nissan gas std four sedan fwd front 97.2 173.4 ... 120 2bbl 3.33 3.47 8.5 97.0 5200.0 27 34 9549.0
1 3 volkswagen gas std two hatchback fwd front 94.5 165.7 ... 109 mpfi 3.19 3.40 8.5 90.0 5500.0 24 29 9980.0
2 1 bmw gas std four sedan rwd front 103.5 189.0 ... 164 mpfi 3.31 3.19 9.0 121.0 4250.0 20 25 24565.0
3 2 subaru gas std two hatchback fwd front 93.7 156.9 ... 97 2bbl 3.62 2.36 9.0 69.0 4900.0 31 36 5118.0
4 0 mazda diesel std four sedan rwd front 104.9 175.0 ... 134 idi 3.43 3.64 22.0 72.0 4200.0 31 39 18344.0

5 rows × 25 columns
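
As a quick, illustrative check of the point above (and not part of the modelling pipeline), we can look at how strongly each numeric feature is linearly correlated with price:

# correlation of each numeric feature with price (illustrative only)
corr_with_price = (
    df.select_dtypes(include='number')   # numeric columns only
      .corr()['price']
      .drop('price')                     # drop the trivial self-correlation
      .sort_values(ascending=False)
)
print(corr_with_price)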

[2]:
X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transform here

# train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=95276
)

Normalization and encoding of features

The ranges of the numeric features vary widely from one feature to the next:

  • bore range: ~2

  • curb_weight range: ~3500

It is also easy to see that they measure completely different things and are therefore on different scales.

In order to properly fit linear regression models with regularization, we must account for those differences.

Finally, it is important to encode the categorical features, like fuel_type, so sklearn can use them.

[3]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pickle

with open('data/category_list', 'rb') as file:
    cat_cols = pickle.load(file)

# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]

# normalize numeric features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(X_train[num_cols])

# encode categories
encoder = OneHotEncoder(sparse=False)
cat_encoded = encoder.fit_transform(X_train[cat_cols])

# all together
X_train_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
X_train_proc.shape
[3]:
(7000, 73)
[4]:
# apply transformations on test set
num_scaled = scaler.transform(X_test[num_cols])

# encode categories
cat_encoded = encoder.transform(X_test[cat_cols])

# all together
X_test_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
X_test_proc.shape
[4]:
(3000, 73)
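
As a side note, the same preprocessing can be written more compactly with a ColumnTransformer, which keeps the scaler and the encoder together and avoids the manual concatenation. The sketch below is an equivalent setup, not the code used for the results that follow; the transformers are listed in the same order as above (categories first, then numeric), so the columns line up the same way.

from sklearn.compose import ColumnTransformer

# equivalent preprocessing in a single object (sketch only)
preprocessor = ColumnTransformer([
    ('categories', OneHotEncoder(sparse=False), cat_cols),
    ('numeric', StandardScaler(), num_cols),
])

X_train_alt = preprocessor.fit_transform(X_train)  # fit on the training set only
X_test_alt = preprocessor.transform(X_test)        # reuse the fitted statistics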

Hyper-parameter tuning and Cross Validation

Note that we are going to use GridSearchCV, so we can iterate over a grid of hyper-parameters for each model and find the best combination through cross validation.

Linear regression model

Let's start by trying a simple sklearn linear regression model, without regularization.

[5]:
%load_ext autoreload
%autoreload 2

import aux_functions as aux
[6]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
grid_params = {'fit_intercept': [True, False]}
name = 'Linear Regression'
data = (X_train_proc, y_train, X_test_proc, y_test)

linear_results = aux.make_regressor(name, model, grid_params, data)
Linear Regression
Score r2: 0.9707
Score MSE: 2.122e+06
Time: 1.5e+01s
{'copy_X': True, 'fit_intercept': False, 'n_jobs': None, 'normalize': False}
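
The heavy lifting here is done by aux.make_regressor, which lives in aux_functions.py and is not shown in this notebook. Judging from the call signature and the printed output, it presumably runs GridSearchCV over grid_params, times the search, scores the refitted best estimator on the test set and returns the fitted model, its best parameters and a dict of scores. The sketch below is a hypothetical reconstruction under those assumptions; in particular, the np.exp back-transform is an assumption, made because the reported MSE values are on the original price scale rather than the log scale.

import time
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

def make_regressor_sketch(name, model, grid_params, data, cv=5):
    """Hypothetical reconstruction of aux.make_regressor (sketch only)."""
    X_train, y_train, X_test, y_test = data

    start = time.time()
    grid = GridSearchCV(model, grid_params, cv=cv)  # cross-validated grid search
    grid.fit(X_train, y_train)
    elapsed = time.time() - start

    # the target was log-transformed, so convert predictions back to prices
    # before scoring (assumed: the reported MSE is on the price scale)
    y_pred = np.exp(grid.predict(X_test))
    scores = {
        'mse': mean_squared_error(np.exp(y_test), y_pred),
        'r2': r2_score(np.exp(y_test), y_pred),
        'model name': name,
        'time': elapsed,
    }

    print(name)
    print(f"Score r2: {scores['r2']:.4f}")
    print(f"Score MSE: {scores['mse']:.4g}")
    print(f"Time: {scores['time']:.2g}s")
    print(grid.best_estimator_.get_params())

    # index 1 is a guess; only indices 0 and 2 are used in this notebook
    return grid.best_estimator_, grid.best_params_, scores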

Regularized models - Ridge regularization

Ridge regression adds an L2 penalty on the regression coefficients, shrinking them towards zero so that the least important regressors contribute less.

  • L2 (squared penalty): \(\text{resid} = \sum_i{(y_i - \hat y_i)^2} + \lambda \sum_j{\beta_j^2}\)

However, the squared loss itself is not robust: it remains sensitive to outliers.

[7]:
from sklearn.linear_model import Ridge

model = Ridge()
grid_params = {'alpha': [.001, .01, .1, 1]}
name = 'Ridge Regression'
ridge_results = aux.make_regressor(name, model, grid_params, data)
Ridge Regression
Score r2: 0.9707
Score MSE: 2.123e+06
Time: 4.1s
{'alpha': 0.01, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001}
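
The selected alpha is small (0.01), so only mild shrinkage was needed here. To see the effect of the L2 penalty more directly, we can refit Ridge with increasing alpha values and watch the average coefficient magnitude fall; this is purely illustrative and not part of the pipeline:

# illustrative only: a larger alpha shrinks the coefficients harder
for alpha in [0.01, 1, 100]:
    ridge = Ridge(alpha=alpha).fit(X_train_proc, y_train)
    print(f'alpha={alpha:<6} mean |coef|: {np.abs(ridge.coef_).mean():.4f}')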

Regularized models - Lasso

Lasso regression uses an L1 penalty instead, which penalizes coefficients more aggressively and can shrink them all the way to zero.

As a consequence, lasso performs feature selection, but it is not a stable solution: the selected subset can change when features are correlated or the data is perturbed.

  • L1 (absolute penalty): \(\text{resid} = \sum_i{(y_i - \hat y_i)^2} + \lambda \sum_j{|\beta_j|}\)

[8]:
from sklearn.linear_model import Lasso

model = Lasso()
grid_params = {'alpha': [.001, .01, .1, 1]}
name = 'Lasso Regression'
lasso_results = aux.make_regressor(name, model, grid_params, data)
Lasso Regression
Score r2: 0.9599
Score MSE: 2.637e+06
Time: 0.49s
{'alpha': 0.001, 'copy_X': True, 'fit_intercept': True, 'max_iter': 1000, 'normalize': False, 'positive': False, 'precompute': False, 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
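
The selected alpha is the smallest value in the grid (0.001), and the scores still drop relative to Ridge, consistent with the L1 penalty shrinking some coefficients exactly to zero. To see the selection effect directly, a purely illustrative count of the zeroed coefficients:

# illustrative only: a larger alpha zeroes out more coefficients
for alpha in [0.001, 0.01, 0.1]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train_proc, y_train)
    n_zero = int((lasso.coef_ == 0).sum())
    print(f'alpha={alpha:<6} zeroed coefficients: {n_zero} of {lasso.coef_.size}')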

Huber Regressor

The Huber loss is quadratic for small residuals and linear (absolute) for large ones, and sklearn's HuberRegressor combines it with an L2 penalty on the coefficients. Because large residuals contribute only linearly, the estimator is particularly robust to outliers.
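
For reference, and ignoring the scale parameter that HuberRegressor estimates jointly with the coefficients, the per-sample Huber loss on a residual \(r = y_i - \hat y_i\) with threshold \(\epsilon\) is the standard piecewise form:

  • Huber: \(L_{\epsilon}(r) = \frac{1}{2} r^2\) if \(|r| \le \epsilon\), and \(L_{\epsilon}(r) = \epsilon\,(|r| - \frac{1}{2}\epsilon)\) otherwise

Residuals within \(\epsilon\) are penalized quadratically and larger ones only linearly, which limits the influence of outliers. The grid below searches over both the regularization strength (alpha) and the threshold (epsilon):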

[9]:
from sklearn.linear_model import HuberRegressor

model = HuberRegressor(max_iter=1000)
grid_params = {
    'alpha': [.001, .01, .1],
    'epsilon': [1, 5, 10]
}
name = 'Huber Regression'
huber_results = aux.make_regressor(name, model, grid_params, data)

Huber Regression
Score r2: 0.9707
Score MSE: 2.123e+06
Time: 6.4e+01s
{'alpha': 0.001, 'epsilon': 10, 'fit_intercept': True, 'max_iter': 1000, 'tol': 1e-05, 'warm_start': False}

Comparing models - MSE and \(R^2\)

[10]:
df_scores = pd.DataFrame({
    'MSE': [
        linear_results[2]['mse'],
        ridge_results[2]['mse'],
        lasso_results[2]['mse'],
        huber_results[2]['mse'],
        ],
    'r2': [
        linear_results[2]['r2'],
        ridge_results[2]['r2'],
        lasso_results[2]['r2'],
        huber_results[2]['r2'],
        ],
    'model name': [
        linear_results[2]['model name'],
        ridge_results[2]['model name'],
        lasso_results[2]['model name'],
        huber_results[2]['model name'],
        ],
    'time': [
        linear_results[2]['time'],
        ridge_results[2]['time'],
        lasso_results[2]['time'],
        huber_results[2]['time'],
        ],
    },
#     index=['linear', 'ridge', 'lasso', 'huber']
)

# load ols results
df_scores_ols = pd.read_csv('data/scores.csv')
df_scores = pd.concat([df_scores, df_scores_ols], axis=0)
df_scores.to_csv('data/sk_scores.csv', index=False)
df_scores.sort_values(by='MSE').round(4)
[10]:
MSE r2 model name time
0 2.122211e+06 0.9707 Linear Regression 14.9926
3 2.123061e+06 0.9707 Huber Regression 63.8937
1 2.123265e+06 0.9707 Ridge Regression 4.0631
0 2.175057e+06 0.9690 ols 0.7920
2 2.636549e+06 0.9599 Lasso Regression 0.4869

Most important features

Let's check which features are the most important, aside from make:

[11]:
model = ridge_results[0]

cat_list = []
# make is the categories_[0]
for item in encoder.categories_[1:]:
    for cat in item:
        cat_list.append(cat)
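
Depending on the sklearn version, the encoder can also produce these names itself: recent releases expose OneHotEncoder.get_feature_names_out() (older ones have get_feature_names()), which returns the expanded column names prefixed with the original column, e.g. fuel_type_gas. A roughly equivalent, version-dependent alternative to the loop above:

# version-dependent alternative: let the encoder name the columns
encoded_names = encoder.get_feature_names_out(cat_cols)
cat_list_alt = [n for n in encoded_names if not n.startswith('make_')]
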
[12]:
features = cat_list + num_cols
coefs = model.coef_

# the first 21 coefficients (indices 0 to 20) are the make coefs
df_features = pd.DataFrame({'feature': features, 'importance': coefs[21:]})
(df_features.sort_values(by='importance', ascending=False)
    .set_index('feature').head(10)
    .plot(kind='bar', figsize=(13, 5), rot=0)
)
[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x2340bb71a88>
[Figure: bar plot of the ten largest Ridge coefficients by feature, make excluded]
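
One caveat: sorting by the raw coefficient keeps only the largest positive effects. Because the numeric features were standardized, the magnitude of a coefficient is what reflects its influence, so ranking by absolute value also surfaces features with strong negative coefficients. A small variation on the plot above:

# rank by absolute coefficient so strong negative effects are not hidden
(df_features.assign(abs_importance=df_features.importance.abs())
    .sort_values(by='abs_importance', ascending=False)
    .set_index('feature')['importance'].head(10)
    .plot(kind='bar', figsize=(13, 5), rot=0)
)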

Comparing the models - conclusions

  • MSE analysis favours the plain Linear model: it has the smallest MSE

  • \(R^2\) analysis shows a tie between the Linear, Huber and Ridge models

  • Ridge would be the safest choice: it matches the Linear model's scores while the L2 penalty keeps the coefficients in check, and it also trains faster

  • OLS performs worse because it was built on the assumption of a linear phenomenon, using only the linearly correlated variables.

[14]:
df_scores.sort_values(by='MSE').round(4)
[14]:
MSE r2 model name time
0 2.122211e+06 0.9707 Linear Regression 14.9926
3 2.123061e+06 0.9707 Huber Regression 63.8937
1 2.123265e+06 0.9707 Ridge Regression 4.0631
0 2.175057e+06 0.9690 ols 0.7920
2 2.636549e+06 0.9599 Lasso Regression 0.4869