Linear regression - sklearn¶
In this section, we will cover:
fitting different linear regression models with sklearn
normalization and encoding of features
score analysis: MSE and explained variance (\(R^2\))
residual analysis
most important features
comparing the models: conclusions
Feature selection¶
This time we will not be manually selecting features. Some sklearn linear regression models use regularization, which should compensate for the lack of explicit feature selection.
This has an important consequence, however: we will be using all features, even the ones that are not linearly correlated with the outcome. This means we will be trying to fit a linear model to a problem that may not be strictly linear.
[1]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/df_resample.csv')
df.head()
[1]:
symboling | make | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location | wheel_base | length | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | nissan | gas | std | four | sedan | fwd | front | 97.2 | 173.4 | ... | 120 | 2bbl | 3.33 | 3.47 | 8.5 | 97.0 | 5200.0 | 27 | 34 | 9549.0 |
1 | 3 | volkswagen | gas | std | two | hatchback | fwd | front | 94.5 | 165.7 | ... | 109 | mpfi | 3.19 | 3.40 | 8.5 | 90.0 | 5500.0 | 24 | 29 | 9980.0 |
2 | 1 | bmw | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | ... | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121.0 | 4250.0 | 20 | 25 | 24565.0 |
3 | 2 | subaru | gas | std | two | hatchback | fwd | front | 93.7 | 156.9 | ... | 97 | 2bbl | 3.62 | 2.36 | 9.0 | 69.0 | 4900.0 | 31 | 36 | 5118.0 |
4 | 0 | mazda | diesel | std | four | sedan | rwd | front | 104.9 | 175.0 | ... | 134 | idi | 3.43 | 3.64 | 22.0 | 72.0 | 4200.0 | 31 | 39 | 18344.0 |
5 rows × 25 columns
[2]:
X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transform here
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=.3, random_state=95276
)
Normalization and encoding of features¶
The range of the numeric features varies a lot from one feature to another:
bore range: ~2
curb_weight range: ~3500
It is also easy to see that they measure completely different things and are therefore on different scales.
In order to properly fit linear regression models with regularization, we must account for those differences.
It is also important to encode the categorical features, like fuel_type, so sklearn can use them.
[3]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pickle
with open('data/category_list', 'rb') as file:
cat_cols = pickle.load(file)
# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]
# normalize numeric features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(X_train[num_cols])
# encode categories
encoder = OneHotEncoder(sparse=False)
cat_encoded = encoder.fit_transform(X_train[cat_cols])
# all together
X_train_proc = np.concatenate([cat_encoded, num_scaled] ,axis=1)
X_train_proc.shape
[3]:
(7000, 73)
[4]:
# apply transformations on test set
num_scaled = scaler.transform(X_test[num_cols])
# encode categories
cat_encoded = encoder.transform(X_test[cat_cols])
# all together
X_test_proc = np.concatenate([cat_encoded, num_scaled] ,axis=1)
X_test_proc.shape
[4]:
(3000, 73)
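The same preprocessing can also be written with sklearn's ColumnTransformer, which bundles the scaler and the encoder so they are fitted once on the training set and reused on the test set as a single object. The sketch below is just an equivalent alternative and is not used in the rest of the notebook; the names preprocessor, X_train_alt and X_test_alt are placeholders:
from sklearn.compose import ColumnTransformer
# equivalent preprocessor: one-hot columns first, then the scaled numeric
# columns, matching the concatenation order of the cells above
preprocessor = ColumnTransformer([
    ('categories', OneHotEncoder(sparse=False), cat_cols),
    ('numeric', StandardScaler(), num_cols),
])
X_train_alt = preprocessor.fit_transform(X_train)  # fit on the training set only
X_test_alt = preprocessor.transform(X_test)        # reuse the fitted scaler/encoder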
Hyper-parameter tuning and Cross Validation¶
Note that we are going to use GridSearchCV, so we can iterate over a grid of hyper-parameters for each model and find the best combination through cross validation.
Linear regression model¶
Let's start by fitting a simple sklearn linear regression model, without regularization.
[5]:
%load_ext autoreload
%autoreload 2
import aux_functions as aux
[6]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
grid_params = {'fit_intercept': [True, False]}
name = 'Linear Regression'
data = (X_train_proc, y_train, X_test_proc, y_test)
linear_results = aux.make_regressor(name, model, grid_params, data)
Linear Regression
Score r2: 0.9707
Score MSE: 2.122e+06
Time: 1.5e+01s
{'copy_X': True, 'fit_intercept': False, 'n_jobs': None, 'normalize': False}
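The helper aux.make_regressor is defined in aux_functions.py and is not shown in this notebook. Judging from how its results are used below, it wraps GridSearchCV and returns the best estimator, the best parameters and a dictionary of scores. A minimal sketch of such a wrapper, under those assumptions (the function name, the 5-fold cross validation and the back-transform of the log target are guesses, not the actual implementation):
import time
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

def make_regressor_sketch(name, model, grid_params, data, cv=5):
    # hypothetical GridSearchCV wrapper mirroring how make_regressor is used here
    X_train, y_train, X_test, y_test = data
    start = time.time()
    # exhaustive search over the hyper-parameter grid with k-fold cross validation
    grid = GridSearchCV(model, grid_params, cv=cv, scoring='neg_mean_squared_error')
    grid.fit(X_train, y_train)
    elapsed = time.time() - start
    # the target is log(price); the MSE magnitudes printed above suggest the scores
    # are computed on the original price scale, so we back-transform (assumption)
    y_pred = np.exp(grid.best_estimator_.predict(X_test))
    mse = mean_squared_error(np.exp(y_test), y_pred)
    r2 = r2_score(np.exp(y_test), y_pred)
    print(name)
    print(f'Score r2: {r2:.4f}')
    print(f'Score MSE: {mse:.4g}')
    print(f'Time: {elapsed:.2g}s')
    print(grid.best_estimator_.get_params())
    scores = {'model name': name, 'r2': r2, 'mse': mse, 'time': elapsed}
    return grid.best_estimator_, grid.best_params_, scores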
Regularized models - Ridge regularization¶
Ridge regression adds an L2 penalty that shrinks all coefficients towards zero, reducing the weight of the least important regressors.
L2 (squared magnitude) penalty: \(resid = \sum_i{(y_i - \hat y_i)^2} + \lambda \sum_j{\beta_j^2}\)
However, the underlying squared-error loss is not robust: it is sensitive to outliers.
[7]:
from sklearn.linear_model import Ridge
model = Ridge()
grid_params = {'alpha': [.001, .01, .1, 1]}
name = 'Ridge Regression'
ridge_results = aux.make_regressor(name, model, grid_params, data)
Ridge Regression
Score r2: 0.9707
Score MSE: 2.123e+06
Time: 4.1s
{'alpha': 0.01, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001}
Regularized models - Lasso¶
Lasso regression uses the L1 penalty, which is harsher on the least important regressors and can drive their coefficients exactly to zero.
As a consequence, lasso performs feature selection, although the selected subset is not a stable solution.
L1 (absolute magnitude) penalty: \(resid = \sum_i{(y_i - \hat y_i)^2} + \lambda \sum_j{|\beta_j|}\)
[8]:
from sklearn.linear_model import Lasso
model = Lasso()
grid_params = {'alpha': [.001, .01, .1, 1]}
name = 'Lasso Regression'
lasso_results = aux.make_regressor(name, model, grid_params, data)
Lasso Regression
Score r2: 0.9599
Score MSE: 2.637e+06
Time: 0.49s
{'alpha': 0.001, 'copy_X': True, 'fit_intercept': True, 'max_iter': 1000, 'normalize': False, 'positive': False, 'precompute': False, 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
Huber Regressor¶
The Huber regressor uses a loss that is quadratic for small residuals and linear for large ones (the switch point is controlled by epsilon), combined with an L2 penalty (alpha). The linear part makes it especially robust to outliers:
[9]:
from sklearn.linear_model import HuberRegressor
model = HuberRegressor(max_iter=1000)
grid_params = {
'alpha': [.001, .01, .1],
'epsilon': [1, 5, 10]
}
name = 'Huber Regression'
huber_results = aux.make_regressor(name, model, grid_params, data)
Huber Regression
Score r2: 0.9707
Score MSE: 2.123e+06
Time: 6.4e+01s
{'alpha': 0.001, 'epsilon': 10, 'fit_intercept': True, 'max_iter': 1000, 'tol': 1e-05, 'warm_start': False}
Comparing models - MSE and \(R^2\)¶
[10]:
df_scores = pd.DataFrame({
'MSE': [
linear_results[2]['mse'],
ridge_results[2]['mse'],
lasso_results[2]['mse'],
huber_results[2]['mse'],
],
'r2': [
linear_results[2]['r2'],
ridge_results[2]['r2'],
lasso_results[2]['r2'],
huber_results[2]['r2'],
],
'model name': [
linear_results[2]['model name'],
ridge_results[2]['model name'],
lasso_results[2]['model name'],
huber_results[2]['model name'],
],
'time': [
linear_results[2]['time'],
ridge_results[2]['time'],
lasso_results[2]['time'],
huber_results[2]['time'],
],
},
# index=['linear', 'ridge', 'lasso', 'huber']
)
# load ols results
df_scores_ols = pd.read_csv('data/scores.csv')
df_scores = pd.concat([df_scores, df_scores_ols], axis=0)
df_scores.to_csv('data/sk_scores.csv', index=False)
df_scores.sort_values(by='MSE').round(4)
[10]:
MSE | r2 | model name | time | |
---|---|---|---|---|
0 | 2.122211e+06 | 0.9707 | Linear Regression | 14.9926 |
3 | 2.123061e+06 | 0.9707 | Huber Regression | 63.8937 |
1 | 2.123265e+06 | 0.9707 | Ridge Regression | 4.0631 |
0 | 2.175057e+06 | 0.9690 | ols | 0.7920 |
2 | 2.636549e+06 | 0.9599 | Lasso Regression | 0.4869 |
Most important features¶
Let's check which features are the most important, aside from make:
[11]:
model = ridge_results[0]
cat_list = []
# make is the categories_[0]
for item in encoder.categories_[1:]:
for cat in item:
cat_list.append(cat)
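Depending on the sklearn version, the fitted encoder can also build the one-hot column names itself, which avoids ambiguity when the same category value appears in more than one column (for example two/four in both num_of_doors and num_of_cylinders). A sketch, assuming a version that still provides get_feature_names (newer releases use get_feature_names_out instead):
# alternative: prefixed names such as 'fuel_type_gas' taken straight from the encoder
encoded_names = encoder.get_feature_names(cat_cols)
# drop the make_* columns, mirroring the manual loop above
cat_list_alt = [name for name in encoded_names if not name.startswith('make_')]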
[12]:
features = cat_list + num_cols
coefs = model.coef_
# the first 21 coefficients (indices 0 to 20) are the make coefficients
df_features = pd.DataFrame({'feature': features, 'importance': coefs[21:]})
(df_features.sort_values(by='importance', ascending=False)
.set_index('feature').head(10)
.plot(kind='bar', figsize=(13, 5), rot=0)
)
[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x2340bb71a88>
Comparing the models - conclusions¶
The MSE analysis favours the Linear model: it has the smallest MSE
The \(R^2\) analysis is essentially a tie between the Linear, Huber and Ridge models
The Ridge model would be the safest choice: it matches the Linear model's scores almost exactly, and its L2 penalty protects against overfitting, which matters here because we kept all features instead of selecting them
OLS performs worse because it was built on the assumption of a linear phenomenon, using only the linearly correlated variables.
[14]:
df_scores.sort_values(by='MSE').round(4)
[14]:
MSE | r2 | model name | time | |
---|---|---|---|---|
0 | 2.122211e+06 | 0.9707 | Linear Regression | 14.9926 |
3 | 2.123061e+06 | 0.9707 | Huber Regression | 63.8937 |
1 | 2.123265e+06 | 0.9707 | Ridge Regression | 4.0631 |
0 | 2.175057e+06 | 0.9690 | ols | 0.7920 |
2 | 2.636549e+06 | 0.9599 | Lasso Regression | 0.4869 |