Modeling with machine learning

In this section, we will cover:

  • fitting different machine learning regression models with sklearn

  • score analysis: mean squared error (MSE) and variance explained, \(R^2\) (defined below)

  • comparing the models: conclusions
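
For reference, with \(y_i\) the observed values, \(\hat{y}_i\) the predictions and \(\bar{y}\) the mean of the observations, the two scores are defined as:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]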

[1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/df_resample.csv')
df.head()
[1]:
  symboling        make fuel_type aspiration num_of_doors body_style drive_wheels engine_location  wheel_base  length  ...  engine_size fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
0         0      nissan       gas        std         four      sedan          fwd           front        97.2   173.4  ...          120        2bbl  3.33    3.47                8.5        97.0    5200.0        27           34   9549.0
1         3  volkswagen       gas        std          two  hatchback          fwd           front        94.5   165.7  ...          109        mpfi  3.19    3.40                8.5        90.0    5500.0        24           29   9980.0
2         1         bmw       gas        std         four      sedan          rwd           front       103.5   189.0  ...          164        mpfi  3.31    3.19                9.0       121.0    4250.0        20           25  24565.0
3         2      subaru       gas        std          two  hatchback          fwd           front        93.7   156.9  ...           97        2bbl  3.62    2.36                9.0        69.0    4900.0        31           36   5118.0
4         0       mazda    diesel       std         four      sedan          rwd           front       104.9   175.0  ...          134         idi  3.43    3.64               22.0        72.0    4200.0        31           39  18344.0

5 rows × 25 columns

[2]:
X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transform here

# train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=95276
)
[3]:
# normalize and encode
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pickle

with open('data/category_list', 'rb') as file:
    cat_cols = pickle.load(file)

# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]

# normalize numeric features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(X_train[num_cols])

# encode categories; sparse=False returns a dense array (note: this
# parameter was renamed sparse_output in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse=False)
cat_encoded = encoder.fit_transform(X_train[cat_cols])

# all together
X_train_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
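
As a side note, the same preprocessing can be written as a single object with sklearn's ColumnTransformer, which keeps the fit-on-train / transform-on-test pairing together. A minimal sketch, equivalent to this cell and the next one:

from sklearn.compose import ColumnTransformer

# categories first, then scaled numerics, matching the concatenation order above
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(sparse=False), cat_cols),
    ('num', StandardScaler(), num_cols),
])

X_train_proc = preproc.fit_transform(X_train)  # fit on the training set only
X_test_proc = preproc.transform(X_test)        # reuse the fitted transformers
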
[4]:
# apply transformations on test set
num_scaled = scaler.transform(X_test[num_cols])

# encode categories
cat_encoded = encoder.transform(X_test[cat_cols])

# all together
X_test_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
X_test_proc.shape
[4]:
(3000, 73)

Hyper-parameter tuning and Cross Validation

Note that we are going to use sklearn's GridSearchCV, so we can iterate over a grid of hyper-parameters for each model and find the best combination of them through cross validation.

Decision Tree Regressor

Let's start by trying a simple sklearn decision tree regression model.

[5]:
%load_ext autoreload
%autoreload 2

import aux_functions as aux
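
aux_functions is a local helper module whose make_regressor wraps the grid search and scoring for each model. Its source is not shown in this notebook; based on how its return value is used below (the fitted best estimator at index 0 and a dict of scores at index 2), a minimal sketch could look like the following. The cv value, the middle return element, and the np.exp back-transform are assumptions; the back-transform is suggested by the reported MSE values, which are on the price scale rather than the log-price scale.

import time

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

def make_regressor(name, model, grid_params, data):
    """Grid-search `model` over `grid_params`, then score it on the test set."""
    X_train, y_train, X_test, y_test = data

    start = time.time()
    grid = GridSearchCV(model, grid_params, cv=5)  # cv=5 is an assumption
    grid.fit(X_train, y_train)
    elapsed = time.time() - start

    # back-transform the log-price predictions before scoring (assumed:
    # the MSE values printed below are on the price scale, not log-price)
    y_pred = np.exp(grid.predict(X_test))
    y_true = np.exp(y_test)

    scores = {
        'model name': name,
        'mse': mean_squared_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred),
        'time': elapsed,
    }

    print(name)
    print(f"Score r2: {scores['r2']:.4}")
    print(f"Score MSE: {scores['mse']:.4}")
    print(f"Time: {scores['time']:.2}s")
    print(grid.best_estimator_.get_params())

    # index 0 (the refitted best estimator) and index 2 (the score dict)
    # are the elements used in this notebook; the middle element is a guess
    return grid.best_estimator_, grid, scores
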
[6]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=95276)

grid_params = {
    'min_samples_split': [2, 5],
    'min_samples_leaf': [2, 5],
    'max_depth': [20, 25, 30]
}
name = 'Decision tree'
data = (X_train_proc, y_train, X_test_proc, y_test)

dt_results = aux.make_regressor(name, model, grid_params, data)
Decision tree
Score r2: 0.9985
Score MSE: 7.789e+04
Time: 1.8e+01s
{'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 20, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 95276, 'splitter': 'best'}

k-Nearest Neighbors

[7]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()

grid_params = {
    'n_neighbors': [5, 10],
    'p': [1, 2]
    }
name = 'knn'
knn_results = aux.make_regressor(name, model, grid_params, data)
knn
Score r2: 0.9985
Score MSE: 8.469e+04
Time: 1.3s
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}

Random Forests

[8]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

grid_params = {
    'max_features': [10, 15, 20],
    'max_depth': [20, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [2, 5],
}
name = 'RF'
rf_results = aux.make_regressor(name, model, grid_params, data)
RF
Score r2: 0.9985
Score MSE: 7.798e+04
Time: 1.5e+01s
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 30, 'max_features': 10, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Gradient tree boosting

Gradient tree boosting is also an ensemble machine learning method, but this one belongs to the boosting class: several weak models are combined sequentially to produce a powerful estimator with reduced bias.

The method is quite robust, partly because it offers regularization, e.g. shrinkage through the learning rate and subsampling.
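
The grid in the next cell varies only one hyper-parameter to keep the run short. Purely as an illustration (these values are hypothetical, not the ones used in this notebook), a wider search could tune those regularization knobs directly:

# illustrative only: a wider grid over GradientBoostingRegressor's
# regularization parameters
grid_params_wide = {
    'learning_rate': [0.05, 0.1],  # shrinkage: smaller values need more trees
    'n_estimators': [100, 300],
    'subsample': [0.8, 1.0],       # < 1.0 gives stochastic gradient boosting
    'max_depth': [2, 3],
    'min_samples_split': [5],
}
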

[9]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()

grid_params = {
    'min_samples_split': [5],
}
name = 'Gradient Boost'
gb_results = aux.make_regressor(name, model, grid_params, data)

Gradient Boost
Score r2: 0.9883
Score MSE: 5.876e+05
Time: 3.6s
{'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'ls', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}

AdaBoost

AdaBoost is another ensemble machine learning method of the boosting class.

This time, however, we can start from the best model we have so far. Copies of the base estimator are fitted repeatedly on the same dataset, but the weights of the training instances are adjusted according to the error of the current predictions, so later copies concentrate on the difficult cases.

Let's use our previously trained decision tree regressor as the base estimator.

[10]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(random_state=95276, base_estimator=dt_results[0])

grid_params = {
    'learning_rate': [.5, 1],
}
name = 'AdaBoost'
ada_results = aux.make_regressor(name, model, grid_params, data)
AdaBoost
Score r2: 0.9985
Score MSE: 7.865e+04
Time: 1.2s
{'base_estimator__ccp_alpha': 0.0, 'base_estimator__criterion': 'mse', 'base_estimator__max_depth': 20, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_impurity_split': None, 'base_estimator__min_samples_leaf': 2, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__presort': 'deprecated', 'base_estimator__random_state': 95276, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=20,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=2, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=95276, splitter='best'), 'learning_rate': 0.5, 'loss': 'linear', 'n_estimators': 50, 'random_state': 95276}

Comparing models - MSE and \(R^2\)

[11]:
# collect the test-set scores of the five ML models
df_scores = pd.DataFrame({
    'MSE': [
        dt_results[2]['mse'],
        knn_results[2]['mse'],
        rf_results[2]['mse'],
        gb_results[2]['mse'],
        ada_results[2]['mse'],
        ],
    'r2': [
        dt_results[2]['r2'],
        knn_results[2]['r2'],
        rf_results[2]['r2'],
        gb_results[2]['r2'],
        ada_results[2]['r2'],
        ],
    'model name': [
        dt_results[2]['model name'],
        knn_results[2]['model name'],
        rf_results[2]['model name'],
        gb_results[2]['model name'],
        ada_results[2]['model name'],
        ],
    'time': [
        dt_results[2]['time'],
        knn_results[2]['time'],
        rf_results[2]['time'],
        gb_results[2]['time'],
        ada_results[2]['time'],
        ],
})

# load the linear-model scores from the previous section
df_scores_old = pd.read_csv('data/sk_scores.csv')
df_scores = pd.concat([df_scores, df_scores_old], axis=0)

# take the square root to get the RMSE, on the same scale as the target
df_scores['rmse'] = np.sqrt(df_scores['MSE'])

# sort by RMSE and compute successive differences, to see what each
# accuracy gain costs in processing time
df_scores = df_scores.sort_values(by='rmse', ascending=False)
df_scores['dif_rmse'] = df_scores['rmse'].diff()
df_scores['dif_time'] = df_scores['time'].diff()
df_scores.to_csv('data/full_scores.csv', index=False)
df_scores.set_index('model name').round(3)
[11]:
                           MSE     r2    time      rmse  dif_rmse  dif_time
model name
Lasso Regression   2636548.565  0.960   0.487  1623.745       NaN       NaN
ols                2175057.358  0.969   0.792  1474.808  -148.938     0.305
Ridge Regression   2123265.316  0.971   4.063  1457.143   -17.665     3.271
Huber Regression   2123061.444  0.971  63.894  1457.073    -0.070    59.831
Linear Regression  2122211.274  0.971  14.993  1456.781    -0.292   -48.901
Gradient Boost      587570.524  0.988   3.617   766.531  -690.250   -11.376
knn                  84686.639  0.998   1.337   291.010  -475.522    -2.279
AdaBoost             78647.131  0.999   1.189   280.441   -10.569    -0.148
RF                   77984.792  0.999  15.136   279.258    -1.183    13.947
Decision tree        77889.082  0.999  17.938   279.086    -0.171     2.803

Conclusions

Considering the computation time and the error measured on the test set, we can conclude that:

  • OLS is not a good choice here unless we add the non-linear contributions of each feature to explain the outcome.

  • Among the sklearn linear models, which all reach almost the same scores, Ridge Regression would be the choice: it is the fastest of the top scorers to train, and it adjusts the weights of the features automatically, so we don't need to select them by hand.

  • Among the ML models, it's important to note that kNN took roughly 10x less time than the decision tree and random forest searches while producing a model almost as good.

  • The ML models improve RMSE roughly 5x over the linear models because they do not rely on linear relationships between the predictors and the outcome. The linear models could close part of that gap if we added interaction terms or non-linear transformations of the features, as sketched below.
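
As a hypothetical follow-up (not run in this notebook), sklearn's PolynomialFeatures could generate those squared and interaction terms from the numeric features before refitting the linear models:

from sklearn.preprocessing import PolynomialFeatures

# expand the scaled numeric block with squares and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
num_poly_train = poly.fit_transform(scaler.transform(X_train[num_cols]))
num_poly_test = poly.transform(scaler.transform(X_test[num_cols]))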