Modeling with machine learning¶
In this section, we will cover:
fitting different machine learning regression models with sklearn
score analysis: MSE and variance explained, \(R^2\)
comparing the models and drawing conclusions
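For reference throughout the section, the two scores are the mean squared error, \(\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\), and the coefficient of determination, \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\), which measures the fraction of the outcome's variance explained by the model.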
[1]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/df_resample.csv')
df.head()
[1]:
 | symboling | make | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location | wheel_base | length | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | nissan | gas | std | four | sedan | fwd | front | 97.2 | 173.4 | ... | 120 | 2bbl | 3.33 | 3.47 | 8.5 | 97.0 | 5200.0 | 27 | 34 | 9549.0
1 | 3 | volkswagen | gas | std | two | hatchback | fwd | front | 94.5 | 165.7 | ... | 109 | mpfi | 3.19 | 3.40 | 8.5 | 90.0 | 5500.0 | 24 | 29 | 9980.0
2 | 1 | bmw | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | ... | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121.0 | 4250.0 | 20 | 25 | 24565.0
3 | 2 | subaru | gas | std | two | hatchback | fwd | front | 93.7 | 156.9 | ... | 97 | 2bbl | 3.62 | 2.36 | 9.0 | 69.0 | 4900.0 | 31 | 36 | 5118.0
4 | 0 | mazda | diesel | std | four | sedan | rwd | front | 104.9 | 175.0 | ... | 134 | idi | 3.43 | 3.64 | 22.0 | 72.0 | 4200.0 | 31 | 39 | 18344.0
5 rows × 25 columns
[2]:
X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transform here
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=.3, random_state=95276
)
[3]:
# normalize and encode
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pickle
with open('data/category_list', 'rb') as file:
cat_cols = pickle.load(file)
# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]
# normalize numeric features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(X_train[num_cols])
# encode categories
encoder = OneHotEncoder(sparse=False)
cat_encoded = encoder.fit_transform(X_train[cat_cols])
# all together
X_train_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
[4]:
# apply transformations on test set
num_scaled = scaler.transform(X_test[num_cols])
# encode categories
cat_encoded = encoder.transform(X_test[cat_cols])
# all together
X_test_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
X_test_proc.shape
[4]:
(3000, 73)
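As a side note, the same preprocessing could be expressed with a single ColumnTransformer. This is only a sketch of an equivalent formulation (the notebook itself keeps the scaler and the encoder separate), assuming the same cat_cols and num_cols lists:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# one-hot encoded categories first, scaled numeric columns second,
# matching the concatenation order used above
# (sparse=False gives a dense array; newer sklearn versions call this sparse_output)
preprocess = ColumnTransformer([
    ('categories', OneHotEncoder(sparse=False), cat_cols),
    ('numeric', StandardScaler(), num_cols),
])
X_train_proc = preprocess.fit_transform(X_train)  # fit on the training set only
X_test_proc = preprocess.transform(X_test)        # reuse the fitted statistics
```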
Hyper-parameter tuning and Cross Validation¶
It is important to note that we are going to use the GridSearchCV method, so we can iterate over a grid of hyper-parameters for each model and find the best combination through cross-validation.
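The grid search, scoring and timing are wrapped in the helper aux.make_regressor from aux_functions.py, which is not shown here. The sketch below is only an approximation of what such a helper could look like; the return layout, the back-transform of the log target before scoring and the printed summary are assumptions inferred from the outputs further down.

```python
import time
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

def make_regressor(name, model, grid_params, data, cv=5):
    """Grid-search `model` over `grid_params` and score the best fit on the test set."""
    X_train, y_train, X_test, y_test = data
    start = time.time()
    grid = GridSearchCV(model, grid_params, cv=cv)
    grid.fit(X_train, y_train)
    elapsed = time.time() - start
    # the target is log(price), so map predictions back to the price scale
    # before scoring (an assumption, based on the MSE magnitudes reported below)
    y_pred = np.exp(grid.predict(X_test))
    y_true = np.exp(y_test)
    scores = {
        'model name': name,
        'r2': r2_score(y_true, y_pred),
        'mse': mean_squared_error(y_true, y_pred),
        'time': elapsed,
    }
    print(name)
    print(f"Score r2: {scores['r2']:.4f}")
    print(f"Score MSE: {scores['mse']:.4g}")
    print(f"Time: {scores['time']:.2g}s")
    print(grid.best_estimator_.get_params())
    return grid.best_estimator_, grid, scores
```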
Decision Tree Regressor¶
Let's start with a simple sklearn decision tree regression model.
[5]:
%load_ext autoreload
%autoreload 2
import aux_functions as aux
[6]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=95276)
grid_params = {
'min_samples_split': [2, 5],
'min_samples_leaf': [2, 5],
'max_depth': [20, 25, 30]
}
name = 'Decision tree'
data = (X_train_proc, y_train, X_test_proc, y_test)
dt_results = aux.make_regressor(name, model, grid_params, data)
Decision tree
Score r2: 0.9985
Score MSE: 7.789e+04
Time: 1.8e+01s
{'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 20, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 95276, 'splitter': 'best'}
k-Nearest Neighbors¶
[7]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
grid_params = {
'n_neighbors': [5, 10],
'p': [1, 2]
}
name = 'knn'
knn_results = aux.make_regressor(name, model, grid_params, data)
knn
Score r2: 0.9985
Score MSE: 8.469e+04
Time: 1.3s
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}
Random Forests¶
[8]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
grid_params = {
'max_features': [10, 15, 20],
'max_depth': [20, 30],
'min_samples_split': [2, 5],
'min_samples_leaf': [2, 5],
}
name = 'RF'
rf_results = aux.make_regressor(name, model, grid_params, data)
RF
Score r2: 0.9985
Score MSE: 7.798e+04
Time: 1.5e+01s
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 30, 'max_features': 10, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Gradient tree boosting¶
Gradient tree boosting is also an ensemble machine learning method, but this time of the boosting class: many weak models (shallow trees) are combined sequentially, each one fitted to the errors left by the ensemble so far, producing a powerful estimator with reduced bias.
The method is quite robust because shrinkage (the learning rate) and the limited depth of each tree act as regularization.
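To make the mechanism concrete, here is a toy illustration of boosting with squared loss; it is only a sketch, not the library's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Fit shallow trees sequentially, each one on the current residuals."""
    y = np.asarray(y)
    prediction = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction                      # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # shrinkage regularizes each step
        trees.append(tree)
    return y.mean(), trees

def toy_predict(X, y_mean, trees, learning_rate=0.1):
    return y_mean + learning_rate * sum(tree.predict(X) for tree in trees)
```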
[9]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
grid_params = {
'min_samples_split': [5],
}
name = 'Gradient Boost'
gb_results = aux.make_regressor(name, model, grid_params, data)
Gradient Boost
Score r2: 0.9883
Score MSE: 5.876e+05
Time: 3.6s
{'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'ls', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
AdaBoost¶
AdaBoost is another ensemble machine learning method of the boosting class.
This time, however, we can start from the best model we have so far: copies of the base estimator are fitted repeatedly on the same dataset, but the sample weights are re-adjusted at each iteration according to the current prediction error, so later copies concentrate on the harder examples.
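A heavily simplified sketch of that reweighting loop, in the spirit of the AdaBoost.R2 scheme behind sklearn's AdaBoostRegressor (the real implementation also combines the copies via a weighted median at prediction time; this is not the library's code):

```python
import numpy as np
from sklearn.base import clone

def toy_adaboost_r2(X, y, base_estimator, n_estimators=50, learning_rate=1.0, seed=0):
    """Simplified AdaBoost.R2-style training loop (a sketch, not sklearn's code)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                              # start with uniform sample weights
    estimators, est_weights = [], []
    for _ in range(n_estimators):
        idx = rng.choice(n, size=n, replace=True, p=w)   # resample the data by weight
        est = clone(base_estimator).fit(X[idx], y[idx])
        err = np.abs(y - est.predict(X))
        if err.max() == 0:                               # perfect fit, nothing left to boost
            estimators.append(est)
            est_weights.append(1.0)
            break
        err = err / err.max()                            # linear loss, scaled to [0, 1]
        avg_loss = np.sum(w * err)
        if avg_loss >= 0.5:                              # learner too weak, stop early
            break
        beta = avg_loss / (1 - avg_loss)
        w = w * beta ** (learning_rate * (1 - err))      # easy samples lose weight
        w = w / w.sum()
        estimators.append(est)
        est_weights.append(learning_rate * np.log(1 / beta))
    return estimators, est_weights
```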
Let's use our previously tuned decision tree regressor as the base estimator.
[10]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(random_state=95276, base_estimator=dt_results[0])
grid_params = {
'learning_rate': [.5, 1],
}
name = 'AdaBoost'
ada_results = aux.make_regressor(name, model, grid_params, data)
AdaBoost
Score r2: 0.9985
Score MSE: 7.865e+04
Time: 1.2s
{'base_estimator__ccp_alpha': 0.0, 'base_estimator__criterion': 'mse', 'base_estimator__max_depth': 20, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_impurity_split': None, 'base_estimator__min_samples_leaf': 2, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__presort': 'deprecated', 'base_estimator__random_state': 95276, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=20,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=95276, splitter='best'), 'learning_rate': 0.5, 'loss': 'linear', 'n_estimators': 50, 'random_state': 95276}
Comparing models - MSE and \(R^2\)¶
[11]:
# collect the test-set scores of the five ML models in one dataframe
results = [dt_results, knn_results, rf_results, gb_results, ada_results]
df_scores = pd.DataFrame({
    'MSE': [res[2]['mse'] for res in results],
    'r2': [res[2]['r2'] for res in results],
    'model name': [res[2]['model name'] for res in results],
    'time': [res[2]['time'] for res in results],
})
# load the linear models' scores from the previous section
df_scores_old = pd.read_csv('data/sk_scores.csv')
df_scores = pd.concat([df_scores, df_scores_old], axis=0)
# compute the RMSE
df_scores['rmse'] = np.sqrt(df_scores['MSE'])
# measure how much extra processing time each RMSE improvement costs
df_scores = df_scores.sort_values(by='rmse', ascending=False)
df_scores['dif_rmse'] = df_scores['rmse'].diff()
df_scores['dif_time'] = df_scores['time'].diff()
df_scores.to_csv('data/full_scores.csv', index=False)
df_scores.set_index('model name').round(3)
[11]:
model name | MSE | r2 | time | rmse | dif_rmse | dif_time
---|---|---|---|---|---|---
Lasso Regression | 2636548.565 | 0.960 | 0.487 | 1623.745 | NaN | NaN
ols | 2175057.358 | 0.969 | 0.792 | 1474.808 | -148.938 | 0.305
Ridge Regression | 2123265.316 | 0.971 | 4.063 | 1457.143 | -17.665 | 3.271
Huber Regression | 2123061.444 | 0.971 | 63.894 | 1457.073 | -0.070 | 59.831
Linear Regression | 2122211.274 | 0.971 | 14.993 | 1456.781 | -0.292 | -48.901
Gradient Boost | 587570.524 | 0.988 | 3.617 | 766.531 | -690.250 | -11.376
knn | 84686.639 | 0.998 | 1.337 | 291.010 | -475.522 | -2.279
AdaBoost | 78647.131 | 0.999 | 1.189 | 280.441 | -10.569 | -0.148
RF | 77984.792 | 0.999 | 15.136 | 279.258 | -1.183 | 13.947
Decision tree | 77889.082 | 0.999 | 17.938 | 279.086 | -0.171 | 2.803
Conclusions¶
Considering the computation time and the error measured on the test set, we can conclude that:
OLS is not a good choice here unless we add the non-linear contributions of each feature to explain the outcome.
Given that all the sklearn linear models have almost the same scores, Ridge Regression would be the choice among them: it trains quickly and automatically shrinks the weights of each feature, so we don't need to select them manually.
Among the ML models, it's worth noting that KNN took roughly 10x less time than the decision tree and random forest grid searches while producing a model almost as good.
The best ML models improve the RMSE roughly 5x over the linear models because they do not rely solely on linear relationships between the predictors and the outcome. The linear models could perform better if we added interaction terms or non-linear transformations of the features, as sketched below.
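As an illustration of that last point, interaction and squared terms could be added in front of a linear model. This is just a sketch using the notebook's variable names (X_train_proc, y_train, etc.), not something that was run here:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree-2 expansion adds x_i * x_j interaction terms and x_i**2 terms
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X_train_proc, y_train)
print(poly_ridge.score(X_test_proc, y_test))  # R^2 on the log-price scale
```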