Modeling with machine learning

In this section, we will cover:

  • fitting different machine learning regression models with sklearn

  • score analysis: mean squared error (MSE) and variance explained, \(R^2\) (defined below)

  • comparing the models: conclusions
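
For reference, with \(y_i\) the observed values, \(\hat{y}_i\) the predictions and \(\bar{y}\) the mean of the observations, the two scores are defined as:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]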

[1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/df_resample.csv')
df.head()
[1]:
  symboling        make fuel_type aspiration num_of_doors body_style drive_wheels engine_location  wheel_base  length  ...  engine_size fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
0         0      nissan       gas        std         four      sedan          fwd           front        97.2   173.4  ...          120        2bbl  3.33    3.47                8.5        97.0    5200.0        27           34   9549.0
1         3  volkswagen       gas        std          two  hatchback          fwd           front        94.5   165.7  ...          109        mpfi  3.19    3.40                8.5        90.0    5500.0        24           29   9980.0
2         1         bmw       gas        std         four      sedan          rwd           front       103.5   189.0  ...          164        mpfi  3.31    3.19                9.0       121.0    4250.0        20           25  24565.0
3         2      subaru       gas        std          two  hatchback          fwd           front        93.7   156.9  ...           97        2bbl  3.62    2.36                9.0        69.0    4900.0        31           36   5118.0
4         0       mazda    diesel       std         four      sedan          rwd           front       104.9   175.0  ...          134         idi  3.43    3.64               22.0        72.0    4200.0        31           39  18344.0

5 rows × 25 columns

[2]:
X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transform here

# train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=95276
)
[3]:
# normalize and encode
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pickle

with open('data/category_list', 'rb') as file:
    cat_cols = pickle.load(file)

# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]

# normalize numeric features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(X_train[num_cols])

# encode categories; sparse=False returns a dense array (note: this
# parameter was renamed sparse_output in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse=False)
cat_encoded = encoder.fit_transform(X_train[cat_cols])

# all together
X_train_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
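
As a side note, the same preprocessing can be written as a single object with sklearn's ColumnTransformer, which keeps the fit-on-train / transform-on-test pairing together. A minimal sketch, equivalent to this cell and the next one:

from sklearn.compose import ColumnTransformer

# categories first, then scaled numerics, matching the concatenation order above
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(sparse=False), cat_cols),
    ('num', StandardScaler(), num_cols),
])

X_train_proc = preproc.fit_transform(X_train)  # fit on the training set only
X_test_proc = preproc.transform(X_test)        # reuse the fitted transformers
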
[4]:
# apply transformations on test set
num_scaled = scaler.transform(X_test[num_cols])

# encode categories
cat_encoded = encoder.transform(X_test[cat_cols])

# all together
X_test_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
X_test_proc.shape
[4]:
(3000, 73)

Hyper-parameter tuning and Cross Validation

Note that we are going to use sklearn's GridSearchCV, so we can iterate over a grid of hyper-parameters for each model and find the best combination of them through cross validation.

Decision Tree Regressor

Let's start by trying a simple sklearn decision tree regression model.

[5]:
%load_ext autoreload
%autoreload 2

import aux_functions as aux
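
aux_functions is a local helper module whose make_regressor wraps the grid search and scoring for each model. Its source is not shown in this notebook; based on how its return value is used below (the fitted best estimator at index 0 and a dict of scores at index 2), a minimal sketch could look like the following. The cv value, the middle return element, and the np.exp back-transform are assumptions; the back-transform is suggested by the reported MSE values, which are on the price scale rather than the log-price scale.

import time

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

def make_regressor(name, model, grid_params, data):
    """Grid-search `model` over `grid_params`, then score it on the test set."""
    X_train, y_train, X_test, y_test = data

    start = time.time()
    grid = GridSearchCV(model, grid_params, cv=5)  # cv=5 is an assumption
    grid.fit(X_train, y_train)
    elapsed = time.time() - start

    # back-transform the log-price predictions before scoring (assumed:
    # the MSE values printed below are on the price scale, not log-price)
    y_pred = np.exp(grid.predict(X_test))
    y_true = np.exp(y_test)

    scores = {
        'model name': name,
        'mse': mean_squared_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred),
        'time': elapsed,
    }

    print(name)
    print(f"Score r2: {scores['r2']:.4}")
    print(f"Score MSE: {scores['mse']:.4}")
    print(f"Time: {scores['time']:.2}s")
    print(grid.best_estimator_.get_params())

    # index 0 (the refitted best estimator) and index 2 (the score dict)
    # are the elements used in this notebook; the middle element is a guess
    return grid.best_estimator_, grid, scores
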
[6]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=95276)

grid_params = {
    'min_samples_split': [2, 5],
    'min_samples_leaf': [2, 5],
    'max_depth': [20, 25, 30]
}
name = 'Decision tree'
data = (X_train_proc, y_train, X_test_proc, y_test)

dt_results = aux.make_regressor(name, model, grid_params, data)
Decision tree
Score r2: 0.9985
Score MSE: 7.789e+04
Time: 1.8e+01s
{'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 20, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 95276, 'splitter': 'best'}

k-Nearest Neighbors

[7]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()

grid_params = {
    'n_neighbors': [5, 10],
    'p': [1, 2]
    }
name = 'knn'
knn_results = aux.make_regressor(name, model, grid_params, data)
knn
Score r2: 0.9985
Score MSE: 8.469e+04
Time: 1.3s
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}

Random Forests

[8]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

grid_params = {
    'max_features': [10, 15, 20],
    'max_depth': [20, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [2, 5],
}
name = 'RF'
rf_results = aux.make_regressor(name, model, grid_params, data)
RF
Score r2: 0.9985
Score MSE: 7.798e+04
Time: 1.5e+01s
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 30, 'max_features': 10, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Gradient tree boosting

Gradient tree boosting is also an ensemble machine learning method, but this one belongs to the boosting class: several weak models are combined sequentially to produce a powerful estimator with reduced bias.

The method is quite robust, partly because it offers regularization, e.g. shrinkage through the learning rate and subsampling.
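
The grid in the next cell varies only one hyper-parameter to keep the run short. Purely as an illustration (these values are hypothetical, not the ones used in this notebook), a wider search could tune those regularization knobs directly:

# illustrative only: a wider grid over GradientBoostingRegressor's
# regularization parameters
grid_params_wide = {
    'learning_rate': [0.05, 0.1],  # shrinkage: smaller values need more trees
    'n_estimators': [100, 300],
    'subsample': [0.8, 1.0],       # < 1.0 gives stochastic gradient boosting
    'max_depth': [2, 3],
    'min_samples_split': [5],
}
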

[9]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()

grid_params = {
    'min_samples_split': [5],
}
name = 'Gradient Boost'
gb_results = aux.make_regressor(name, model, grid_params, data)

Gradient Boost
Score r2: 0.9883
Score MSE: 5.876e+05
Time: 3.6s
{'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'ls', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}

AdaBoost

AdaBoost is another ensemble machine learning method of the boosting class.

This time, however, we can start from the best model we have so far. Copies of the base estimator are fitted repeatedly on the same dataset, but the weights of the training instances are adjusted according to the error of the current predictions, so later copies concentrate on the difficult cases.

Let's use our previously trained decision tree regressor as the base estimator.

[10]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(random_state=95276, base_estimator=dt_results[0])

grid_params = {
    'learning_rate': [.5, 1],
}
name = 'AdaBoost'
ada_results = aux.make_regressor(name, model, grid_params, data)
AdaBoost
Score r2: 0.9985
Score MSE: 7.865e+04
Time: 1.2s
{'base_estimator__ccp_alpha': 0.0, 'base_estimator__criterion': 'mse', 'base_estimator__max_depth': 20, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_impurity_split': None, 'base_estimator__min_samples_leaf': 2, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__presort': 'deprecated', 'base_estimator__random_state': 95276, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=20,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=2, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=95276, splitter='best'), 'learning_rate': 0.5, 'loss': 'linear', 'n_estimators': 50, 'random_state': 95276}

Comparing models - MSE and \(R^2\)

[11]:
# collect the test-set scores of the five ML models
df_scores = pd.DataFrame({
    'MSE': [
        dt_results[2]['mse'],
        knn_results[2]['mse'],
        rf_results[2]['mse'],
        gb_results[2]['mse'],
        ada_results[2]['mse'],
        ],
    'r2': [
        dt_results[2]['r2'],
        knn_results[2]['r2'],
        rf_results[2]['r2'],
        gb_results[2]['r2'],
        ada_results[2]['r2'],
        ],
    'model name': [
        dt_results[2]['model name'],
        knn_results[2]['model name'],
        rf_results[2]['model name'],
        gb_results[2]['model name'],
        ada_results[2]['model name'],
        ],
    'time': [
        dt_results[2]['time'],
        knn_results[2]['time'],
        rf_results[2]['time'],
        gb_results[2]['time'],
        ada_results[2]['time'],
        ],
})

# load the linear-model scores from the previous section
df_scores_old = pd.read_csv('data/sk_scores.csv')
df_scores = pd.concat([df_scores, df_scores_old], axis=0)

# take the square root to get the RMSE, on the same scale as the target
df_scores['rmse'] = np.sqrt(df_scores['MSE'])

# sort by RMSE and compute successive differences, to see what each
# accuracy gain costs in processing time
df_scores = df_scores.sort_values(by='rmse', ascending=False)
df_scores['dif_rmse'] = df_scores['rmse'].diff()
df_scores['dif_time'] = df_scores['time'].diff()
df_scores.to_csv('data/full_scores.csv', index=False)
df_scores.set_index('model name').round(3)
[11]:
                           MSE     r2    time      rmse  dif_rmse  dif_time
model name
Lasso Regression   2636548.565  0.960   0.487  1623.745       NaN       NaN
ols                2175057.358  0.969   0.792  1474.808  -148.938     0.305
Ridge Regression   2123265.316  0.971   4.063  1457.143   -17.665     3.271
Huber Regression   2123061.444  0.971  63.894  1457.073    -0.070    59.831
Linear Regression  2122211.274  0.971  14.993  1456.781    -0.292   -48.901
Gradient Boost      587570.524  0.988   3.617   766.531  -690.250   -11.376
knn                  84686.639  0.998   1.337   291.010  -475.522    -2.279
AdaBoost             78647.131  0.999   1.189   280.441   -10.569    -0.148
RF                   77984.792  0.999  15.136   279.258    -1.183    13.947
Decision tree        77889.082  0.999  17.938   279.086    -0.171     2.803

Conclusions

Considering the computation time and the error measured on the test set, we can conclude that:

  • OLS is not a good choice here unless we add the non-linear contributions of each feature to explain the outcome.

  • Among the sklearn linear models, which all reach almost the same scores, Ridge Regression would be the choice: it is the fastest of the top scorers to train, and it adjusts the weights of the features automatically, so we don't need to select them by hand.

  • Among the ML models, it's important to note that kNN took roughly 10x less time than the decision tree and random forest searches while producing a model almost as good.

  • The ML models improve RMSE roughly 5x over the linear models because they do not rely on linear relationships between the predictors and the outcome. The linear models could close part of that gap if we added interaction terms or non-linear transformations of the features, as sketched below.
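
As a hypothetical follow-up (not run in this notebook), sklearn's PolynomialFeatures could generate those squared and interaction terms from the numeric features before refitting the linear models:

from sklearn.preprocessing import PolynomialFeatures

# expand the scaled numeric block with squares and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
num_poly_train = poly.fit_transform(scaler.transform(X_train[num_cols]))
num_poly_test = poly.transform(scaler.transform(X_test[num_cols]))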