Data cleaning

In this notebook we will explore the dataset characteristics in order to clean and prepare it for analysis and modeling.

Getting the data

This dataset has handled to me through e-mail. I not sure whether or where it can be found over the web. I saw Microsoft using the same dataset on a training. In the future I’ll try to include the source here.

[1]:
import pandas as pd
import numpy as np
import pickle as pk

df_orign = pd.read_csv('data/Auto1-DS-TestData.csv')
df_orign.head()
[1]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

As we can see, the missing data is marked as ‘?’ in this dataset.

Lets start by replacing it with proper missing data indicators.

After that we can check the data types for each feature.

[2]:
df = df_orign.copy()
df = df.replace(to_replace='?', value= np.NaN)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  164 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       203 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               201 non-null    object
 19  stroke             201 non-null    object
 20  compression-ratio  205 non-null    float64
 21  horsepower         203 non-null    object
 22  peak-rpm           203 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              201 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
[3]:
# fixing columns names
df.columns = df.columns.str.replace("-", "_")
df.describe()
[3]:
symboling wheel_base length width height curb_weight engine_size compression_ratio city_mpg highway_mpg
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 10.142537 25.219512 30.751220
std 1.245307 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 3.972040 6.542142 6.886443
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 7.000000 13.000000 16.000000
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 8.600000 19.000000 25.000000
50% 1.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 9.000000 24.000000 30.000000
75% 2.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 9.400000 30.000000 34.000000
max 3.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 23.000000 49.000000 54.000000

Missing data

With proper missing data indicators, we can have a better understanding of the quality of each record in the dataset and in each one of its features.

[4]:
# looking for NAs
na_count = df.isnull().sum()/df.shape[0]
na_count[na_count >0]
[4]:
normalized_losses    0.200000
num_of_doors         0.009756
bore                 0.019512
stroke               0.019512
horsepower           0.009756
peak_rpm             0.009756
price                0.019512
dtype: float64

Strategies for dealing with missing data

A number of different strategies can be used for dealing with missing data:

  • data imputation, filling NANs through mean or median by feature

  • same as above, but taking data structure in consideration, ie, the mean of normalized losses grouped by make

  • imputation with ML (actualy there are a lot of imputation methods available)

  • just dropping the data

It is important to notice that some knowledge on the business domain would be required for making the proper assumptions for executing data imputation.

As this is not the case, we don’t have any understanding on the business rules here, we chose to deal with the missing data by just dropping it.

Therefore:

  • With 20% of missing data for normalized_losses, we will just drop this feature entirely

  • for the other features, as they have only a small portion of missing data, 2% max, we will use them but we will drop the observations containing missing data.

[5]:
df.drop('normalized_losses', axis=1, inplace=True)
df.dropna(inplace=True)
na_count = df.isnull().sum()/df.shape[0]
na_count[na_count >0]
[5]:
Series([], dtype: float64)

Exploring and transforming the features

Now we will explore each feature individually.

The first thing to do is to understand what features should be considered as categories and what features should be considered as numeric. This is very important for building predictive models later.

[6]:
temp_categories = df.select_dtypes('object').columns.tolist()
temp_categories
[6]:
['make',
 'fuel_type',
 'aspiration',
 'num_of_doors',
 'body_style',
 'drive_wheels',
 'engine_location',
 'engine_type',
 'num_of_cylinders',
 'fuel_system',
 'bore',
 'stroke',
 'horsepower',
 'peak_rpm',
 'price']
[7]:
for col in temp_categories:
    print(col)
    print(df[col].unique())
make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
fuel_type
['gas' 'diesel']
aspiration
['std' 'turbo']
num_of_doors
['two' 'four']
body_style
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive_wheels
['rwd' 'fwd' '4wd']
engine_location
['front' 'rear']
engine_type
['dohc' 'ohcv' 'ohc' 'l' 'ohcf']
num_of_cylinders
['four' 'six' 'five' 'three' 'twelve' 'eight']
fuel_system
['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' 'idi' 'spdi']
bore
['3.47' '2.68' '3.19' '3.13' '3.50' '3.31' '3.62' '2.91' '3.03' '2.97'
 '3.34' '3.60' '2.92' '3.15' '3.43' '3.63' '3.54' '3.08' '3.39' '3.76'
 '3.58' '3.46' '3.80' '3.78' '3.17' '3.35' '3.59' '2.99' '3.33' '3.70'
 '3.61' '3.94' '3.74' '2.54' '3.05' '3.27' '3.24' '3.01']
stroke
['2.68' '3.47' '3.40' '2.80' '3.19' '3.39' '3.03' '3.11' '3.23' '3.46'
 '3.90' '3.41' '3.07' '3.58' '4.17' '2.76' '3.15' '3.16' '3.64' '3.10'
 '3.35' '3.12' '3.86' '3.29' '3.27' '3.52' '2.19' '3.21' '2.90' '2.07'
 '2.36' '2.64' '3.08' '3.50' '3.54' '2.87']
horsepower
['111' '154' '102' '115' '110' '140' '101' '121' '182' '48' '70' '68' '88'
 '145' '58' '76' '60' '86' '100' '78' '90' '176' '262' '84' '120' '72'
 '123' '155' '184' '175' '116' '69' '55' '97' '152' '160' '200' '95' '142'
 '143' '207' '73' '82' '94' '62' '56' '112' '92' '161' '156' '52' '85'
 '114' '162' '134' '106']
peak_rpm
['5000' '5500' '5800' '4250' '5400' '5100' '4800' '6000' '4750' '4200'
 '4350' '4500' '5200' '4150' '5600' '5900' '5250' '4900' '4400' '6600'
 '5300']
price
['13495' '16500' '13950' '17450' '15250' '17710' '18920' '23875' '16430'
 '16925' '20970' '21105' '24565' '30760' '41315' '36880' '5151' '6295'
 '6575' '5572' '6377' '7957' '6229' '6692' '7609' '8921' '12964' '6479'
 '6855' '5399' '6529' '7129' '7295' '7895' '9095' '8845' '10295' '12945'
 '10345' '6785' '11048' '32250' '35550' '36000' '5195' '6095' '6795'
 '6695' '7395' '8495' '10595' '10245' '11245' '18280' '18344' '25552'
 '28248' '28176' '31600' '34184' '35056' '40960' '45400' '16503' '5389'
 '6189' '6669' '7689' '9959' '8499' '12629' '14869' '14489' '6989' '8189'
 '9279' '5499' '7099' '6649' '6849' '7349' '7299' '7799' '7499' '7999'
 '8249' '8949' '9549' '13499' '14399' '17199' '19699' '18399' '11900'
 '13200' '12440' '13860' '15580' '16900' '16695' '17075' '16630' '17950'
 '18150' '12764' '22018' '32528' '34028' '37028' '11850' '12170' '15040'
 '15510' '18620' '5118' '7053' '7603' '7126' '7775' '9960' '9233' '11259'
 '7463' '10198' '8013' '11694' '5348' '6338' '6488' '6918' '7898' '8778'
 '6938' '7198' '7788' '7738' '8358' '9258' '8058' '8238' '9298' '9538'
 '8449' '9639' '9989' '11199' '11549' '17669' '8948' '10698' '9988'
 '10898' '11248' '16558' '15998' '15690' '15750' '7975' '7995' '8195'
 '9495' '9995' '11595' '9980' '13295' '13845' '12290' '12940' '13415'
 '15985' '16515' '18420' '18950' '16845' '19045' '21485' '22470' '22625']
[8]:
# bore, stroke, horsepower, peak_rpm and price must be converted to numeric
for col in ['bore', 'stroke', 'horsepower', 'peak_rpm', 'price']:
    df[col] = df[col].astype('float')

# checking categories again
temp_categories = df.select_dtypes('object').columns.tolist()
for col in temp_categories:
    print(col)
    print(df[col].unique())
make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
fuel_type
['gas' 'diesel']
aspiration
['std' 'turbo']
num_of_doors
['two' 'four']
body_style
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive_wheels
['rwd' 'fwd' '4wd']
engine_location
['front' 'rear']
engine_type
['dohc' 'ohcv' 'ohc' 'l' 'ohcf']
num_of_cylinders
['four' 'six' 'five' 'three' 'twelve' 'eight']
fuel_system
['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' 'idi' 'spdi']
[9]:
# lets replace drive_wheels 4wd with fwd
df['drive_wheels'].replace('4wd', 'fwd', inplace=True)
df['drive_wheels'].unique()
[9]:
array(['rwd', 'fwd'], dtype=object)
[10]:
# now lets take a look on the 'numeric' features
temp_num = [col for col in df.columns if col not in temp_categories]
df[temp_num].describe()
[10]:
symboling wheel_base length width height curb_weight engine_size bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
count 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000
mean 0.797927 98.923834 174.326425 65.893782 53.869948 2561.507772 128.124352 3.330622 3.248860 10.143627 103.481865 5099.740933 25.326425 30.787565 13285.025907
std 1.235582 6.152409 12.478593 2.137795 2.394770 526.700026 41.590452 0.272385 0.315421 3.977491 37.960107 468.694369 6.387828 6.816910 8089.082886
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 98.000000 3.150000 3.110000 8.500000 70.000000 4800.000000 19.000000 25.000000 7738.000000
50% 1.000000 97.000000 173.200000 65.400000 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5100.000000 25.000000 30.000000 10245.000000
75% 2.000000 102.400000 184.600000 66.900000 55.700000 2952.000000 146.000000 3.590000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16515.000000
max 3.000000 120.900000 208.100000 72.000000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 262.000000 6600.000000 49.000000 54.000000 45400.000000
[11]:
df[temp_num].sample(10)
[11]:
symboling wheel_base length width height curb_weight engine_size bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
122 1 93.7 167.3 63.8 50.8 2191 98 2.97 3.23 9.4 68.0 5500.0 31 38 7609.0
53 1 93.1 166.8 64.2 54.1 1945 91 3.03 3.15 9.0 68.0 5000.0 31 38 6695.0
178 3 102.9 183.5 67.7 52.0 2976 171 3.27 3.35 9.3 161.0 5200.0 20 24 16558.0
187 2 97.3 171.7 65.5 55.7 2319 97 3.01 3.40 23.0 68.0 4500.0 37 42 9495.0
64 0 98.8 177.8 66.5 55.5 2425 122 3.39 3.39 8.6 84.0 4800.0 26 32 11245.0
87 1 96.3 172.4 65.4 51.6 2403 110 3.17 3.46 7.5 116.0 5500.0 23 30 9279.0
68 -1 110.0 190.9 70.3 58.7 3750 183 3.58 3.64 21.5 123.0 4350.0 22 25 28248.0
89 1 94.5 165.3 63.8 54.5 1889 97 3.15 3.29 9.4 69.0 5200.0 31 37 5499.0
59 1 98.8 177.8 66.5 53.7 2385 122 3.39 3.39 8.6 84.0 4800.0 26 32 8845.0
36 0 96.5 157.1 63.9 58.3 2024 92 2.92 3.41 9.2 76.0 6000.0 30 34 7295.0
[12]:
# symboling is discrete and clearly should be treated as category too
temp_categories.append('symboling')
temp_categories
[12]:
['make',
 'fuel_type',
 'aspiration',
 'num_of_doors',
 'body_style',
 'drive_wheels',
 'engine_location',
 'engine_type',
 'num_of_cylinders',
 'fuel_system',
 'symboling']

Now that we have our categories properly addressed, lets write them down so we can use this information later.

[13]:
with open('data/category_list', 'wb') as file:
    pk.dump(temp_categories, file)

We will now export the treated data.

Also, we will create a python script containing a function for running all the tasks we just did, so we can transform this and new data as well when the need arises.

[14]:
df.to_csv('data/df_transformed.csv', index=False)