Data cleaning¶
In this notebook we will explore the dataset characteristics in order to clean and prepare it for analysis and modeling.
Getting the data¶
This dataset was handed to me through e-mail. I'm not sure whether or where it can be found on the web. I saw Microsoft using the same dataset in a training. In the future I'll try to include the source here.
[1]:
import pandas as pd
import numpy as np
import pickle as pk
df_orign = pd.read_csv('data/Auto1-DS-TestData.csv')
df_orign.head()
[1]:
symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
As we can see, missing data is marked with '?' in this dataset.
Let's start by replacing it with a proper missing-data indicator (NaN).
After that we can check the data types of each feature.
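As a side note, the same replacement could also be done at read time; a minimal sketch, assuming the same file path as above (df_alt is just an illustrative name):

# alternative (not used below): treat '?' as missing while reading the file
df_alt = pd.read_csv('data/Auto1-DS-TestData.csv', na_values='?')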
[2]:
df = df_orign.copy()
df = df.replace(to_replace='?', value=np.nan)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 164 non-null object
2 make 205 non-null object
3 fuel-type 205 non-null object
4 aspiration 205 non-null object
5 num-of-doors 203 non-null object
6 body-style 205 non-null object
7 drive-wheels 205 non-null object
8 engine-location 205 non-null object
9 wheel-base 205 non-null float64
10 length 205 non-null float64
11 width 205 non-null float64
12 height 205 non-null float64
13 curb-weight 205 non-null int64
14 engine-type 205 non-null object
15 num-of-cylinders 205 non-null object
16 engine-size 205 non-null int64
17 fuel-system 205 non-null object
18 bore 201 non-null object
19 stroke 201 non-null object
20 compression-ratio 205 non-null float64
21 horsepower 203 non-null object
22 peak-rpm 203 non-null object
23 city-mpg 205 non-null int64
24 highway-mpg 205 non-null int64
25 price 201 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
[3]:
# fixing column names: replace '-' with '_' so they are valid Python identifiers
df.columns = df.columns.str.replace("-", "_")
df.describe()
[3]:
symboling | wheel_base | length | width | height | curb_weight | engine_size | compression_ratio | city_mpg | highway_mpg | |
---|---|---|---|---|---|---|---|---|---|---|
count | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 |
mean | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 10.142537 | 25.219512 | 30.751220 |
std | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 3.972040 | 6.542142 | 6.886443 |
min | -2.000000 | 86.600000 | 141.100000 | 60.300000 | 47.800000 | 1488.000000 | 61.000000 | 7.000000 | 13.000000 | 16.000000 |
25% | 0.000000 | 94.500000 | 166.300000 | 64.100000 | 52.000000 | 2145.000000 | 97.000000 | 8.600000 | 19.000000 | 25.000000 |
50% | 1.000000 | 97.000000 | 173.200000 | 65.500000 | 54.100000 | 2414.000000 | 120.000000 | 9.000000 | 24.000000 | 30.000000 |
75% | 2.000000 | 102.400000 | 183.100000 | 66.900000 | 55.500000 | 2935.000000 | 141.000000 | 9.400000 | 30.000000 | 34.000000 |
max | 3.000000 | 120.900000 | 208.100000 | 72.300000 | 59.800000 | 4066.000000 | 326.000000 | 23.000000 | 49.000000 | 54.000000 |
Missing data¶
With proper missing data indicators, we can have a better understanding of the quality of each record in the dataset and in each one of its features.
[4]:
# fraction of missing values per feature
na_count = df.isnull().sum() / df.shape[0]
na_count[na_count > 0]
[4]:
normalized_losses 0.200000
num_of_doors 0.009756
bore 0.019512
stroke 0.019512
horsepower 0.009756
peak_rpm 0.009756
price 0.019512
dtype: float64
Strategies for dealing with missing data¶
A number of different strategies can be used for dealing with missing data:

- data imputation, filling NaNs with the mean or median of each feature
- the same as above, but taking the data structure into consideration, e.g. the mean of normalized_losses grouped by make (see the sketch after this list)
- imputation with ML models (there are actually many imputation methods available)
- simply dropping the data
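For illustration only, here is a minimal sketch of the second strategy (group-wise mean imputation), which we do not apply in this notebook; normalized_losses is first converted to numeric, since it is still stored as strings:

# hypothetical sketch: impute normalized_losses with the mean of its make group
imputed = df.copy()
imputed['normalized_losses'] = pd.to_numeric(imputed['normalized_losses'])
imputed['normalized_losses'] = (
    imputed.groupby('make')['normalized_losses']
           .transform(lambda s: s.fillna(s.mean()))
)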
It is important to note that some knowledge of the business domain would be required to make the proper assumptions for data imputation.
Since that is not the case here (we have no insight into the business rules), we choose to deal with the missing data by simply dropping it.
Therefore:

- with about 20% of its values missing, the normalized_losses feature will be dropped entirely
- the other features have only a small portion of missing data (2% at most), so we keep them but drop the observations containing missing values.
[5]:
df.drop('normalized_losses', axis=1, inplace=True)
df.dropna(inplace=True)
na_count = df.isnull().sum() / df.shape[0]
na_count[na_count > 0]
[5]:
Series([], dtype: float64)
Exploring and transforming the features¶
Now we will explore each feature individually.
The first thing to do is to understand what features should be considered as categories and what features should be considered as numeric. This is very important for building predictive models later.
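As an illustration of why this distinction matters (a sketch only, not part of this notebook's pipeline), categorical features would typically be one-hot encoded before modeling, while numeric features can be fed to a model directly:

# illustrative sketch: one-hot encode two of the categorical columns
example_dummies = pd.get_dummies(df[['make', 'body_style']], drop_first=True)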
[6]:
temp_categories = df.select_dtypes('object').columns.tolist()
temp_categories
[6]:
['make',
'fuel_type',
'aspiration',
'num_of_doors',
'body_style',
'drive_wheels',
'engine_location',
'engine_type',
'num_of_cylinders',
'fuel_system',
'bore',
'stroke',
'horsepower',
'peak_rpm',
'price']
[7]:
for col in temp_categories:
print(col)
print(df[col].unique())
make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
'plymouth' 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
fuel_type
['gas' 'diesel']
aspiration
['std' 'turbo']
num_of_doors
['two' 'four']
body_style
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive_wheels
['rwd' 'fwd' '4wd']
engine_location
['front' 'rear']
engine_type
['dohc' 'ohcv' 'ohc' 'l' 'ohcf']
num_of_cylinders
['four' 'six' 'five' 'three' 'twelve' 'eight']
fuel_system
['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' 'idi' 'spdi']
bore
['3.47' '2.68' '3.19' '3.13' '3.50' '3.31' '3.62' '2.91' '3.03' '2.97'
'3.34' '3.60' '2.92' '3.15' '3.43' '3.63' '3.54' '3.08' '3.39' '3.76'
'3.58' '3.46' '3.80' '3.78' '3.17' '3.35' '3.59' '2.99' '3.33' '3.70'
'3.61' '3.94' '3.74' '2.54' '3.05' '3.27' '3.24' '3.01']
stroke
['2.68' '3.47' '3.40' '2.80' '3.19' '3.39' '3.03' '3.11' '3.23' '3.46'
'3.90' '3.41' '3.07' '3.58' '4.17' '2.76' '3.15' '3.16' '3.64' '3.10'
'3.35' '3.12' '3.86' '3.29' '3.27' '3.52' '2.19' '3.21' '2.90' '2.07'
'2.36' '2.64' '3.08' '3.50' '3.54' '2.87']
horsepower
['111' '154' '102' '115' '110' '140' '101' '121' '182' '48' '70' '68' '88'
'145' '58' '76' '60' '86' '100' '78' '90' '176' '262' '84' '120' '72'
'123' '155' '184' '175' '116' '69' '55' '97' '152' '160' '200' '95' '142'
'143' '207' '73' '82' '94' '62' '56' '112' '92' '161' '156' '52' '85'
'114' '162' '134' '106']
peak_rpm
['5000' '5500' '5800' '4250' '5400' '5100' '4800' '6000' '4750' '4200'
'4350' '4500' '5200' '4150' '5600' '5900' '5250' '4900' '4400' '6600'
'5300']
price
['13495' '16500' '13950' '17450' '15250' '17710' '18920' '23875' '16430'
'16925' '20970' '21105' '24565' '30760' '41315' '36880' '5151' '6295'
'6575' '5572' '6377' '7957' '6229' '6692' '7609' '8921' '12964' '6479'
'6855' '5399' '6529' '7129' '7295' '7895' '9095' '8845' '10295' '12945'
'10345' '6785' '11048' '32250' '35550' '36000' '5195' '6095' '6795'
'6695' '7395' '8495' '10595' '10245' '11245' '18280' '18344' '25552'
'28248' '28176' '31600' '34184' '35056' '40960' '45400' '16503' '5389'
'6189' '6669' '7689' '9959' '8499' '12629' '14869' '14489' '6989' '8189'
'9279' '5499' '7099' '6649' '6849' '7349' '7299' '7799' '7499' '7999'
'8249' '8949' '9549' '13499' '14399' '17199' '19699' '18399' '11900'
'13200' '12440' '13860' '15580' '16900' '16695' '17075' '16630' '17950'
'18150' '12764' '22018' '32528' '34028' '37028' '11850' '12170' '15040'
'15510' '18620' '5118' '7053' '7603' '7126' '7775' '9960' '9233' '11259'
'7463' '10198' '8013' '11694' '5348' '6338' '6488' '6918' '7898' '8778'
'6938' '7198' '7788' '7738' '8358' '9258' '8058' '8238' '9298' '9538'
'8449' '9639' '9989' '11199' '11549' '17669' '8948' '10698' '9988'
'10898' '11248' '16558' '15998' '15690' '15750' '7975' '7995' '8195'
'9495' '9995' '11595' '9980' '13295' '13845' '12290' '12940' '13415'
'15985' '16515' '18420' '18950' '16845' '19045' '21485' '22470' '22625']
[8]:
# bore, stroke, horsepower, peak_rpm and price must be converted to numeric
for col in ['bore', 'stroke', 'horsepower', 'peak_rpm', 'price']:
df[col] = df[col].astype('float')
# checking categories again
temp_categories = df.select_dtypes('object').columns.tolist()
for col in temp_categories:
print(col)
print(df[col].unique())
make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
'plymouth' 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
fuel_type
['gas' 'diesel']
aspiration
['std' 'turbo']
num_of_doors
['two' 'four']
body_style
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive_wheels
['rwd' 'fwd' '4wd']
engine_location
['front' 'rear']
engine_type
['dohc' 'ohcv' 'ohc' 'l' 'ohcf']
num_of_cylinders
['four' 'six' 'five' 'three' 'twelve' 'eight']
fuel_system
['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' 'idi' 'spdi']
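The conversion above used astype('float'), which works because the only non-numeric token, '?', has already been removed. If new data could contain other unexpected tokens, pd.to_numeric with errors='coerce' would be a more defensive alternative; a minimal sketch:

# hedged alternative: coerce anything non-numeric to NaN instead of raising an error
for col in ['bore', 'stroke', 'horsepower', 'peak_rpm', 'price']:
    df[col] = pd.to_numeric(df[col], errors='coerce')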
[9]:
# let's merge the '4wd' category of drive_wheels into 'fwd'
df['drive_wheels'] = df['drive_wheels'].replace('4wd', 'fwd')
df['drive_wheels'].unique()
[9]:
array(['rwd', 'fwd'], dtype=object)
[10]:
# now let's take a look at the 'numeric' features
temp_num = [col for col in df.columns if col not in temp_categories]
df[temp_num].describe()
[10]:
symboling | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193.000000 |
mean | 0.797927 | 98.923834 | 174.326425 | 65.893782 | 53.869948 | 2561.507772 | 128.124352 | 3.330622 | 3.248860 | 10.143627 | 103.481865 | 5099.740933 | 25.326425 | 30.787565 | 13285.025907 |
std | 1.235582 | 6.152409 | 12.478593 | 2.137795 | 2.394770 | 526.700026 | 41.590452 | 0.272385 | 0.315421 | 3.977491 | 37.960107 | 468.694369 | 6.387828 | 6.816910 | 8089.082886 |
min | -2.000000 | 86.600000 | 141.100000 | 60.300000 | 47.800000 | 1488.000000 | 61.000000 | 2.540000 | 2.070000 | 7.000000 | 48.000000 | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 |
25% | 0.000000 | 94.500000 | 166.300000 | 64.100000 | 52.000000 | 2145.000000 | 98.000000 | 3.150000 | 3.110000 | 8.500000 | 70.000000 | 4800.000000 | 19.000000 | 25.000000 | 7738.000000 |
50% | 1.000000 | 97.000000 | 173.200000 | 65.400000 | 54.100000 | 2414.000000 | 120.000000 | 3.310000 | 3.290000 | 9.000000 | 95.000000 | 5100.000000 | 25.000000 | 30.000000 | 10245.000000 |
75% | 2.000000 | 102.400000 | 184.600000 | 66.900000 | 55.700000 | 2952.000000 | 146.000000 | 3.590000 | 3.410000 | 9.400000 | 116.000000 | 5500.000000 | 30.000000 | 34.000000 | 16515.000000 |
max | 3.000000 | 120.900000 | 208.100000 | 72.000000 | 59.800000 | 4066.000000 | 326.000000 | 3.940000 | 4.170000 | 23.000000 | 262.000000 | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 |
[11]:
df[temp_num].sample(10)
[11]:
symboling | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
122 | 1 | 93.7 | 167.3 | 63.8 | 50.8 | 2191 | 98 | 2.97 | 3.23 | 9.4 | 68.0 | 5500.0 | 31 | 38 | 7609.0 |
53 | 1 | 93.1 | 166.8 | 64.2 | 54.1 | 1945 | 91 | 3.03 | 3.15 | 9.0 | 68.0 | 5000.0 | 31 | 38 | 6695.0 |
178 | 3 | 102.9 | 183.5 | 67.7 | 52.0 | 2976 | 171 | 3.27 | 3.35 | 9.3 | 161.0 | 5200.0 | 20 | 24 | 16558.0 |
187 | 2 | 97.3 | 171.7 | 65.5 | 55.7 | 2319 | 97 | 3.01 | 3.40 | 23.0 | 68.0 | 4500.0 | 37 | 42 | 9495.0 |
64 | 0 | 98.8 | 177.8 | 66.5 | 55.5 | 2425 | 122 | 3.39 | 3.39 | 8.6 | 84.0 | 4800.0 | 26 | 32 | 11245.0 |
87 | 1 | 96.3 | 172.4 | 65.4 | 51.6 | 2403 | 110 | 3.17 | 3.46 | 7.5 | 116.0 | 5500.0 | 23 | 30 | 9279.0 |
68 | -1 | 110.0 | 190.9 | 70.3 | 58.7 | 3750 | 183 | 3.58 | 3.64 | 21.5 | 123.0 | 4350.0 | 22 | 25 | 28248.0 |
89 | 1 | 94.5 | 165.3 | 63.8 | 54.5 | 1889 | 97 | 3.15 | 3.29 | 9.4 | 69.0 | 5200.0 | 31 | 37 | 5499.0 |
59 | 1 | 98.8 | 177.8 | 66.5 | 53.7 | 2385 | 122 | 3.39 | 3.39 | 8.6 | 84.0 | 4800.0 | 26 | 32 | 8845.0 |
36 | 0 | 96.5 | 157.1 | 63.9 | 58.3 | 2024 | 92 | 2.92 | 3.41 | 9.2 | 76.0 | 6000.0 | 30 | 34 | 7295.0 |
[12]:
# symboling is discrete and should clearly be treated as a category too
temp_categories.append('symboling')
temp_categories
[12]:
['make',
'fuel_type',
'aspiration',
'num_of_doors',
'body_style',
'drive_wheels',
'engine_location',
'engine_type',
'num_of_cylinders',
'fuel_system',
'symboling']
Now that we have our categories properly identified, let's save the list to disk so we can reuse this information later.
[13]:
with open('data/category_list', 'wb') as file:
pk.dump(temp_categories, file)
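Later, the list can be loaded back with pickle; a minimal sketch (category_list is just an illustrative name):

# illustrative: reading the saved category list back in another script or notebook
with open('data/category_list', 'rb') as file:
    category_list = pk.load(file)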
We will now export the cleaned data.
We will also create a Python script containing a function that runs all the tasks we just performed, so we can apply the same transformations to this data, and to new data, whenever the need arises.
[14]:
df.to_csv('data/df_transformed.csv', index=False)
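A minimal sketch of what such a script's function might look like, assuming the same decisions made in this notebook (the function name and signature are illustrative):

def clean_auto_data(path):
    """Apply the cleaning steps from this notebook to a raw CSV file."""
    data = pd.read_csv(path, na_values='?')
    # make column names valid Python identifiers
    data.columns = data.columns.str.replace('-', '_')
    # drop the feature with ~20% missing data, then the remaining incomplete rows
    data = data.drop('normalized_losses', axis=1).dropna()
    # numeric features that were stored as strings
    for col in ['bore', 'stroke', 'horsepower', 'peak_rpm', 'price']:
        data[col] = data[col].astype('float')
    # merge the '4wd' category into 'fwd', as done above
    data['drive_wheels'] = data['drive_wheels'].replace('4wd', 'fwd')
    return data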