Data cleaning¶

In this notebook we will explore the dataset characteristics in order to clean and prepare it for analysis and modeling.

Getting the data¶

This dataset has handled to me through e-mail. I not sure whether or where it can be found over the web. I saw Microsoft using the same dataset on a training. In the future I’ll try to include the source here.

[1]:

import pandas as pd
import numpy as np
import pickle as pk

df_orign = pd.read_csv('data/Auto1-DS-TestData.csv')
df_orign.head()

[1]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450

5 rows × 26 columns

As we can see, the missing data is marked as ‘?’ in this dataset.

Lets start by replacing it with proper missing data indicators.

After that we can check the data types for each feature.

[2]:

df = df_orign.copy()
df = df.replace(to_replace='?', value= np.NaN)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  164 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       203 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               201 non-null    object
 19  stroke             201 non-null    object
 20  compression-ratio  205 non-null    float64
 21  horsepower         203 non-null    object
 22  peak-rpm           203 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              201 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB

[3]:

# fixing columns names
df.columns = df.columns.str.replace("-", "_")
df.describe()

[3]:

	symboling	wheel_base	length	width	height	curb_weight	engine_size	compression_ratio	city_mpg	highway_mpg
count	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000
mean	0.834146	98.756585	174.049268	65.907805	53.724878	2555.565854	126.907317	10.142537	25.219512	30.751220
std	1.245307	6.021776	12.337289	2.145204	2.443522	520.680204	41.642693	3.972040	6.542142	6.886443
min	-2.000000	86.600000	141.100000	60.300000	47.800000	1488.000000	61.000000	7.000000	13.000000	16.000000
25%	0.000000	94.500000	166.300000	64.100000	52.000000	2145.000000	97.000000	8.600000	19.000000	25.000000
50%	1.000000	97.000000	173.200000	65.500000	54.100000	2414.000000	120.000000	9.000000	24.000000	30.000000
75%	2.000000	102.400000	183.100000	66.900000	55.500000	2935.000000	141.000000	9.400000	30.000000	34.000000
max	3.000000	120.900000	208.100000	72.300000	59.800000	4066.000000	326.000000	23.000000	49.000000	54.000000

Missing data¶

With proper missing data indicators, we can have a better understanding of the quality of each record in the dataset and in each one of its features.

[4]:

# looking for NAs
na_count = df.isnull().sum()/df.shape[0]
na_count[na_count >0]

[4]:

normalized_losses    0.200000
num_of_doors         0.009756
bore                 0.019512
stroke               0.019512
horsepower           0.009756
peak_rpm             0.009756
price                0.019512
dtype: float64

Strategies for dealing with missing data¶

A number of different strategies can be used for dealing with missing data:

data imputation, filling NANs through mean or median by feature
same as above, but taking data structure in consideration, ie, the mean of normalized losses grouped by make
imputation with ML (actualy there are a lot of imputation methods available)
just dropping the data

It is important to notice that some knowledge on the business domain would be required for making the proper assumptions for executing data imputation.

As this is not the case, we don’t have any understanding on the business rules here, we chose to deal with the missing data by just dropping it.

Therefore:

With 20% of missing data for normalized_losses, we will just drop this feature entirely
for the other features, as they have only a small portion of missing data, 2% max, we will use them but we will drop the observations containing missing data.

[5]:

df.drop('normalized_losses', axis=1, inplace=True)
df.dropna(inplace=True)
na_count = df.isnull().sum()/df.shape[0]
na_count[na_count >0]

[5]:

Series([], dtype: float64)

Exploring and transforming the features¶

Now we will explore each feature individually.

The first thing to do is to understand what features should be considered as categories and what features should be considered as numeric. This is very important for building predictive models later.

[6]:

temp_categories = df.select_dtypes('object').columns.tolist()
temp_categories

[6]:

['make',
 'fuel_type',
 'aspiration',
 'num_of_doors',
 'body_style',
 'drive_wheels',
 'engine_location',
 'engine_type',
 'num_of_cylinders',
 'fuel_system',
 'bore',
 'stroke',
 'horsepower',
 'peak_rpm',
 'price']

[7]:

for col in temp_categories:
    print(col)
    print(df[col].unique())

make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
fuel_type
['gas' 'diesel']
aspiration
['std' 'turbo']
num_of_doors
['two' 'four']
body_style
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive_wheels
['rwd' 'fwd' '4wd']
engine_location
['front' 'rear']
engine_type
['dohc' 'ohcv' 'ohc' 'l' 'ohcf']
num_of_cylinders
['four' 'six' 'five' 'three' 'twelve' 'eight']
fuel_system
['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' 'idi' 'spdi']
bore
['3.47' '2.68' '3.19' '3.13' '3.50' '3.31' '3.62' '2.91' '3.03' '2.97'
 '3.34' '3.60' '2.92' '3.15' '3.43' '3.63' '3.54' '3.08' '3.39' '3.76'
 '3.58' '3.46' '3.80' '3.78' '3.17' '3.35' '3.59' '2.99' '3.33' '3.70'
 '3.61' '3.94' '3.74' '2.54' '3.05' '3.27' '3.24' '3.01']
stroke
['2.68' '3.47' '3.40' '2.80' '3.19' '3.39' '3.03' '3.11' '3.23' '3.46'
 '3.90' '3.41' '3.07' '3.58' '4.17' '2.76' '3.15' '3.16' '3.64' '3.10'
 '3.35' '3.12' '3.86' '3.29' '3.27' '3.52' '2.19' '3.21' '2.90' '2.07'
 '2.36' '2.64' '3.08' '3.50' '3.54' '2.87']
horsepower
['111' '154' '102' '115' '110' '140' '101' '121' '182' '48' '70' '68' '88'
 '145' '58' '76' '60' '86' '100' '78' '90' '176' '262' '84' '120' '72'
 '123' '155' '184' '175' '116' '69' '55' '97' '152' '160' '200' '95' '142'
 '143' '207' '73' '82' '94' '62' '56' '112' '92' '161' '156' '52' '85'
 '114' '162' '134' '106']
peak_rpm
['5000' '5500' '5800' '4250' '5400' '5100' '4800' '6000' '4750' '4200'
 '4350' '4500' '5200' '4150' '5600' '5900' '5250' '4900' '4400' '6600'
 '5300']
price
['13495' '16500' '13950' '17450' '15250' '17710' '18920' '23875' '16430'
 '16925' '20970' '21105' '24565' '30760' '41315' '36880' '5151' '6295'
 '6575' '5572' '6377' '7957' '6229' '6692' '7609' '8921' '12964' '6479'
 '6855' '5399' '6529' '7129' '7295' '7895' '9095' '8845' '10295' '12945'
 '10345' '6785' '11048' '32250' '35550' '36000' '5195' '6095' '6795'
 '6695' '7395' '8495' '10595' '10245' '11245' '18280' '18344' '25552'
 '28248' '28176' '31600' '34184' '35056' '40960' '45400' '16503' '5389'
 '6189' '6669' '7689' '9959' '8499' '12629' '14869' '14489' '6989' '8189'
 '9279' '5499' '7099' '6649' '6849' '7349' '7299' '7799' '7499' '7999'
 '8249' '8949' '9549' '13499' '14399' '17199' '19699' '18399' '11900'
 '13200' '12440' '13860' '15580' '16900' '16695' '17075' '16630' '17950'
 '18150' '12764' '22018' '32528' '34028' '37028' '11850' '12170' '15040'
 '15510' '18620' '5118' '7053' '7603' '7126' '7775' '9960' '9233' '11259'
 '7463' '10198' '8013' '11694' '5348' '6338' '6488' '6918' '7898' '8778'
 '6938' '7198' '7788' '7738' '8358' '9258' '8058' '8238' '9298' '9538'
 '8449' '9639' '9989' '11199' '11549' '17669' '8948' '10698' '9988'
 '10898' '11248' '16558' '15998' '15690' '15750' '7975' '7995' '8195'
 '9495' '9995' '11595' '9980' '13295' '13845' '12290' '12940' '13415'
 '15985' '16515' '18420' '18950' '16845' '19045' '21485' '22470' '22625']

[8]:

# bore, stroke, horsepower, peak_rpm and price must be converted to numeric
for col in ['bore', 'stroke', 'horsepower', 'peak_rpm', 'price']:
    df[col] = df[col].astype('float')

# checking categories again
temp_categories = df.select_dtypes('object').columns.tolist()
for col in temp_categories:
    print(col)
    print(df[col].unique())

make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
fuel_type
['gas' 'diesel']
aspiration
['std' 'turbo']
num_of_doors
['two' 'four']
body_style
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive_wheels
['rwd' 'fwd' '4wd']
engine_location
['front' 'rear']
engine_type
['dohc' 'ohcv' 'ohc' 'l' 'ohcf']
num_of_cylinders
['four' 'six' 'five' 'three' 'twelve' 'eight']
fuel_system
['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' 'idi' 'spdi']

[9]:

# lets replace drive_wheels 4wd with fwd
df['drive_wheels'].replace('4wd', 'fwd', inplace=True)
df['drive_wheels'].unique()

[9]:

array(['rwd', 'fwd'], dtype=object)

[10]:

# now lets take a look on the 'numeric' features
temp_num = [col for col in df.columns if col not in temp_categories]
df[temp_num].describe()

[10]:

	symboling	wheel_base	length	width	height	curb_weight	engine_size	bore	stroke	compression_ratio	horsepower	peak_rpm	city_mpg	highway_mpg	price
count	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000	193.000000
mean	0.797927	98.923834	174.326425	65.893782	53.869948	2561.507772	128.124352	3.330622	3.248860	10.143627	103.481865	5099.740933	25.326425	30.787565	13285.025907
std	1.235582	6.152409	12.478593	2.137795	2.394770	526.700026	41.590452	0.272385	0.315421	3.977491	37.960107	468.694369	6.387828	6.816910	8089.082886
min	-2.000000	86.600000	141.100000	60.300000	47.800000	1488.000000	61.000000	2.540000	2.070000	7.000000	48.000000	4150.000000	13.000000	16.000000	5118.000000
25%	0.000000	94.500000	166.300000	64.100000	52.000000	2145.000000	98.000000	3.150000	3.110000	8.500000	70.000000	4800.000000	19.000000	25.000000	7738.000000
50%	1.000000	97.000000	173.200000	65.400000	54.100000	2414.000000	120.000000	3.310000	3.290000	9.000000	95.000000	5100.000000	25.000000	30.000000	10245.000000
75%	2.000000	102.400000	184.600000	66.900000	55.700000	2952.000000	146.000000	3.590000	3.410000	9.400000	116.000000	5500.000000	30.000000	34.000000	16515.000000
max	3.000000	120.900000	208.100000	72.000000	59.800000	4066.000000	326.000000	3.940000	4.170000	23.000000	262.000000	6600.000000	49.000000	54.000000	45400.000000

[11]:

df[temp_num].sample(10)

[11]:

	symboling	wheel_base	length	width	height	curb_weight	engine_size	bore	stroke	compression_ratio	horsepower	peak_rpm	city_mpg	highway_mpg	price
122	1	93.7	167.3	63.8	50.8	2191	98	2.97	3.23	9.4	68.0	5500.0	31	38	7609.0
53	1	93.1	166.8	64.2	54.1	1945	91	3.03	3.15	9.0	68.0	5000.0	31	38	6695.0
178	3	102.9	183.5	67.7	52.0	2976	171	3.27	3.35	9.3	161.0	5200.0	20	24	16558.0
187	2	97.3	171.7	65.5	55.7	2319	97	3.01	3.40	23.0	68.0	4500.0	37	42	9495.0
64	0	98.8	177.8	66.5	55.5	2425	122	3.39	3.39	8.6	84.0	4800.0	26	32	11245.0
87	1	96.3	172.4	65.4	51.6	2403	110	3.17	3.46	7.5	116.0	5500.0	23	30	9279.0
68	-1	110.0	190.9	70.3	58.7	3750	183	3.58	3.64	21.5	123.0	4350.0	22	25	28248.0
89	1	94.5	165.3	63.8	54.5	1889	97	3.15	3.29	9.4	69.0	5200.0	31	37	5499.0
59	1	98.8	177.8	66.5	53.7	2385	122	3.39	3.39	8.6	84.0	4800.0	26	32	8845.0
36	0	96.5	157.1	63.9	58.3	2024	92	2.92	3.41	9.2	76.0	6000.0	30	34	7295.0

[12]:

# symboling is discrete and clearly should be treated as category too
temp_categories.append('symboling')
temp_categories

[12]:

['make',
 'fuel_type',
 'aspiration',
 'num_of_doors',
 'body_style',
 'drive_wheels',
 'engine_location',
 'engine_type',
 'num_of_cylinders',
 'fuel_system',
 'symboling']

Now that we have our categories properly addressed, lets write them down so we can use this information later.

[13]:

with open('data/category_list', 'wb') as file:
    pk.dump(temp_categories, file)

We will now export the treated data.

Also, we will create a python script containing a function for running all the tasks we just did, so we can transform this and new data as well when the need arises.

[14]:

df.to_csv('data/df_transformed.csv', index=False)