To run the code, use the Jupyter notebook vin_processing.ipynb
.
-
Load
vin
andepa
datasets from files:vin
gets loaded fromvin_file
which contains decoded vin data that has been joined with vtyp3 data.epa
gets loaded fromepa_file
contains mpg data.
-
Fix errors in datasets: e.g. wrong displacment for
ford
expedition
, typo in modelsolstice
(spelledsolistice
). -
Drop all records that are missing
make
,model
oryear
. -
Replace missing
fuelType1
withgasoline
. -
Make all fields lower case and trim white spaces.
-
Drop all vehicles that are in the following categories:
incomplete
,trailer
,motorcycle
,bus
,low speed vehicle (lsv)
. -
Get rid of duplicates fields: e.g.
gasoline, gasoline
becomesgasoline
. -
Drop records that are trucks based on a list of makes:
- volvo truck
- western star
- whitegmc
- winnebago
- winnebago industries, inc.
- workhorse
- ai-springfield
- autocar industries
- capacity of texas
- caterpillar
- e-one
- freightliner
- kenworth
- mack
- navistar
- peterbilt
- pierce manufacturing
- spartan motors chassis
- terex advance mixer
- the vehicle production group
- utilimaster motor corporation
- international.
-
Replace fields based on the following mapping:
'vin': {
'fuelType1': {
'compressed natural gas (cng)': 'natural gas',
'liquefied petroleum gas (propane or lpg)': 'natural gas',
'liquefied natural gas (lng)': 'natural gas',
'gasoline, diesel': 'gasoline',
'diesel, gasoline': 'gasoline',
'ethanol (e85)': 'ethanol',
'compressed natural gas (cng), gasoline': 'gasoline',
'gasoline, compressed natural gas (cng)': 'gasoline',
'compressed hydrogen / hydrogen': 'hydrogen',
'fuel cell': 'hydrogen',
},
'drive': {
'4x2': 'two',
'6x6': 'all',
'6x2': 'two',
'8x2': 'two',
'rwd/ rear wheel drive': 'two',
'fwd/front wheel drive': 'two',
'4x2, rwd/ rear wheel drive': 'two',
'4x2, fwd/front wheel drive': 'two',
'rwd/ rear wheel drive, 4x2': 'two',
'fwd/front wheel drive, 4x2': 'two',
'4wd/4-wheel drive/4x4': 'all',
'awd/all wheel drive': 'all',
},
'transmission_type': {
'manual/standard': 'manu',
'automated manual transmission (amt)': 'manu',
'manual/standard, manual/standard': 'manu',
'dual-clutch transmission (dct)': 'manu',
'continuously variable transmission (cvt)': 'auto',
'automatic': 'auto',
'automatic, continuously variable transmission (cvt)': 'auto',
}
},
'epa': {
'fuelType1': {
'regular gasoline': 'gasoline',
'premium gasoline': 'gasoline',
'midgrade gasoline': 'gasoline',
},
'drive': {
'rear-wheel drive': 'two',
'front-wheel drive': 'two',
'2-wheel drive': 'two',
'all-wheel drive': 'all',
'4-wheel drive': 'all',
'4-wheel or all-wheel drive': 'all',
'part-time 4-wheel drive': 'all',
},
}
}
e.g. for vin
, where fuelType1
is compressed natural gas (cng)
, it's replaced with natural gas
; where drive
is 4x2
, it becomes two
.
- Modify fuel type to identify flexible fuel vehicles and electric vehicles.
- Flexible fuel vehicles.
- In both
epa
andvin
, wheneverfuelType1
ormodel
orfuelType2
contains any of the following:ffv
,flexible
,ethanol
,e85
, ornatural gas
,fuelType1_mod
becomesffv
. - In
epa
, whenever the variableatvType
containsbi
orffv
, oreng_dscr
containsffv
,fuelType1_mod
is set toffv
.
- In both
- Electric vehicles.
- In both
epa
andvin
, ifmodel
containsplug
orvolt
, or iffuelType1
iselectric
andfuelType2
isgasoline
thenfuelType1_mod
isphev
. - For the remaining models, in both
epa
andvin
, ifmodel
containshev
orhybrid
or iffuelType1
isgasoline
andfuelType2
iselectric
thenfuelType1_mod
ishev
. - For the remaining models, in both
epa
andvin
, ifmodel
containsbev
orelectric
orfueltType1
iselectric
thenfuelType1_mod
isbev
. - In EPA, use
atvType
to determine the type of electric vehicle based on the following mapping:{'hybrid': 'hev', 'plug-in hybrid': 'phev', 'ev': 'bev'}
.
- In both
- Flexible fuel vehicles.
In vin
, split models using the regex [\w -]+
, e.g. 575 m maranello/575 m maranello f1
becomes 575 m maranello
and 575 m maranello f1
; sl2, sw2
becomes sl2
and sw2
.
In epa
, split any model name that contains the string '/|,'
; e.g. b2000/b2200/b2600
becomes b2000
, b2200
, and b2600
; and use regex magic to catch strings that need to be duplicated rally g15/25 2wd (passenger)
becomes rally g15 2wd (passenger)
, rally g25 2wd (passenger)
.
Modify datasets such that the fields that are being matched correspond and add custom variables.
- Extract
displacement
,transmission_speeds
,transmission_type
, - Add tonnage variable (named
type
). 1/4 and 1/2 ton has a type of15
, 3/4 ton is25
, and 1 ton is35
. - Add
weight
variable based onGVWR
. We're extracting the upper bound of the rangeGVWR
. - Modify models and makes so they correspond; e.g.
pathfinder armada
becomesarmada
;accord crosstour
becomescrosstour
.
-
Merge using
make
,model_mod
,year
,fuelType1_mod
,type
and all possible combinations of the following:drive_mod
,displ_mod
,cylinders
,transmission_type_mod
,transmission_speeds_mod
, while dropping 1 field, then 2, etc. and eventually all fields, successively; i.e.- match on all fields first:
make
,model_mod
,year
,fuelType1_mod
,type
,drive_mod
,displ_mod
,cylinders
,transmission_type_mod
,transmission_speeds_mod
; - drop 1 field:
- drop
transmission_speeds_mod
, match onmake
,model_mod
,year
,fuelType1_mod
,type
,drive_mod
,displ_mod
,cylinders
,transmission_type_mod
;
- drop
transmission_type_mod
, match onmake
,model_mod
,year
,fuelType1_mod
,type
,drive_mod
,displ_mod
,cylinders
,transmission_speeds_mod
;
- ...
- drop
- drop 2 fields:
- drop
transmission_type_mod
andtranmission_speeds_mod
, match onmake
,model_mod
,year
,fuelType1_mod
,type
,drive_mod
,dpdfispl_mod
,cylinders
; - drop
displ_mod
,cylinders
, match onmake
,model_mod
,year
,fuelType1_mod
,type
,drive_mod
,transmission_type_mod
,transmission_speeds_mod
- ...
- drop
- ...
- match on all fields first:
-
Merge using 'make', 'model_mod', 'year', 'type' (same as 1. but drop the fuel type) and use the same logic as 1. with the rest of the fields.
-
For all models that haven't been matched, if
weight
is above 8,000, tag as heavy and take out of the merging process. -
Merge using 'make', 'model_mod', 'year', 'fuelType1_mod' only.
Results of the merge:
Merging datasets
**************************************************
('Merging using:', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type'])
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 1.31%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 1.63%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 1.65%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 1.88%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 1.88%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 6.65%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'cylinders'])
Weighted match fraction: 48.72%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'transmission_type_mod'])
Weighted match fraction: 48.92%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod', 'transmission_speeds_mod'])
Weighted match fraction: 48.92%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 48.92%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 48.92%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 48.92%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 50.98%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 50.99%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 51.29%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 51.33%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'displ_mod'])
Weighted match fraction: 56.45%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'cylinders'])
Weighted match fraction: 58.11%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'transmission_type_mod'])
Weighted match fraction: 58.16%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod', 'transmission_speeds_mod'])
Weighted match fraction: 58.16%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'cylinders'])
Weighted match fraction: 80.25%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'transmission_type_mod'])
Weighted match fraction: 80.38%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod', 'transmission_speeds_mod'])
Weighted match fraction: 80.38%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 80.46%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 80.46%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 80.66%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'drive_mod'])
Weighted match fraction: 82.12%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'displ_mod'])
Weighted match fraction: 86.44%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'cylinders'])
Weighted match fraction: 87.13%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'transmission_type_mod'])
Weighted match fraction: 87.33%
('Matching on ', ['make', 'model_mod', 'year', 'fuelType1_mod', 'type', 'transmission_speeds_mod'])
Weighted match fraction: 87.33%
**************************************************
('Merging using:', ['make', 'model_mod', 'year', 'type'])
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 87.33%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 87.33%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 87.33%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 87.33%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 87.33%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 87.35%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'cylinders'])
Weighted match fraction: 88.14%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'transmission_type_mod'])
Weighted match fraction: 88.14%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod', 'transmission_speeds_mod'])
Weighted match fraction: 88.14%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 88.14%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 88.14%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 88.14%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 88.67%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 88.67%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 88.67%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'cylinders', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 88.68%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'displ_mod'])
Weighted match fraction: 88.76%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'cylinders'])
Weighted match fraction: 89.10%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'transmission_type_mod'])
Weighted match fraction: 89.10%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod', 'transmission_speeds_mod'])
Weighted match fraction: 89.10%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'cylinders'])
Weighted match fraction: 89.48%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'transmission_type_mod'])
Weighted match fraction: 89.48%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod', 'transmission_speeds_mod'])
Weighted match fraction: 89.48%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'cylinders', 'transmission_type_mod'])
Weighted match fraction: 89.48%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'cylinders', 'transmission_speeds_mod'])
Weighted match fraction: 89.48%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'transmission_type_mod', 'transmission_speeds_mod'])
Weighted match fraction: 89.48%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'drive_mod'])
Weighted match fraction: 89.59%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'displ_mod'])
Weighted match fraction: 89.63%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'cylinders'])
Weighted match fraction: 89.70%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'transmission_type_mod'])
Weighted match fraction: 89.71%
('Matching on ', ['make', 'model_mod', 'year', 'type', 'transmission_speeds_mod'])
Weighted match fraction: 89.71%
**************************************************
('Matching on', ['make', 'model_mod', 'year', 'fuelType1_mod'])
Weighted match fraction: 99.35%
**************************************************