The Di-Tech Challenge is organized by DiDi Chuxing, China's largest ride-hailing company. It challenges contestants to use real data to build a predictive model of the rider-driver supply and demand gap, in order to direct drivers to where riders will need to be picked up.
When I learned of the challenge announcement, I had just completed all of the required courses in the Udacity Machine Learning Engineer Nanodegree. It was exciting to choose this competition as my Capstone project and to practice and consolidate what I had learned throughout the Nanodegree program by tackling a real-world problem, and, what's more, in a highly competitive ongoing contest!
The development life cycle of this machine learning project can be summarized in the following steps:
- Define the problem as a supervised regression problem
- Define MAPE (Mean Absolute Percentage Error) as the metric used to evaluate the models
- Explore the data types, basic statistics, and distributions of the features and labels; perform univariate and bivariate visualizations to gain insight into the data and to guide feature engineering and the cross-validation strategy.
- Identify KNN, Random Forest, GBM, and neural networks as candidate models/algorithms; find the state-of-the-art benchmark these models aim to reach or beat.
- Perform feature engineering, feature transformation, and outlier and missing-value handling.
- Implement the models by leveraging the Sklearn, XGBoost, and TensorFlow machine learning libraries.
- Fine-tune the models via iterative feature selection/engineering, model selection, and hyperparameter tuning; use cross-validation to ensure that the models generalize well to unseen data.
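As a concrete illustration of the evaluation step above, here is a minimal sketch of a MAPE function. Note this is a simplification: the actual challenge metric averages the error per district and timeslot, while this version simply skips slots whose true gap is zero.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true > 0  # slots with a zero true gap are not scored
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))

score = mape([10, 20, 0, 5], [12, 18, 1, 5])  # ≈ 0.1 (mean of 0.2, 0.1, 0.0)
```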
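To illustrate the modeling step, below is a minimal sketch of training a gradient-boosted regressor on synthetic data. The project itself uses XGBoost for its GBM model; Sklearn's GradientBoostingRegressor is used here as a lighter stand-in, and all feature shapes and values are made up for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Toy stand-in for the engineered feature matrix: each row represents a
# (district, timeslot) pair; the target is the supply-demand gap.
rng = np.random.RandomState(42)
X = rng.rand(500, 4)                             # hypothetical features
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.rand(500)   # synthetic gap values

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=3, random_state=0)
gbm.fit(X_tr, y_tr)
pred = np.maximum(gbm.predict(X_va), 0)  # a supply-demand gap cannot be negative
```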
The best score on the Di-Tech Challenge leaderboard is about 0.39 (MAPE, lower is better), while our final model scores 0.42. That is not quite state of the art for this competition, but it is still quite respectable. The 0.03 gap might be narrowed further by experimenting with the improvement ideas listed in the improvement section of this document.
For details, please refer to the project report, Rider-Driver Supply and Demand Gap Forecast.pdf, which resides in the root directory of this GitHub project.
- Sklearn
- XGBoost
- TensorFlow
- Download the source code from here
- Get the data files (data_preprocessed.zip and data_raw.zip) from the Dropbox shared link
- Extract data_preprocessed.zip under the root directory of the project. After extraction, all temporary preprocessed dump files will be under the data_preprocessed folder
- Extract data_raw.zip under the root directory of the project. After extraction, all raw files will be under the data_raw folder
- In the console/terminal, change the current directory to implement
- Run the scripts:
  - Run python didineuralmodel.py to train/validate the neural network model
  - Run python knnmodel.py to train/validate the KNN model
  - Run python randomforestmodel.py to train/validate the random forest model
  - Run python xgboostmodel.py to train/validate the GBM model
  - Run python forwardfeatureselection.py to try out greedy forward feature selection
  - Run python tunemodels.py to try out grid search for the models
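The greedy forward feature selection exercised by forwardfeatureselection.py can be sketched roughly as follows. This is a simplified, illustrative version (linear model, synthetic data, made-up feature names f0/f1/f2), not the project's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, feature_names, model, max_features=None):
    """Greedily add the feature that most improves the mean CV score."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # score every candidate feature when added to the current subset
        cand = [(cross_val_score(model, X[:, selected + [j]], y, cv=3).mean(), j)
                for j in remaining]
        score, j = max(cand)
        if score <= best_score:
            break  # no remaining feature improves the CV score
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return [feature_names[j] for j in selected]

rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = 3 * X[:, 0]  # only the first (hypothetical) feature matters
chosen = forward_select(X, y, ["f0", "f1", "f2"], LinearRegression())
```

On this toy data the selector picks f0 first, since it alone explains the target; in the project the same loop would wrap one of the actual models and the engineered features.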