🏆 AI For S.E.A. is an online challenge held by Grab to search for talented, innovative technologies across Southeast Asia. In this challenge, participants are tasked with selecting and tackling one problem statement (as shown in the image below) by leveraging data science and AI technologies.
💡 My objective is to build an end-to-end machine learning pipeline based on telematics data to detect dangerous driving on the road.
To start off the challenge, I began by performing exploratory data analysis to find out the class distribution as well as what constitutes safe or dangerous driving.
Full Dataset is available here
- Below is an example of the dataset provided
- A visualization of the class distribution of safe vs. unsafe driving (15007 safe vs. 4993 unsafe)
- A visualization of what constitutes safe vs. unsafe driving based on the acceleration, gyro, speed, and change in speed.
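To put the class distribution in numbers, here is a small sketch that turns the two counts quoted above into shares and an imbalance ratio. The label names `safe` and `unsafe` are just illustrative keys.

```python
# Class counts quoted above; the dictionary keys are illustrative names.
counts = {"safe": 15007, "unsafe": 4993}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts["safe"] / counts["unsafe"]

for label, share in shares.items():
    print(f"{label}: {counts[label]} labels ({share:.1%})")
print(f"safe-to-unsafe ratio: {imbalance_ratio:.2f} : 1")
```

Roughly one label in four is unsafe, so a model that always predicts "safe" would already be right about three times out of four.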
After analysing the data, I then performed feature engineering to create new features to supplement my machine learning algorithm later on.
Features Engineered include:
- Change in Bearing
- Change in Speed (Acceleration/Deceleration)
- Bucket Acceleration/Braking Values
- Bucket Speed values
- Magnitude of Acceleration/Gyro
- Change in Magnitude
- Total Distance Travelled in km
- No. of Danger Events Per Distance Travelled in km (Includes Acceleration/Braking/Speeding Events)
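A minimal sketch of how several of the features listed above could be computed with pandas. The column names (`bookingID`, `second`, `speed`, `bearing`, `acc_*`, `gyro_*`), the bucket edges, and the danger-event threshold are all assumptions made for illustration, not the values used in this project.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Add per-record features. Column names are assumed, not taken from the dataset."""
    df = df.sort_values(["bookingID", "second"]).copy()
    trip = df.groupby("bookingID")

    # Magnitude of the 3-axis acceleration and gyro vectors.
    df["acc_mag"] = np.sqrt(df[["acc_x", "acc_y", "acc_z"]].pow(2).sum(axis=1))
    df["gyro_mag"] = np.sqrt(df[["gyro_x", "gyro_y", "gyro_z"]].pow(2).sum(axis=1))

    # Change between consecutive records of the same trip (NaN on the first record).
    df["speed_change"] = trip["speed"].diff()
    df["acc_mag_change"] = df.groupby("bookingID")["acc_mag"].diff()

    # Change in bearing, wrapped into [-180, 180) so that 350 -> 10 counts as +20.
    df["bearing_change"] = (trip["bearing"].diff() + 180) % 360 - 180

    # Hypothetical speed buckets (speed assumed to be in m/s here).
    df["speed_bucket"] = pd.cut(df["speed"], bins=[-np.inf, 5, 15, np.inf],
                                labels=["low", "medium", "high"])

    # Distance covered per step: speed (m/s) * elapsed seconds, converted to km.
    df["step_km"] = df["speed"] * trip["second"].diff() / 1000
    return df

def danger_events_per_km(df, threshold=3.0):
    """Hypothetical rule: a danger event is a speed change larger than `threshold`."""
    events = (df["speed_change"].abs() > threshold).groupby(df["bookingID"]).sum()
    km = df.groupby("bookingID")["step_km"].sum()
    return events / km.where(km > 0)  # NaN for trips with zero distance

demo = pd.DataFrame({
    "bookingID": [1, 1, 1, 2, 2],
    "second": [0, 1, 2, 0, 1],
    "speed": [10.0, 12.0, 8.0, 5.0, 5.0],
    "bearing": [350.0, 10.0, 15.0, 90.0, 80.0],
    "acc_x": [3.0, 0.0, 0.0, 1.0, 0.0],
    "acc_y": [4.0, 0.0, 0.0, 0.0, 0.0],
    "acc_z": [0.0, 9.8, 9.8, 0.0, 9.8],
    "gyro_x": [0.0] * 5, "gyro_y": [0.0] * 5, "gyro_z": [0.0] * 5,
})
out = add_engineered_features(demo)
rate = danger_events_per_km(out)
print(out[["bookingID", "acc_mag", "speed_change", "bearing_change", "step_km"]])
print(rate)
```

Wrapping the bearing difference matters because a turn from 350° to 10° is a 20° change, not a 340° one.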
Changes made to original features include:
- Convert Speed from m/s to km/h
- Convert Gyro from rad/s to degree/s
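The two unit conversions above are simple scalings: 1 m/s is 3.6 km/h, and 1 rad/s is 180/π degrees per second. A short sketch, with function names chosen here for illustration:

```python
import math

def mps_to_kmh(speed_mps):
    """Convert metres per second to kilometres per hour (3600 s/h divided by 1000 m/km)."""
    return speed_mps * 3.6

def rad_to_deg_per_s(rate_rad_s):
    """Convert an angular rate from radians per second to degrees per second."""
    return math.degrees(rate_rad_s)

print(mps_to_kmh(10.0))
print(rad_to_deg_per_s(math.pi))
```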
Next, I prepared the data for the machine learning algorithm by one-hot encoding the categorical features and aggregating the features by BookingID.
- One-hot encode categorical features to transform them into an appropriate format for the machine learning algorithm
- Aggregate features to "expand the number of features"
- Before moving on to machine learning, I need to impute the missing values and split the dataset into training and test datasets.
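The preparation steps above can be sketched roughly as follows. The column names, the choice of `mean`/`max`/`std` as aggregates, median imputation, and the 80/20 split are assumptions for illustration; the write-up does not state which ones were used.

```python
import pandas as pd

def build_trip_table(records, categorical):
    """One-hot encode the categorical columns, then aggregate to one row per bookingID."""
    encoded = pd.get_dummies(records, columns=categorical, dtype=float)
    numeric = encoded.drop(columns="bookingID")
    trips = numeric.groupby(encoded["bookingID"]).agg(["mean", "max", "std"])
    # Flatten the (column, statistic) pairs into names like "speed_mean".
    trips.columns = [f"{col}_{stat}" for col, stat in trips.columns]
    return trips

def split_and_impute(trips, labels, test_frac=0.2, seed=0):
    """Hold out a random share of trips, then fill gaps with training-set medians."""
    test_idx = trips.sample(frac=test_frac, random_state=seed).index
    train, test = trips.drop(index=test_idx), trips.loc[test_idx]
    medians = train.median()  # learned from the training rows only
    return (train.fillna(medians), labels.loc[train.index],
            test.fillna(medians), labels.loc[test.index])

records = pd.DataFrame({
    "bookingID": [1, 1, 2, 2, 3, 4, 4, 5, 5, 5],
    "speed": [10.0, 14.0, 3.0, 5.0, 20.0, 8.0, 9.0, 1.0, 2.0, 6.0],
    "speed_bucket": ["medium", "medium", "low", "low", "high",
                     "medium", "medium", "low", "low", "medium"],
})
labels = pd.Series([0, 0, 1, 0, 1], index=[1, 2, 3, 4, 5], name="label")

trips = build_trip_table(records, categorical=["speed_bucket"])
X_train, y_train, X_test, y_test = split_and_impute(trips, labels)
print(trips.filter(like="speed_").round(2))
```

Each statistic becomes its own column per feature, which is how aggregation expands the feature count. Trip 3 has a single record, so its standard deviation is missing until imputation fills it.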
In this challenge, I opted to use the XGBoost Classifier (a type of gradient boosted decision tree algorithm) to tackle the non-linearity in the data.
- To check that the model performs consistently across the entire dataset, I used k-fold cross-validation and took the model's mean score across the folds.
- The model was observed to be poor at identifying dangerous driving behaviors. Several factors could have led to this issue:
- Imbalanced Class Distribution
- Insufficient Features to distinguish dangerous driving from safe driving
- Based on the feature importance chart, speed and gyro magnitude seem to be the driving factors in determining safe vs. unsafe driving.