This repository contains the code for two pipelines built using ZenML and GCP. The pipelines train a machine learning model that predicts hotel booking demand and perform batch inference.
Cancellations in hotel booking reservations can result in a significant annual loss for hotels. It would be of great help for hotels to have a system to predict whether a booking will get cancelled so they can offer the room to another customer or plan accordingly. Moreover, the system needs to be simple to implement and have the capability to be updated continuously as new data becomes available.
For this project, I’m using the Hotel booking demand dataset. It contains booking data for two hotels over almost two years, where each booking either effectively arrived or was cancelled. The dataset is relatively small (36 columns and 119,390 rows) but was definitely useful for creating a proof of concept for a system that predicts booking cancellations.
To solve this problem, I set out to create a machine learning model to predict hotel booking demand using historical data. This model is continuously trained (CT) and deployed (CD) using two pipelines: training and batch prediction. The training pipeline trains the model as new historic data is available and the batch prediction pipeline stores the predictions in a database. The pipelines both run independently and can be triggered automatically. In addition, I implemented a Continuous Integration (CI) strategy to check the quality of the Python code using GitHub Actions.
This project has the following components:
- Hotel bookings database stored in a Cloud Storage bucket.
- Model registry stored in a Cloud Storage bucket.
- The training pipeline runs on Compute Engine. It trains and stores the model (Continuous Deployment) on the model registry.
- The batch prediction pipeline runs on Compute Engine. It fetches the latest model from the model registry and computes predictions for the hotel bookings database.
Also, it has a ZenML ML stack with the following components:
- Sklearn: Machine learning library to train the model.
- MLFlow: Experiment tracker.
- Deepchecks: Data validation library to validate the training data.
This project consists of two pipelines: one that trains the model and a second that performs batch inference. The pipelines are divided into steps and can be found in the `pipelines` directory. Similarly, the steps can be found in the `steps` directory. Both pipelines can be run independently using the `run_training_pipeline.py` and `run_batch_prediction_pipeline.py` scripts (for more information, see section Running the pipelines locally).
The training pipeline performs the following steps:
- Load data: Fetches the train data from the hotel bookings database.
- Clean data: Removes unwanted columns and enforces the correct column type.
- Validate data: Runs Deepchecks to validate the data.
- Split data into train and test: Divides the training data into train and test sets.
- Train model: Creates a Sklearn Pipeline that transforms the data and trains a classification model that predicts whether the hotel booking will get cancelled. Also, it logs the parameters using MLFlow.
- Evaluate model: Uses the accuracy metric to assess the model performance with test data. It logs the results using MLFlow.
- Evaluate deployment: Assesses whether the model performance is greater than 70%.
- Deploy model: If the model performs well, then the pipeline stores it as a `.pkl` file in the model registry bucket.
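The train/evaluate/deployment-gate steps above can be sketched with plain scikit-learn. This is a minimal illustration, not the project's actual code: the column names, classifier choice, and toy data are all assumptions.

```python
# Hypothetical sketch of the "Train model", "Evaluate model" and
# "Evaluate deployment" steps. Column names are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the cleaned hotel bookings data.
df = pd.DataFrame({
    "lead_time": [10, 200, 5, 120, 30, 300, 15, 90],
    "hotel": ["city", "resort", "city", "city",
              "resort", "resort", "city", "resort"],
    "is_canceled": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="is_canceled")
y = df["is_canceled"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Sklearn Pipeline: transform the columns, then fit a classifier.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["lead_time"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["hotel"]),
    ])),
    ("clf", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)

# Evaluate with accuracy; only deploy if it exceeds 70%.
accuracy = model.score(X_test, y_test)
deploy = accuracy > 0.70
```

In the real pipeline, the parameters and the accuracy would additionally be logged with MLflow, and the fitted `Pipeline` pickled to the model registry bucket.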
The batch inference pipeline performs the following steps:
- Load data: Fetches the hotel booking data for inference.
- Clean data: Removes unwanted columns and enforces the correct column type.
- Fetch model: Obtains the best model from the model registry bucket.
- Get predictions: Computes the model’s predictions.
- Store predictions: Loads the predictions file into the hotel booking database.
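The fetch-model/predict/store flow can be sketched locally as follows. A local directory stands in for the model registry bucket and a stub classifier stands in for a deployed model; all names here are illustrative assumptions, not the project's actual layout.

```python
# Hypothetical local sketch of the batch inference steps.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.dummy import DummyClassifier

# Stand-in for gs://hotel-booking-prediction/models.
REGISTRY = Path("models")
REGISTRY.mkdir(exist_ok=True)

# Simulate a model already stored in the registry as a .pkl file.
stub = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 0])
(REGISTRY / "model.pkl").write_bytes(pickle.dumps(stub))

# Fetch model: pick the newest .pkl from the registry.
latest = max(REGISTRY.glob("*.pkl"), key=lambda p: p.stat().st_mtime)
model = pickle.loads(latest.read_bytes())

# Get predictions for a toy inference batch.
batch = pd.DataFrame({"lead_time": [12, 250, 40]})
preds = pd.DataFrame({"prediction": model.predict(batch[["lead_time"]])})

# Store predictions: in the real pipeline this .csv is uploaded
# back to the hotel booking database bucket.
preds.to_csv("predictions.csv", index=False)
```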
Before running the pipelines locally, you need to set up the GCP infrastructure and the ZenML stack. Make sure you have the gcloud CLI installed.
To create the GCP infrastructure:
- Create the Cloud Storage bucket that contains both the database and the model registry:

  ```shell
  gsutil mb -l <REGION> gs://hotel-booking-prediction
  ```
- Inside the Cloud Storage bucket, create the following `data` structure:
  - `full`: Contains the full dataset (or database) and predictions as `.csv` files.
  - `models`: Contains the model registry of the models.
  - `test`: Contains the test data. This is a hold-out set for testing the model before deploying (work in progress).
  - `train`: Contains the train data. This data is further split into train and test for training the model.
- Generate a key (`.json`) file for authenticating to GCP:

  ```shell
  gcloud iam service-accounts keys create <FILE-NAME>.json --iam-account=<SA-NAME>@<PROJECT_ID>.iam.gserviceaccount.com
  ```

  Note: Store this file in the root directory of this project.
To set up the ZenML stack:
- Create a Python virtual environment:

  ```shell
  python -m venv hotel-prediction
  ```

- Activate the virtual environment:

  ```shell
  source hotel-prediction/bin/activate
  ```

- Install ZenML:

  ```shell
  pip install zenml
  ```

- Install all the required libraries:

  ```shell
  pip install -r requirements.txt
  ```

- Install the required ZenML integrations:

  ```shell
  zenml integration install sklearn mlflow deepchecks -y
  ```

- Initialize ZenML:

  ```shell
  zenml init && zenml up
  ```

- Register the required ZenML stack components:

  ```shell
  zenml data-validator register deepchecks --flavor=deepchecks
  zenml experiment-tracker register mlflow_tracker --flavor=mlflow
  ```

- Register the new ZenML stack:

  ```shell
  zenml stack register quickstart_stack \
    -a default \
    -o default \
    -e mlflow_tracker \
    -dv deepchecks \
    --set
  ```
Finally, to run either of the pipelines:

```shell
python run_training_pipeline.py
# or
python run_batch_prediction_pipeline.py
```
This is work in progress; for more information, see this PR.
The next steps for this project are the following:
- Run the pipelines in Vertex AI.
- Use ZenML’s Stack Recipes to create the GCP infrastructure.
- Implement data drift detection using Deepchecks.
- Create a dashboard in Data Studio to visualise the predictions.
- Improve the baseline model: During the data validation step, Deepchecks detected that the data had conflicting labels and about 32% duplicate data.
- Load the hotel bookings data to a relational database or data warehouse (BigQuery).
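As a starting point for that cleanup, the duplicate and conflicting-label checks that Deepchecks flagged can be reproduced with plain pandas. The data and column names below are toy illustrations, not the project's real figures.

```python
# Hypothetical sketch: measure exact duplicates and conflicting labels.
import pandas as pd

df = pd.DataFrame({
    "lead_time": [10, 10, 10, 55],
    "hotel": ["city", "city", "city", "resort"],
    # Rows 0 and 2 share identical features but disagree on the label.
    "is_canceled": [0, 0, 1, 1],
})
features = ["lead_time", "hotel"]

# Share of rows that are exact duplicates of an earlier row.
dup_share = df.duplicated().mean()

# Number of feature combinations that map to more than one label.
conflicts = df.groupby(features)["is_canceled"].nunique().gt(1).sum()
```

Dropping exact duplicates and resolving (or removing) the conflicting groups before training should give the baseline model cleaner signal.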