This repository contains the code for two pipelines built using ZenML and GCP. The pipelines train a machine learning model that predicts hotel booking demand and perform batch inference.
Cancellations in hotel booking reservations can result in a significant annual loss for hotels. It would be of great help for hotels to have a system to predict whether a booking will get cancelled so they can offer the room to another customer or plan accordingly. Moreover, the system needs to be simple to implement and have the capability to be updated continuously as new data becomes available.
For this project, I’m using the Hotel booking demand dataset. It contains booking data for two hotels over almost two years, where each booking either effectively arrived or was cancelled. The dataset is relatively small (36 columns and 119,390 rows) but was definitely useful for creating a proof of concept for a system that predicts booking cancellations.
To solve this problem, I set out to create a machine learning model to predict hotel booking demand using historical data. This model is continuously trained (CT) and deployed (CD) using two pipelines: training and batch prediction. The training pipeline trains the model as new historic data is available and the batch prediction pipeline stores the predictions in a database. The pipelines both run independently and can be triggered automatically. In addition, I implemented a Continuous Integration (CI) strategy to check the quality of the Python code using GitHub Actions.
This project has the following components:
- Hotel bookings database stored in a Cloud Storage bucket.
- Model registry stored in a Cloud Storage bucket.
- The training pipeline runs on Compute Engine. It trains and stores the model (Continuous Deployment) on the model registry.
- The batch prediction pipeline runs on Compute Engine. It fetches the latest model from the model registry and computes predictions for the hotel bookings database.
Also, it has a ZenML ML stack with the following components:
- Sklearn: Machine learning library to train the model.
- MLFlow: Experiment tracker.
- Deepchecks: Data validation library to validate the training data.
This project consists of two pipelines: one that trains the model and a second that performs batch inference. The pipelines are divided into steps and can be found in the `pipelines` directory. Similarly, the steps can be found in the `steps` directory. Both pipelines can be run independently using the `run_training_pipeline.py` and `run_batch_prediction_pipeline.py` scripts (for more information, see section Running the pipelines locally).
The training pipeline performs the following steps:
- Load data: Fetches the train data from the hotel bookings database.
- Clean data: Removes unwanted columns and enforces the correct column type.
- Validate data: Runs Deepchecks to validate the data.
- Split data into train and test: Divides the training data into train and test sets.
- Train model: Creates a Sklearn Pipeline that transforms the data and trains a classification model that predicts whether the hotel booking will get cancelled. Also, it logs the parameters using MLFlow.
- Evaluate model: Uses the accuracy metric to assess the model performance with test data. It logs the results using MLFlow.
- Evaluate deployment: Assesses whether the model performance is greater than 70%.
- Deploy model: If the model performs well, then the pipeline stores it as a `.pkl` file in the model registry bucket.
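The train/evaluate/deployment-gate steps above can be sketched with plain scikit-learn. This is a minimal illustration, not the project's actual code: the column names, classifier choice, and toy data are all assumptions.

```python
# Hypothetical sketch of the "Train model", "Evaluate model" and
# "Evaluate deployment" steps. Column names are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the cleaned hotel bookings data.
df = pd.DataFrame({
    "lead_time": [10, 200, 5, 120, 30, 300, 15, 90],
    "hotel": ["city", "resort", "city", "city",
              "resort", "resort", "city", "resort"],
    "is_canceled": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="is_canceled")
y = df["is_canceled"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Sklearn Pipeline: transform the columns, then fit a classifier.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["lead_time"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["hotel"]),
    ])),
    ("clf", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)

# Evaluate with accuracy; only deploy if it exceeds 70%.
accuracy = model.score(X_test, y_test)
deploy = accuracy > 0.70
```

In the real pipeline, the parameters and the accuracy would additionally be logged with MLflow, and the fitted `Pipeline` pickled to the model registry bucket.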
The batch inference pipeline performs the following steps:
- Load data: Fetches the hotel booking data for inference.
- Clean data: Removes unwanted columns and enforces the correct column type.
- Fetch model: Obtains the best model from the model registry bucket.
- Get predictions: Computes the model’s predictions.
- Store predictions: Loads the predictions file into the hotel booking database.
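The fetch-model/predict/store flow can be sketched locally as follows. A local directory stands in for the model registry bucket and a stub classifier stands in for a deployed model; all names here are illustrative assumptions, not the project's actual layout.

```python
# Hypothetical local sketch of the batch inference steps.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.dummy import DummyClassifier

# Stand-in for gs://hotel-booking-prediction/models.
REGISTRY = Path("models")
REGISTRY.mkdir(exist_ok=True)

# Simulate a model already stored in the registry as a .pkl file.
stub = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 0])
(REGISTRY / "model.pkl").write_bytes(pickle.dumps(stub))

# Fetch model: pick the newest .pkl from the registry.
latest = max(REGISTRY.glob("*.pkl"), key=lambda p: p.stat().st_mtime)
model = pickle.loads(latest.read_bytes())

# Get predictions for a toy inference batch.
batch = pd.DataFrame({"lead_time": [12, 250, 40]})
preds = pd.DataFrame({"prediction": model.predict(batch[["lead_time"]])})

# Store predictions: in the real pipeline this .csv is uploaded
# back to the hotel booking database bucket.
preds.to_csv("predictions.csv", index=False)
```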
Before running the pipelines locally, you need to set up the GCP infrastructure and the ZenML stack. Make sure you have the gcloud CLI installed.
To create the GCP infrastructure:
- Create the Cloud Storage bucket that contains both the database and the model registry:

  ```shell
  gsutil mb -l <REGION> gs://hotel-booking-prediction
  ```
- Inside the Cloud Storage bucket, create the following `data` structure:
  - `full`: Contains the full dataset (or database) and predictions as `.csv` files.
  - `models`: Contains the model registry of the models.
  - `test`: Contains the test data. This is a hold-out set for testing the model before deploying (work in progress).
  - `train`: Contains the train data. This data is further split into train and test for training the model.
- Generate a key (`.json`) file for authenticating to GCP:

  ```shell
  gcloud iam service-accounts keys create <FILE-NAME>.json --iam-account=<SA-NAME>@<PROJECT_ID>.iam.gserviceaccount.com
  ```

  Note: Store this file in the root directory of this project.
To set up the ZenML stack:
- Create a Python virtual environment:

  ```shell
  python -m venv hotel-prediction
  ```

- Activate the virtual environment:

  ```shell
  source hotel-prediction/bin/activate
  ```

- Install ZenML:

  ```shell
  pip install zenml
  ```

- Install all the required libraries:

  ```shell
  pip install -r requirements.txt
  ```

- Install the required ZenML integrations:

  ```shell
  zenml integration install sklearn mlflow deepchecks -y
  ```

- Initialize ZenML:

  ```shell
  zenml init && zenml up
  ```

- Register the required ZenML stack components:

  ```shell
  zenml data-validator register deepchecks --flavor=deepchecks
  zenml experiment-tracker register mlflow_tracker --flavor=mlflow
  ```

- Register the new ZenML stack:

  ```shell
  zenml stack register quickstart_stack \
    -a default \
    -o default \
    -e mlflow_tracker \
    -dv deepchecks \
    --set
  ```
Finally, to run either of the pipelines:

```shell
python run_training_pipeline.py
# or
python run_batch_prediction_pipeline.py
```
This is work in progress; for more information, see this PR.
The next steps for this project are the following:
- Run the pipelines in Vertex AI.
- Use ZenML’s Stack Recipes to create the GCP infrastructure.
- Implement data drift detection using Deepchecks.
- Create a dashboard in Data Studio to visualise the predictions.
- Improve the baseline model: During the data validation step, Deepchecks detected that the data had conflicting labels and about 32% duplicate data.
- Load the hotel bookings data to a relational database or data warehouse (BigQuery).
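As a starting point for that cleanup, the duplicate and conflicting-label checks that Deepchecks flagged can be reproduced with plain pandas. The data and column names below are toy illustrations, not the project's real figures.

```python
# Hypothetical sketch: measure exact duplicates and conflicting labels.
import pandas as pd

df = pd.DataFrame({
    "lead_time": [10, 10, 10, 55],
    "hotel": ["city", "city", "city", "resort"],
    # Rows 0 and 2 share identical features but disagree on the label.
    "is_canceled": [0, 0, 1, 1],
})
features = ["lead_time", "hotel"]

# Share of rows that are exact duplicates of an earlier row.
dup_share = df.duplicated().mean()

# Number of feature combinations that map to more than one label.
conflicts = df.groupby(features)["is_canceled"].nunique().gt(1).sum()
```

Dropping exact duplicates and resolving (or removing) the conflicting groups before training should give the baseline model cleaner signal.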