This directory contains code samples that demonstrate how to implement a low-latency item-to-item recommendation solution by training and serving embeddings that enable real-time similarity matching. The foundations of the solution are BigQuery and ScaNN, an open source library for efficient vector similarity search at scale.
The series is for data scientists and ML engineers who want to build a system for training and serving embeddings for item-to-item recommendation use cases. It assumes that you have experience with Google Cloud, BigQuery, AI Platform, Dataflow, and Datastore, as well as with TensorFlow and TFX pipelines.
There are two variants of the solution:
- The first variant utilizes generally available releases of BigQuery and AI Platform together with open source components including ScaNN and Kubeflow Pipelines. To use this variant, follow the instructions in the Production variant section.
- The second variant is a fully managed solution that leverages the experimental releases of AI Platform Pipelines and the ANN service. To use this variant, follow the instructions in the Experimental variant section.
We use the public `bigquery-samples.playlists` BigQuery dataset to demonstrate the solutions. We use the playlist data to learn embeddings for songs based on their co-occurrences in different playlists. The learned embeddings can be used to match and recommend relevant songs to a given song or playlist.
At a high level, the solution works as follows:
- Computes pointwise mutual information (PMI) between items based on their co-occurrences.
- Trains item embeddings using BigQuery ML Matrix Factorization, with item PMI as implicit feedback.
- Exports the embeddings from the BigQuery ML model using Dataflow, post-processes them into CSV files, and writes them to Cloud Storage.
- Implements an embedding lookup model using TensorFlow Keras, and then deploys it to AI Platform Prediction.
- Serves the embeddings as an approximate nearest neighbor index on AI Platform Prediction for real-time similar items matching.
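To make the first two steps concrete, here is a minimal sketch of what training the matrix factorization model on PMI scores can look like when driven from Python with the BigQuery client library. The dataset, table, model, and column names (`recommendations`, `item_cooc`, `item1_Id`, and so on) are placeholders for illustration; the actual names and the PMI computation itself are defined by the solution's notebooks and stored procedures.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# Train a matrix factorization model that treats the PMI score of an item pair
# as implicit feedback. Table, model, and column names are placeholders.
train_model_sql = """
CREATE OR REPLACE MODEL `recommendations.item_matching_model`
OPTIONS(
  model_type = 'matrix_factorization',
  feedback_type = 'implicit',
  user_col = 'item1_Id',
  item_col = 'item2_Id',
  rating_col = 'score',
  num_factors = 50
)
AS
-- score is the pointwise mutual information of the pair:
-- PMI(i, j) = log(P(i, j) / (P(i) * P(j))), computed from co-occurrence counts.
SELECT item1_Id, item2_Id, score
FROM `recommendations.item_cooc`;
"""

client.query(train_model_sql).result()  # blocks until training completes
```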
For a detailed description of the solution architecture, see Architecture of a machine learning system for item matching.
The solution uses the following billable components of Google Cloud:
- AI Platform Notebooks
- AI Platform Pipelines
- AI Platform Prediction
- AI Platform Training
- Artifact Registry
- BigQuery
- Cloud Build
- Cloud Storage
- Dataflow
- Datastore
To learn about Google Cloud pricing, use the Pricing Calculator to generate a cost estimate based on your projected usage.
You can run the solution step-by-step, or you can run it by using a TFX pipeline.
- Complete the steps in Set up the GCP environment.
- Complete the steps in Set up the AI Platform Notebooks environment.
- In the JupyterLab environment of the `embeddings-notebooks` instance, open the file browser pane and navigate to the `analytics-componentized-patterns/retail/recommendation-system/bqml-scann` directory.
- Run the `00_prep_bq_and_datastore.ipynb` notebook to import the `playlist` dataset, create the `vw_item_groups` view with song and playlist data, and export song title and artist information to Datastore.
- Run the `00_prep_bq_procedures` notebook to create stored procedures needed by the solution.
- Run the `01_train_bqml_mf_pmi.ipynb` notebook. This covers computing item co-occurrences using PMI, and then training a BigQuery ML matrix factorization model to generate item embeddings.
- Run the `02_export_bqml_mf_embeddings.ipynb` notebook. This covers using Dataflow to request the embeddings from the matrix factorization model, format them as CSV files, and export them to Cloud Storage.
- Run the `03_create_embedding_lookup_model.ipynb` notebook. This covers creating a TensorFlow Keras model to wrap the item embeddings, exporting that model as a SavedModel, and deploying that SavedModel to act as an item-embedding lookup.
- Run the `04_build_embeddings_scann.ipynb` notebook. This covers building an approximate nearest neighbor index for the embeddings using ScaNN and AI Platform Training, then exporting the ScaNN index to Cloud Storage (a minimal sketch of this index build follows these steps).
- Run the `05_deploy_lookup_and_scann_caip.ipynb` notebook. This covers deploying the embedding lookup model and ScaNN index (wrapped in a Flask app to add functionality) created by the solution.
- If you don't want to keep the resources you created for this solution, complete the steps in Delete the GCP resources.
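The following is a minimal sketch of the ScaNN index build performed in `04_build_embeddings_scann.ipynb`, assuming the item embeddings have already been loaded into a NumPy array. The file path and all tuning parameters are illustrative placeholders; the notebook derives its own values from the exported CSV files and the dataset size.

```python
import numpy as np
import scann

# Load the exported item embeddings; in the solution these come from the CSV
# files that the Dataflow job wrote to Cloud Storage (the path is a placeholder).
embeddings = np.load("item_embeddings.npy")
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Build a partitioned, quantized index for dot-product search. The tuning
# values here are illustrative, not the ones used by the notebook.
searcher = (
    scann.scann_ops_pybind.builder(embeddings, 10, "dot_product")
    .tree(num_leaves=500, num_leaves_to_search=50, training_sample_size=100000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

# Query with a single item embedding to get its approximate nearest neighbors.
neighbor_ids, similarities = searcher.search(embeddings[0])

# Persist the index so it can be copied to Cloud Storage and served later.
searcher.serialize("./index")
```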
In addition to the manual steps outlined above, we provide a TFX pipeline that automates the process of building and deploying the solution. To run the solution by using the TFX pipeline, follow these steps:
- Complete the steps in Set up the GCP environment.
- Complete the steps in Set up the AI Platform Notebooks environment.
- In the JupyterLab environment of the `embeddings-notebooks` instance, open the file browser pane and navigate to the `analytics-componentized-patterns/retail/recommendation-system/bqml-scann` directory.
- Run the `00_prep_bq_and_datastore.ipynb` notebook to import the `playlist` dataset, create the `vw_item_groups` view with song and playlist data, and export song title and artist information to Datastore.
- Run the `00_prep_bq_procedures` notebook to create stored procedures needed by the solution.
- Run the `tfx01_interactive.ipynb` notebook. This covers creating and running a TFX pipeline that runs the solution, which includes all of the tasks mentioned in the step-by-step notebooks above (a minimal sketch of running a component interactively follows these steps).
- Run the `tfx02_deploy_run.ipynb` notebook. This covers deploying the TFX pipeline, including building a Docker container image, compiling the pipeline, and deploying the pipeline to AI Platform Pipelines.
- Run the `05_deploy_lookup_and_scann_caip.ipynb` notebook. This covers deploying the embedding lookup model and ScaNN index (wrapped in a Flask app to add functionality) created by the solution.
- If you don't want to keep the resources you created for this solution, complete the steps in Delete the GCP resources.
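As a quick orientation for the TFX route, the `tfx01_interactive.ipynb` notebook executes the pipeline's components one at a time with an `InteractiveContext`. Below is a minimal, self-contained sketch of that pattern using a trivial stand-in component; the real component names, parameters, and logic are defined in the notebook.

```python
from tfx.dsl.component.experimental.annotations import Parameter
from tfx.dsl.component.experimental.decorators import component
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext


# A trivial stand-in component; the notebook defines the real ones
# (PMI computation, BigQuery ML training, embedding export, and so on).
@component
def log_dataset_name(bq_dataset: Parameter[str]):
  print(f"Running against BigQuery dataset: {bq_dataset}")


# The InteractiveContext executes components one at a time and records their
# runs in ML Metadata, which is how the notebook exercises each pipeline step.
context = InteractiveContext(pipeline_name="bqml-scann-interactive")
context.run(log_dataset_name(bq_dataset="recommendations"))
```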
Before running the solution, you must complete the following steps to prepare an appropriate environment:
- Create and configure a GCP project.
- Create the GCP resources you need. Before creating the resources, consider what regions you want to use. Creating resources in the same region or multi-region (like US or EU) can reduce latency and improve performance.
- Clone this repo to the AI Platform Notebooks environment.
- Install the solution requirements on the notebook environment.
- Add the sample dataset and some stored procedures to BigQuery.
- In the Cloud Console, on the project selector page, select or create a Cloud project.
- Make sure that billing is enabled for your Cloud project.
- Enable the Compute Engine, Dataflow, Datastore, AI Platform, AI Platform Notebooks, Artifact Registry, Identity and Access Management, Cloud Build, BigQuery, and BigQuery Reservations APIs.
If you use on-demand pricing for BigQuery, you must purchase flex slots and then create reservations and assignments for them in order to train a matrix factorization model. You can skip this section if you use flat-rate pricing with BigQuery.
You must have the `bigquery.reservations.create` permission in order to purchase flex slots. This permission is granted to the project owner, and also to the `bigquery.admin` and `bigquery.resourceAdmin` predefined Identity and Access Management roles.
- In the BigQuery console, click Reservations.
- On the Reservations page, click Buy Slots.
- On the Buy Slots page, set the options as follows:
  - In Commitment duration, choose Flex.
  - In Location, choose the region you want to use for BigQuery. Depending on the region you choose, you may have to request additional slot quota.
  - In Number of slots, choose 500.
  - Click Next.
  - In Purchase confirmation, type `CONFIRM`. Note: The console displays an estimated monthly cost of $14,600.00. You will delete the unused slots at the end of this tutorial, so you will only pay for the slots you use to train the model. Training the model takes approximately 2 hours.
- Click Purchase.
- Click View Slot Commitments.
- Allow up to 20 minutes for the capacity to be provisioned. After the capacity is provisioned, the slot commitment status turns green and shows a checkmark.
- Click Create Reservation.
- On the Create Reservation page, set the options as follows:
  - In Reservation name, type `model`.
  - In Location, choose the region in which you purchased the flex slots.
  - In Number of slots, type `500`.
  - Click Save. This returns you to the Reservations page.
- Select the Assignments tab.
- In Select an organization, folder, or project, click Browse.
- Type the name of the project you are using.
- Click Select.
- In Reservation, choose the model reservation you created.
- Click Create.
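If you prefer to script the flex slot setup instead of using the console, the same commitment, reservation, and assignment can be created with the BigQuery Reservation API. The following is a minimal sketch assuming the `google-cloud-bigquery-reservation` Python client library; the project ID and location are placeholders, and the calls incur the same charges as the console steps above.

```python
from google.cloud import bigquery_reservation_v1 as bq_reservation

# Placeholders: set these to your project and the BigQuery region you chose.
project = "your-project-id"
location = "US"
parent = f"projects/{project}/locations/{location}"

client = bq_reservation.ReservationServiceClient()

# 1. Purchase a 500-slot flex commitment (billed until you delete it).
client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=bq_reservation.CapacityCommitment(
        plan=bq_reservation.CapacityCommitment.CommitmentPlan.FLEX,
        slot_count=500,
    ),
)

# 2. Create the "model" reservation from the committed slots.
reservation = client.create_reservation(
    parent=parent,
    reservation_id="model",
    reservation=bq_reservation.Reservation(slot_capacity=500),
)

# 3. Assign the project's query jobs to the reservation.
client.create_assignment(
    parent=reservation.name,
    assignment=bq_reservation.Assignment(
        assignee=f"projects/{project}",
        job_type=bq_reservation.Assignment.JobType.QUERY,
    ),
)
```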
Create a Firestore in Datastore Mode database instance to store song title and artist information for lookup.
- Open the Datastore console.
- Click Select Datastore Mode.
- For Select a location, choose the region you want to use for Datastore.
- Click Create Database.
Create a Cloud Storage bucket to store the following objects:
- The SavedModel files for the models created in the solution.
- The temp files created by the Dataflow pipeline that processes the song embeddings.
- The CSV files for the processed embeddings.
- Open the Cloud Storage console.
- Click Create Bucket.
- For Name your bucket, type a bucket name. The name must be globally unique.
- For Choose where to store your data, select Region and then choose the region you want to use for Cloud Storage.
- Click Create.
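If you prefer to create the bucket programmatically, the following is a minimal sketch using the `google-cloud-storage` Python client library; the project ID, bucket name, and region are placeholders.

```python
from google.cloud import storage

# Placeholders: pick a globally unique bucket name and the region you chose.
client = storage.Client(project="your-project-id")
bucket = client.create_bucket("your-unique-bucket-name", location="us-central1")
print(f"Created bucket {bucket.name} in {bucket.location}")
```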
Create an AI Platform Notebooks instance to run the notebooks that walk you through using the solution.
- Open the AI Platform Notebooks console.
- Click New Instance.
- Choose TensorFlow Enterprise 2.3, Without GPUs.
- For Instance name, type `embeddings-notebooks`.
- For Region, choose the region you want to use for the AI Platform Notebooks instance.
- Click Create. It takes a few minutes for the notebook instance to be created.
- Open the Cloud Build settings page.
- In the service account list, find the row for Compute Engine and change the Status column value to Enabled.
Grant the Security Admin IAM role to the Compute Engine service account. This is required so that this account can later set up other service accounts needed by the solution.
- Open the IAM permissions page.
- In the members list, find the row for `<projectNumber>[email protected]` and click Edit.
- Click Add another role.
- In Select a role, choose IAM and then choose Security Admin.
- Click Save.
Create an AI Platform Pipelines instance to run the TensorFlow Extended (TFX) pipeline that automates the solution workflow. You can skip this step if you are running the solution using the step-by-step notebooks.
Create a Cloud SQL instance to provide managed storage for the pipeline.
- Open the Cloud SQL console.
- Click Create Instance.
- On the MySQL card, click Choose MySQL.
- For Instance ID, type `pipeline-db`.
- For Root Password, type in the password you want to use for the root user.
- For Region, type in the region you want to use for the database instance.
- Click Create.
- In the AI Platform Pipelines toolbar, click New instance. Kubeflow Pipelines opens in Google Cloud Marketplace.
- Click Configure. The Deploy Kubeflow Pipelines form opens.
- For Cluster zone, choose a zone in the region you want to use for AI Platform Pipelines.
- Check Allow access to the following Cloud APIs to grant applications that run on your GKE cluster access to Google Cloud resources. By checking this box, you are granting your cluster access to the `https://www.googleapis.com/auth/cloud-platform` access scope. This access scope provides full access to the Google Cloud resources that you have enabled in your project. Granting your cluster access to Google Cloud resources in this manner saves you the effort of creating and managing a service account or creating a Kubernetes secret.
- Click Create cluster. This step may take several minutes.
- Select Create a namespace in the Namespace drop-down list, and type `kubeflow-pipelines` in New namespace name. To learn more about namespaces, read a blog post about organizing Kubernetes with namespaces.
- In the App instance name box, type `kubeflow-pipelines`.
- Select Use managed storage and supply the following information:
  - Artifact storage Cloud Storage bucket: Specify the name of the bucket you created in the "Create a Cloud Storage bucket" procedure.
  - Cloud SQL instance connection name: Specify the connection name for the Cloud SQL instance you created in the "Create a Cloud SQL instance" procedure. The instance connection name can be found on the instance detail page in the Cloud SQL console.
  - Database username: Leave this field empty to default to root.
  - Database password: Specify the root user password for the Cloud SQL instance you created in the "Create a Cloud SQL instance" procedure.
  - Database name prefix: Type `embeddings`.
- Click Deploy. This step may take several minutes.
You use notebooks to complete the prerequisites and then run the solution. To use the notebooks, you must clone the solution's GitHub repo to your AI Platform Notebooks JupyterLab instance.
- Click Open JupyterLab for the `embeddings-notebooks` instance.
- In the Other section of the JupyterLab Launcher, click Terminal.
- In the terminal, run the following command to clone the `analytics-componentized-patterns` GitHub repository:

  `git clone https://github.com/GoogleCloudPlatform/analytics-componentized-patterns.git`

- In the terminal, run the following command to install packages required by the solution:

  `pip install -r analytics-componentized-patterns/retail/recommendation-system/bqml-scann/requirements.txt`
Unless you plan to continue using the resources you created in this solution, you should delete them to avoid incurring charges to your GCP account. You can either delete the project containing the resources, or keep the project but delete just those resources.
Either way, you should remove the resources so you won't be billed for them in the future. The following sections describe how to delete these resources.
The easiest way to eliminate billing is to delete the project you created for the solution.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
If you don't want to delete the project, delete the billable components of the solution. These can include:
- A BigQuery assignment, reservation, and remaining flex slots (if you chose to use flex slots to train the matrix factorization model)
- A BigQuery dataset
- Several Cloud Storage buckets
- Datastore entities
- An AI Platform Notebooks instance
- AI Platform models
- A Kubernetes Engine cluster (if you used a pipeline for automation)
- An AI Platform pipeline (if you used a pipeline for automation)
- A Cloud SQL instance (if you used a pipeline for automation)
- A Container Registry image (if you used a pipeline for automation)
The experimental variant of the solution utilizes the new AI Platform ANN service and AI Platform (Unified) Pipelines. Note that both services are currently in the Experimental stage and that the provided examples may have to be updated when the services move to Preview and eventually to General Availability. Setting up the managed ANN service is described in the `ann_setup.md` file.
Note: To use the Experimental releases of AI Platform Pipelines and the ANN service, you need to allow-list your project and user account. Please contact your Google representative for more information and support.
- Compute pointwise mutual information (PMI) between items based on their co-occurrences.
- Train item embeddings using BigQuery ML Matrix Factorization, with item PMI as implicit feedback.
- Post-process and export the embeddings from the BigQuery ML matrix factorization model to Cloud Storage as JSONL files.
- Create an approximate nearest neighbor search index using the ANN service and the exported embedding files.
- Deploy the index to an ANN service endpoint.
Note that the first two steps are the same as in the ScaNN library-based solution.
We provide an example TFX pipeline that automates the process of training the embeddings and deploying the index. The pipeline is designed to run on AI Platform (Unified) Pipelines and relies on features introduced in v0.25 of TFX. Each step of the pipeline is implemented as a TFX Custom Python function component, as sketched after this paragraph. All steps and their inputs and outputs are tracked in the AI Platform (Unified) ML Metadata service.
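As an illustration of this pattern, the sketch below shows a custom Python function component in the style used by the pipeline. The component name, parameters, stored procedure name, and metadata key are hypothetical placeholders, not the exact ones used by the pipeline.

```python
from tfx.dsl.component.experimental.annotations import OutputArtifact, Parameter
from tfx.dsl.component.experimental.decorators import component
from tfx.types import standard_artifacts


@component
def compute_pmi(
    project_id: Parameter[str],
    bq_dataset: Parameter[str],
    min_item_frequency: Parameter[int],
    item_cooc: OutputArtifact[standard_artifacts.Dataset],
):
  """Hypothetical step that calls a PMI stored procedure in BigQuery and
  records the resulting co-occurrence table in the output artifact."""
  from google.cloud import bigquery

  client = bigquery.Client(project=project_id)
  # sp_ComputePMI and its argument list are placeholders for illustration.
  client.query(f"CALL `{bq_dataset}.sp_ComputePMI`({min_item_frequency})").result()
  item_cooc.set_string_custom_property("bq_table", f"{bq_dataset}.item_cooc")
```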
- `ann01_create_index.ipynb` - This notebook walks you through creating an ANN index, creating an ANN endpoint, and deploying the index to the endpoint. It also shows how to call the interfaces exposed by the deployed index.
- `ann02_run_pipeline.ipynb` - This notebook demonstrates how to create and test the TFX pipeline and how to submit pipeline runs to AI Platform (Unified) Pipelines.
Before experimenting with the notebooks, make sure that you have prepared the BigQuery environment and trained and extracted item embeddings using the procedures described in the ScaNN library-based solution.
If you have any questions or feedback, please open up a new issue.
Copyright 2020 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
This is not an official Google product but sample code provided for educational purposes.