feat(examples): Add RAGAS evaluation to RAG chain #50

Merged
102 changes: 102 additions & 0 deletions examples/genai-rag-multimodal/README.md
@@ -59,6 +59,108 @@ When running the Notebook, you will reach a step that downloads an example PDF f
- projects/200612033880 # Google Cloud Example Project
Review comment (Contributor): Maybe we could replace `200612033880` with `project_number`, explain what that number means, or even use `projects/00000000000`.

Reply (Contributor Author): In this case, project `200612033880` is a Google Cloud project that hosts the Storage bucket containing the PDF we are downloading.

```

## Deploying infrastructure using Machine Learning Infra Pipeline

### Required Permissions for pipeline Service Account

- Give `roles/compute.networkUser` to the Service Account that runs the Pipeline.

```bash
SERVICE_ACCOUNT=$(terraform -chdir="./gcp-projects/ml_business_unit/shared" output -json terraform_service_accounts | jq -r '."ml-machine-learning"')

gcloud projects add-iam-policy-binding <INSERT_HOST_VPC_NETWORK_PROJECT_HERE> --member="serviceAccount:$SERVICE_ACCOUNT" --role="roles/compute.networkUser"
```
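For readers less familiar with `jq`, the lookup in the command above can be sketched in Python. This is only an illustration: it assumes `terraform output -json terraform_service_accounts` prints a JSON object that maps repository names to service account emails, and the sample email and the `extract_service_account` helper are hypothetical.

```python
import json


def extract_service_account(terraform_output_json: str, key: str = "ml-machine-learning") -> str:
    """Return the service account stored under `key` in a JSON object."""
    accounts = json.loads(terraform_output_json)
    return accounts[key]


# Hypothetical sample of what the terraform output might look like.
sample_output = '{"ml-machine-learning": "sa-example@example-project.iam.gserviceaccount.com"}'

service_account = extract_service_account(sample_output)
# The gcloud command expects the member in the form serviceAccount:<email>.
print(f"serviceAccount:{service_account}")
```

A missing key raises `KeyError`, which is a useful signal that the repository name in the shared project output differs from `ml-machine-learning`.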

- Add the following ingress rule to the Service Perimeter.

```yaml
ingressPolicies:
- ingressFrom:
    identities:
    - serviceAccount:<SERVICE_ACCOUNT>
    sources:
    - accessLevel: '*'
  ingressTo:
    operations:
    - serviceName: '*'
    resources:
    - '*'
```
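The `<SERVICE_ACCOUNT>` placeholder in the rule above has to be replaced with the real email before the rule is applied. A minimal sketch of that substitution follows; the nesting of the YAML keys is an assumption about the intended layout, and `render_ingress_rule` and the sample email are hypothetical.

```python
# Template for the ingress rule; the indentation is an assumed reconstruction.
INGRESS_TEMPLATE = """\
ingressPolicies:
- ingressFrom:
    identities:
    - serviceAccount:{service_account}
    sources:
    - accessLevel: '*'
  ingressTo:
    operations:
    - serviceName: '*'
    resources:
    - '*'
"""


def render_ingress_rule(service_account: str) -> str:
    """Fill the service account email into the ingress rule template."""
    if "@" not in service_account:
        raise ValueError("expected a service account email address")
    return INGRESS_TEMPLATE.format(service_account=service_account)


# Hypothetical email, for illustration only.
print(render_ingress_rule("sa-example@example-project.iam.gserviceaccount.com"))
```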

### Deployment steps

**IMPORTANT:** The steps below assume your working directory is at the same level as `terraform-google-enterprise-genai/` and the other repos (`gcp-bootstrap`, `gcp-org`, `gcp-projects`...).

- Retrieve the ID of the project that hosts the Machine Learning Pipeline repository.

```bash
export INFRA_PIPELINE_PROJECT_ID=$(terraform -chdir="gcp-projects/ml_business_unit/shared/" output -raw cloudbuild_project_id)
echo ${INFRA_PIPELINE_PROJECT_ID}
```

- Clone the repository.

```bash
gcloud source repos clone ml-machine-learning --project=${INFRA_PIPELINE_PROJECT_ID}
```

- Navigate into the repo and check out the desired branch. Create the directories if they don't exist.

```bash
cd ml-machine-learning
git checkout -b development

mkdir -p ml_business_unit/development
mkdir -p modules
```

- Copy required files to the repository.

```bash
cp -R ../terraform-google-enterprise-genai/examples/genai-rag-multimodal ./modules
cp ../terraform-google-enterprise-genai/build/cloudbuild-tf-* .
cp ../terraform-google-enterprise-genai/build/tf-wrapper.sh .
chmod 755 ./tf-wrapper.sh

cat ../terraform-google-enterprise-genai/examples/genai-rag-multimodal/terraform.tfvars >> ml_business_unit/development/genai_example.auto.tfvars
cat ../terraform-google-enterprise-genai/examples/genai-rag-multimodal/variables.tf >> ml_business_unit/development/variables.tf
```

> NOTE: Make sure there are no variable name collisions between the variables in `terraform-google-enterprise-genai/examples/genai-rag-multimodal/variables.tf` and those already in `ml_business_unit/development/variables.tf`, and that your `terraform.tfvars` is updated with values from your environment.

- Validate that variables under `ml_business_unit/development/genai_example.auto.tfvars` are correct.

```bash
cat ml_business_unit/development/genai_example.auto.tfvars
```
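Because the earlier step appends one `variables.tf` to another, a duplicated `variable` block is easy to miss. The sketch below shows one way to spot collisions. It assumes variables are declared in the usual `variable "name" {` form at the start of a line; `find_duplicate_variables` is a hypothetical helper and the sample content is invented.

```python
import re
from collections import Counter
from typing import List

# Matches declarations such as: variable "network" {
VARIABLE_RE = re.compile(r'^\s*variable\s+"([^"]+)"', re.MULTILINE)


def find_duplicate_variables(tf_source: str) -> List[str]:
    """Return the sorted names of variables declared more than once."""
    counts = Counter(VARIABLE_RE.findall(tf_source))
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical contents of variables.tf after the append step.
sample = '''
variable "network" {
  type = string
}

variable "kms_key" {
  type = string
}

variable "network" {
  type = string
}
'''

print(find_duplicate_variables(sample))  # prints ['network']
```

To check the real file, pass `Path("ml_business_unit/development/variables.tf").read_text()` (with `from pathlib import Path`) instead of `sample`.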

- Create a file named `genai_example.tf` under the `ml_business_unit/development` path that calls the module.

```terraform
module "genai_example" {
  source = "../../modules/genai-rag-multimodal"

  kms_key                   = var.kms_key
  network                   = var.network
  subnet                    = var.subnet
  machine_learning_project  = var.machine_learning_project
  vector_search_vpc_project = var.vector_search_vpc_project
}
```

- Commit and push.

```bash
git add .
git commit -m "Add GenAI example"

git push origin development
```

## Deploying infrastructure using terraform locally

Run `terraform init && terraform apply -auto-approve`.

## Usage

Once all the requirements are set up, you can start by running and adjusting the notebook step-by-step.
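One of the notebook steps evaluates the chain with RAGAS and collects its inputs in a dict with four columns: `contexts`, `question`, `answer`, and `ground_truth`. A minimal sketch of a sanity check on that dict before it is converted to a dataset is shown below. It assumes each `contexts` entry is a list of plain strings; `validate_samples` is a hypothetical helper that is not part of the notebook or of RAGAS, and the sample row is a placeholder.

```python
REQUIRED_COLUMNS = ("contexts", "question", "answer", "ground_truth")


def validate_samples(samples: dict) -> int:
    """Check the column layout and return the number of samples."""
    missing = [key for key in REQUIRED_COLUMNS if key not in samples]
    if missing:
        raise KeyError(f"missing columns: {missing}")

    # Every column must hold one entry per question.
    lengths = {key: len(samples[key]) for key in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")

    # Each row's contexts must be a list of strings (assumed layout).
    for contexts in samples["contexts"]:
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            raise TypeError("each entry in 'contexts' must be a list of strings")

    return lengths["question"]


# Placeholder row, for illustration only.
example = {
    "contexts": [["retrieved text chunk", "summary of a retrieved image"]],
    "question": ["example question"],
    "answer": ["example answer produced by the chain"],
    "ground_truth": ["example reference answer"],
}

print(validate_samples(example))  # prints 1
```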
173 changes: 171 additions & 2 deletions examples/genai-rag-multimodal/multimodal_rag_langchain.ipynb
@@ -599,7 +599,7 @@
"\n",
"\n",
"# Image summaries\n",
-"img_base64_list, image_summaries = generate_img_summaries(\".\")"
+"img_base64_list, image_summaries = generate_img_summaries(\"./intro_multimodal_rag_old_version\")"
]
},
{
@@ -824,8 +824,17 @@
" for i, s in enumerate(text_summaries + table_summaries + image_summaries)\n",
"]\n",
"\n",
"retriever_multi_vector_img.docstore.mset(list(zip(doc_ids, doc_contents)))\n",
"list_of_docs = list(zip(doc_ids, doc_contents))\n",
"\n",
"retriever_multi_vector_img.docstore.mset(list_of_docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If using Vertex AI Vector Search, this will take a while to complete.\n",
"# You can cancel this cell and continue later.\n",
"retriever_multi_vector_img.vectorstore.add_documents(summary_docs)"
@@ -1000,6 +1009,166 @@
"Markdown(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RAGAS Evaluation\n",
"\n",
"In the cells below, we use RAGAS to evaluate the RAG pipeline for text-based context."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install ragas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"questions = [\n",
"    \"How did COVID-19 initially impact Google's advertising revenue in 2020?\",\n",
"    \"How did Google's advertising revenue recover from the initial COVID-19 impact?\",\n",
"    \"What was the primary driver of Google's operating cash flow in 2020?\",\n",
"    \"How did Google's share repurchases compare to the previous year in 2020?\"\n",
"]\n",
"\n",
"golden_answers = [\n",
"    \"COVID-19 initially impacted Google's advertising revenue in 2020 in two ways: users searched for less commercially driven topics, reducing the relevance and value of the ads displayed, and businesses cut back on advertising budgets due to the economic downturn caused by the pandemic.\",\n",
"    \"Google's advertising revenue recovered from the initial COVID-19 impact through a combination of factors: user search activity shifted back to more commercially driven topics, increasing the effectiveness of advertising, and, as the economic climate improved, businesses began to invest more heavily in advertising again.\",\n",
"    \"The primary driver of Google's operating cash flow in 2020 was revenue generated from its advertising products, totaling $91.7 billion.\",\n",
"    \"Google's share repurchases in 2020 were $50.3 billion, reflecting a significant increase of 62% compared to the prior year.\"\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summarize a retrieved base64 image so it can be used as text context in the evaluation.\n",
"def summarize_image_context(doc_base64):\n",
"    prompt = \"\"\"You are an assistant tasked with summarizing images for retrieval. \\\n",
"    These summaries will be embedded and used to retrieve the raw image. \\\n",
"    Give a concise summary of the image that is well optimized for retrieval.\n",
"    If it's a table, extract all elements of the table.\n",
"    If it's a graph, explain the findings in the graph.\n",
"    Do not include any numbers that are not mentioned in the image.\n",
"    \"\"\"\n",
"    return image_summarize(doc_base64, prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_samples = {\n",
"    \"contexts\": [],\n",
"    \"question\": [],\n",
"    \"answer\": [],\n",
"    \"ground_truth\": []\n",
"}\n",
"\n",
"for i, question in enumerate(questions):\n",
"    docs = retriever_multi_vector_img.get_relevant_documents(question, limit=10)\n",
"    image_contexts = []\n",
"\n",
"    source_docs = split_image_text_types(docs)\n",
"\n",
"    if len(source_docs[\"images\"]) > 0:\n",
"        for image in source_docs[\"images\"]:\n",
"            image_contexts.append(summarize_image_context(image))\n",
"\n",
"    text_context = source_docs[\"texts\"]\n",
"\n",
"    data_samples[\"contexts\"].append(text_context + image_contexts)\n",
"    data_samples[\"question\"].append(question)\n",
"    data_samples[\"answer\"].append(chain_multimodal_rag.invoke(question))\n",
"    data_samples[\"ground_truth\"].append(golden_answers[i])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import Dataset\n",
"\n",
"dataset = Dataset.from_dict(data_samples)\n",
"\n",
"\n",
"from ragas.metrics import (\n",
" context_precision,\n",
" answer_relevancy,\n",
" faithfulness,\n",
" context_recall,\n",
" answer_similarity,\n",
" answer_correctness,\n",
")\n",
"from ragas.metrics.critique import harmfulness\n",
"\n",
"# list of metrics we're going to use\n",
"metrics = [\n",
" faithfulness,\n",
" answer_relevancy,\n",
" context_recall,\n",
" context_precision,\n",
" harmfulness,\n",
" answer_similarity,\n",
" answer_correctness,\n",
"]\n",
"\n",
"from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings\n",
"\n",
"config = { \n",
" \"chat_model_id\": \"gemini-1.0-pro-002\",\n",
" \"embedding_model_id\": \"textembedding-gecko\",\n",
"}\n",
"\n",
"\n",
"vertextai_llm = ChatVertexAI(model_name=config[\"chat_model_id\"],)\n",
"vertextai_embeddings = VertexAIEmbeddings(model_name=config[\"embedding_model_id\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from ragas import evaluate\n",
"\n",
"result = evaluate(\n",
"    dataset,  # evaluates every sample; pass dataset.select(range(1)) instead if quota is constrained\n",
"    metrics=metrics,\n",
"    llm=vertextai_llm,\n",
"    embeddings=vertextai_embeddings,\n",
")\n",
"\n",
"result.to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {
4 changes: 2 additions & 2 deletions examples/genai-rag-multimodal/outputs.tf
@@ -26,12 +26,12 @@ output "host_vpc_project_id" {

output "host_vpc_network" {
description = "This is the Self-link of the Host VPC network"
-  value = var.network
+  value = google_workbench_instance.instance.gce_setup[0].network_interfaces[0].network
}

output "notebook_project_id" {
description = "The Project ID where the notebook will be run on"
-  value = var.machine_learning_project
+  value = google_workbench_instance.instance.project
}

output "vector_search_bucket_name" {
23 changes: 23 additions & 0 deletions examples/genai-rag-multimodal/versions.tf
@@ -0,0 +1,23 @@
/**
* Copyright 2024 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

terraform {
  required_providers {
    google = {
      version = "~> 5.34.0"
    }
  }
}
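
The `~>` operator in the constraint above lets only the rightmost version component increase, so `~> 5.34.0` accepts `5.34.x` releases but not `5.35.0`. A rough Python sketch of that rule follows. It is a simplification that handles plain numeric versions only (no pre-release suffixes), and `satisfies_pessimistic` is a hypothetical helper, not Terraform's own implementation.

```python
def satisfies_pessimistic(version: str, constraint_version: str) -> bool:
    """Return True if `version` satisfies `~> constraint_version`."""
    target = [int(part) for part in constraint_version.split(".")]
    actual = [int(part) for part in version.split(".")]
    if len(target) < 2:
        raise ValueError("this sketch needs at least two version components")

    # Upper bound: bump the second-to-last component and drop the last one,
    # e.g. 5.34.0 -> 5.35 and 1.2 -> 2.
    upper = target[:-2] + [target[-2] + 1]

    # Pad with zeros so that 5.35 and 5.35.0 compare as equal.
    width = max(len(target), len(actual), len(upper))

    def pad(parts):
        return parts + [0] * (width - len(parts))

    return pad(target) <= pad(actual) < pad(upper)


print(satisfies_pessimistic("5.34.2", "5.34.0"))  # prints True
print(satisfies_pessimistic("5.35.0", "5.34.0"))  # prints False
```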