forked from google-ai-edge/mediapipe
Commit
adding text embedder sample for python
Showing 1 changed file with 213 additions and 0 deletions.
@@ -0,0 +1,213 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "h2q27gKz1H20"
},
"source": [
"**Copyright 2023 The MediaPipe Authors. All Rights Reserved.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TUfAcER1oUS6"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L_cQX8dWu4Dv"
},
"source": [
"# Text Embedding with MediaPipe Tasks"
]
},
{
"cell_type": "markdown",
"source": [
"This notebook will show you how to use the MediaPipe Tasks Python API to compare text items, giving you a value for how similar they are. These values will range from -1 to 1 with 1 being the same text. This is done through cosine similarity." | ||
],
"metadata": {
"id": "jERPUr5-vbYk"
}
},
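{
"cell_type": "markdown",
"source": [
"The next cell is a small illustrative sketch added for this walkthrough and is not part of the MediaPipe API: it uses plain NumPy vectors to show how cosine similarity behaves, so you know what to expect from the scores later in this notebook. Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: cosine similarity on small hand-made vectors.\n",
"# The MediaPipe text embeddings compared later in this notebook are\n",
"# scored with the same formula, just in a much higher-dimensional space.\n",
"import numpy as np\n",
"\n",
"def cosine_similarity(u, v):\n",
"  return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))\n",
"\n",
"a = np.array([1.0, 2.0, 3.0])\n",
"print(cosine_similarity(a, a))    # 1.0: same direction\n",
"print(cosine_similarity(a, -a))   # -1.0: opposite direction\n",
"print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))  # 0.0: orthogonal"
]
},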
{
"cell_type": "markdown",
"source": [
"## Preparation\n",
"\n",
"Let's start with installing MediaPipe.\n",
"\n",
"* If you see an error about flatbuffers incompatibility, it's fine to ignore it.\n",
"MediaPipe requires a newer version of flatbuffers (v2), which is incompatible with the older version of TensorFlow (v2.9) currently preinstalled on Colab.\n",
"\n",
"* If you install MediaPipe outside of Colab, you only need to run `pip install mediapipe`. It isn't necessary to explicitly install flatbuffers."
],
"metadata": {
"id": "7LSy09XEvuJy"
}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "gxbHBsF-8Y_l",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "a6944bf4-3d62-4810-9b1a-339d953c9f69"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"tensorflow 2.9.2 requires flatbuffers<2,>=1.12, but you have flatbuffers 2.0 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m33.0/33.0 MB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
}
],
"source": [
"!pip install -q flatbuffers==2.0.0\n",
"!pip install -q mediapipe==0.9.0.1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a49D7h4TVmru"
},
"source": [
"After installing your dependencies, you can download the text embedder model that will be used for this example."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "OMjuVQiDYJKF"
},
"outputs": [],
"source": [
"#@title Start downloading here.\n",
"!wget -O embedder.tflite -q https://storage.googleapis.com/mediapipe-assets/mobilebert_embedding_with_metadata.tflite"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Iy4r2_ePylIa"
},
"source": [
"## Running inference\n",
"\n",
"To run inference using the MediaPipe Text Embedder task, you just need to initialize the `TextEmbedder` using the model that you downloaded earlier, and then use that `TextEmbedder` to compare two separate strings. You can edit the two strings that will be used on the side of this section where you see `first_text` and `second_text`." | ||
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "Yl_Oiye4mUuo",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d26b3a4b-c4d4-42d8-8ff0-38074817f4af"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"0.9332579993217945\n"
]
}
],
"source": [
"from mediapipe.tasks import python\n", | ||
"from mediapipe.tasks.python.components import processors\n", | ||
"from mediapipe.tasks.python import text\n", | ||
"from scipy.spatial import distance\n", | ||
"\n", | ||
"# Create your base options with the model that was downloaded earlier\n", | ||
"base_options = python.BaseOptions(model_asset_path='embedder.tflite')\n", | ||
"\n", | ||
"# Set your values for using normalization and quantization\n", | ||
"l2_normalize = True #@param {type:\"boolean\"}\n", | ||
"quantize = False #@param {type:\"boolean\"}\n", | ||
"\n", | ||
"# Create the final set of options for the Embedder\n", | ||
"options = text.TextEmbedderOptions(\n", | ||
" base_options=base_options, l2_normalize=l2_normalize, quantize=quantize)\n", | ||
"\n", | ||
"with text.TextEmbedder.create_from_options(options) as embedder:\n", | ||
" # Retrieve the first and second sets of text that will be compared\n", | ||
" first_text = \"I'm feeling so good\" #@param {type:\"string\"}\n", | ||
" second_text = \"I'm okay I guess\" #@param {type:\"string\"}\n", | ||
"\n", | ||
" # Convert both sets of text to embeddings\n", | ||
" first_embedding_result = embedder.embed(first_text)\n", | ||
" second_embedding_result = embedder.embed(second_text)\n", | ||
" first_embedding_result.embeddings[0]\n", | ||
"\n", | ||
" # Retrieve the cosine similarity value from both sets of text, then take the\n", | ||
" # cosine of that value to receie a decimal similarity value.\n", | ||
" similarity = 1 - distance.cosine(first_embedding_result.embeddings[0].embedding.astype('float'),\n", | ||
" second_embedding_result.embeddings[0].embedding.astype('float'))\n", | ||
" print(similarity)" | ||
]
},
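{
"cell_type": "markdown",
"source": [
"As an optional sanity check added for this walkthrough (not part of the original sample), the cell below reuses the `options`, `text`, and `distance` objects from the cell above to recreate the embedder and compare a string with itself. The similarity should come out at, or very close to, 1.0, matching the -1 to 1 range described at the top of this notebook."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (illustrative addition): comparing a piece of\n",
"# text with itself should produce a similarity of ~1.0.\n",
"with text.TextEmbedder.create_from_options(options) as embedder:\n",
"  sample_text = \"I'm feeling so good\"\n",
"  result_a = embedder.embed(sample_text)\n",
"  result_b = embedder.embed(sample_text)\n",
"\n",
"  self_similarity = 1 - distance.cosine(result_a.embeddings[0].embedding.astype('float'),\n",
"                                        result_b.embeddings[0].embedding.astype('float'))\n",
"  print(self_similarity)  # Expected to be (almost exactly) 1.0"
]
},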
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "Yn0qjHhVVlc_"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}