feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <[email protected]>
3 people authored Mar 28, 2024
1 parent 887e6c9 commit d467922
Showing 20 changed files with 24,484 additions and 4 deletions.
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -1,11 +1,12 @@
## 0.13.0-dev13
## 0.13.0-dev14

### Enhancements

* **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
* **Add `composite_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element-level row and column index and content accuracy scores.
* **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.
* **Add `--include_orig_elements` option to Ingest CLI.** By default, when chunking, the original elements used to form each chunk are added to `chunk.metadata.orig_elements`. The `include_orig_elements` parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
* **Add Google VertexAI embedder.** Adds VertexAI embeddings to support embedding via Google Vertex AI.

### Features

53 changes: 53 additions & 0 deletions docs/source/core/embedding.rst
@@ -171,6 +171,59 @@ To obtain an api key, visit: https://octo.ai/docs/getting-started/how-to-create-
    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    [print(e.embeddings, e) for e in elements]
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

``VertexAIEmbeddingEncoder``
----------------------------

The ``VertexAIEmbeddingEncoder`` class connects to GCP Vertex AI to obtain embeddings for pieces of text.

``embed_documents`` will receive a list of Elements, and return an updated list which
includes the ``embeddings`` attribute for each Element.

``embed_query`` will receive a query as a string, and return a list of floats which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes if embedding vectors obtained via
this class are unit vectors.
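
As a reminder of what "unit vector" means here: a vector whose Euclidean (L2) length is 1. The sketch below checks that condition for a plain list of floats. It is independent of the library, and the helper names ``l2_norm`` and ``looks_like_unit_vector`` are hypothetical, chosen only for illustration.

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) length of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vector))


def looks_like_unit_vector(vector, tol=1e-6):
    """Return True if the vector's length is 1 within an absolute tolerance.

    Hypothetical helper for illustration; not part of unstructured.
    """
    return math.isclose(l2_norm(vector), 1.0, abs_tol=tol)


# A 3-4-5 triangle scaled to length 1, and a vector that is clearly longer.
normalized = [0.6, 0.8]
unnormalized = [1.0, 1.0]
print(looks_like_unit_vector(normalized), looks_like_unit_vector(unnormalized))
```

When embeddings are unit vectors, the dot product of two embeddings equals their cosine similarity, which is one reason downstream code may care about this property.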

The following code block shows an example of how to use ``VertexAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.

To use Vertex AI PaLM, you will need to do one of the following:

- Pass the full JSON content of your GCP Vertex AI application credentials to
  ``VertexAIEmbeddingConfig`` as the ``api_key`` parameter. (This creates a file in the ``/tmp``
  directory containing the JSON and sets the ``GOOGLE_APPLICATION_CREDENTIALS`` environment
  variable to the **path** of the created file.)
- Store the path to a manually created service account JSON file in the
  ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable. (For more information:
  https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm)
- Have credentials already configured for your environment (gcloud,
  workload identity, etc.).

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder

    embedding_encoder = VertexAIEmbeddingEncoder(
        config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
    )

    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    [print(e.embeddings, e) for e in elements]
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
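
The first credential option writes the JSON content to a file and points ``GOOGLE_APPLICATION_CREDENTIALS`` at it. The following is a minimal sketch of that general idea using only the standard library. It is not the code from this commit: the helper name ``write_credentials_to_temp_file`` is hypothetical, and the actual implementation may choose the file name and location differently.

```python
import json
import os
import tempfile


def write_credentials_to_temp_file(creds_json: str) -> str:
    """Write credential JSON to a temp file and export its path.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Fail early if the string is not valid JSON.
    json.loads(creds_json)

    # mkstemp creates the file with owner-only permissions on POSIX systems.
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(creds_json)

    # Google client libraries read this variable to locate the key file.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return path


# Placeholder content only; a real service account file has more fields.
example_path = write_credentials_to_temp_file('{"type": "service_account"}')
print(example_path)
```

Because the file holds a secret, a caller would normally delete it once it is no longer needed.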
30 changes: 30 additions & 0 deletions examples/embed/example_vertexai.py
@@ -0,0 +1,30 @@
import os

from unstructured.documents.elements import Text
from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder

# To use Vertex AI PaLM, you will need to do one of the following:
# - Pass the full JSON content of your GCP Vertex AI application credentials to
#   VertexAIEmbeddingConfig as the api_key parameter. (This creates a file in the /tmp
#   directory containing the JSON and sets the GOOGLE_APPLICATION_CREDENTIALS environment
#   variable to the path of the created file.)
# - Store the path to a manually created service account JSON file in the
#   GOOGLE_APPLICATION_CREDENTIALS environment variable. (For more information:
#   https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm)
# - Have credentials already configured for your environment (gcloud,
#   workload identity, etc.).

embedding_encoder = VertexAIEmbeddingEncoder(
config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
)

elements = embedding_encoder.embed_documents(
elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

[print(e.embeddings, e) for e in elements]
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
4 changes: 4 additions & 0 deletions requirements/ingest/embed-octoai.in
@@ -0,0 +1,4 @@
-c ../constraints.in
-c ../base.txt
openai
tiktoken
72 changes: 72 additions & 0 deletions requirements/ingest/embed-octoai.txt
@@ -0,0 +1,72 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
# pip-compile --output-file=ingest/embed-octoai.txt ingest/embed-octoai.in
#
anyio==3.7.1
# via
# -c ingest/../constraints.in
# httpx
# openai
certifi==2024.2.2
# via
# -c ingest/../base.txt
# -c ingest/../constraints.in
# httpcore
# httpx
# requests
charset-normalizer==3.3.2
# via
# -c ingest/../base.txt
# requests
distro==1.9.0
# via openai
exceptiongroup==1.2.0
# via anyio
h11==0.14.0
# via httpcore
httpcore==1.0.4
# via httpx
httpx==0.27.0
# via openai
idna==3.6
# via
# -c ingest/../base.txt
# anyio
# httpx
# requests
openai==1.14.3
# via -r ingest/embed-octoai.in
pydantic==1.10.14
# via
# -c ingest/../constraints.in
# openai
regex==2023.12.25
# via
# -c ingest/../base.txt
# tiktoken
requests==2.31.0
# via
# -c ingest/../base.txt
# tiktoken
sniffio==1.3.1
# via
# anyio
# httpx
# openai
tiktoken==0.6.0
# via -r ingest/embed-octoai.in
tqdm==4.66.2
# via
# -c ingest/../base.txt
# openai
typing-extensions==4.10.0
# via
# -c ingest/../base.txt
# openai
# pydantic
urllib3==2.2.1
# via
# -c ingest/../base.txt
# requests
5 changes: 5 additions & 0 deletions requirements/ingest/embed-vertexai.in
@@ -0,0 +1,5 @@
-c ../constraints.in
-c ../base.txt
langchain
langchain-community
langchain-google-vertexai