
Added AWS Bedrock embeddings #1738

Merged: 19 commits into main from jack/add-bedrock-embedding on Oct 19, 2023
Conversation

@jackretterer (Contributor) commented on Oct 12, 2023:

Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" as the embedding model.
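For context, a minimal usage sketch of the new encoder; the module path (unstructured/embed/bedrock.py), the class name (BedrockEmbeddingEncoder), and the constructor parameters are assumptions patterned on the repo's existing OpenAI encoder, not confirmed API:

    # Hedged sketch: module path, class name, and parameters below are assumptions.
    from unstructured.documents.elements import Text
    from unstructured.embed.bedrock import BedrockEmbeddingEncoder

    encoder = BedrockEmbeddingEncoder(
        aws_access_key_id="<key>",          # assumed parameter names
        aws_secret_access_key="<secret>",
        region_name="us-west-2",
    )
    elements = encoder.embed_documents(elements=[Text("hello"), Text("world")])
    print(elements[0].embeddings[:5])  # first few floats of the Titan embedding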

Test

        return np.array(self.bedrock_client.embed_query(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        embeddings = [np.array(self.bedrock_client.embed_query(str(e))) for e in elements]
Collaborator:

Suggested change:
- embeddings = [np.array(self.bedrock_client.embed_query(str(e))) for e in elements]
+ embeddings = [np.array(self.embed_query(str(e))) for e in elements]

Collaborator:

Actually, langchain has a bulk embed API: https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.bedrock.BedrockEmbeddings.html#langchain.embeddings.bedrock.BedrockEmbeddings.embed_documents
so it's better to do:

embeddings = self.bedrock_client.embed_documents([str(e) for e in elements])
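A sketch of how both methods could look with the bulk API, assuming self.bedrock_client is a langchain BedrockEmbeddings instance and keeping the np.array conversion from the original:

    import numpy as np
    from typing import List

    from unstructured.documents.elements import Element

    def embed_query(self, query: str) -> np.ndarray:
        return np.array(self.bedrock_client.embed_query(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        # One batched request instead of a per-element embed_query round trip.
        embeddings = self.bedrock_client.embed_documents([str(e) for e in elements])
        return self._add_embeddings_to_elements(elements, embeddings)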

Comment on lines 39 to 54
def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]:
    assert len(elements) == len(embeddings)
    elements_w_embedding = []

    for i, element in enumerate(elements):
        original_method = element.to_dict

        def new_to_dict(self):
            d = original_method()
            d["embeddings"] = self.embeddings
            return d

        element.embeddings = embeddings[i]
        elements_w_embedding.append(element)
        element.to_dict = types.MethodType(new_to_dict, element)
    return elements
Collaborator:

I see this is using the _add_embeddings_to_elements implementation quoted above.

This part of the code essentially depends on the element data definition, so it can be shared by different embedding encoders. We could consider refactoring it either:

  • as a method on elements, or
  • as a utils func in unstructured/embed/utils.py, so both this and the OpenAI version can import the same function (see the sketch after this list)
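A minimal sketch of the second option: a shared helper in unstructured/embed/utils.py (hypothetical module). The body mirrors the PR code, with original_method bound as a default argument so each closure keeps its own element's to_dict (see the bug note below):

    import types
    from typing import Any, List

    from unstructured.documents.elements import Element

    def add_embeddings_to_elements(elements: List[Element], embeddings: List[Any]) -> List[Element]:
        assert len(elements) == len(embeddings)
        for element, embedding in zip(elements, embeddings):
            original_method = element.to_dict

            # Default argument binds the current value; a plain closure would
            # late-bind and leave every element calling the last to_dict.
            def new_to_dict(self, original_method=original_method):
                d = original_method()
                d["embeddings"] = self.embeddings
                return d

            element.embeddings = embedding
            element.to_dict = types.MethodType(new_to_dict, element)
        return elements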

Contributor:

Also, just FYI: that logic is buggy; a fix is en route over here.


def new_to_dict(self):
    d = original_method()
    d["embeddings"] = self.embeddings
Collaborator:

Yeah; given this operation here, I'm leaning even more toward moving this function to be part of Element (here self refers to an element... NOT this class itself).
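For reference, the "that logic is buggy" note above most likely refers to Python's late-binding closures (an inference from the code shape, not stated in the thread): new_to_dict closes over original_method, which is rebound on every loop iteration, so after the loop every element's patched to_dict would resolve to the last element's original method. In isolation:

    # Both closures see the loop variable's final value, not the value
    # at definition time.
    funcs = []
    for name in ["a", "b"]:
        def f():
            return name
        funcs.append(f)
    print([fn() for fn in funcs])  # ['b', 'b'], not ['a', 'b']

    # Binding the value as a default argument captures it per iteration.
    funcs = [lambda name=name: name for name in ["a", "b"]]
    print([fn() for fn in funcs])  # ['a', 'b']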


@ahmetmeleq (Contributor) commented:

  1. Let's form an extra named embed-aws-bedrock and add it as an extra key to setup.py, as in:

     "openai": load_requirements("requirements/ingest-openai.in"),

  2. We need an .in requirements file for this to work (we can name it embed-aws-bedrock.in). Example:
     https://github.com/Unstructured-IO/unstructured/blob/282b8f700d9471b9430e8a37af73ca96b980e0f0/requirements/ingest-openai.in

  3. Finally, let's add a @requires_dependencies decorator to the methods that use the extra libraries explicitly (see the sketch after this list). Example:

     @EmbeddingEncoderConnectionError.wrap
     @requires_dependencies(
         ["langchain", "openai", "tiktoken"],
         extras="openai",
     )
     def get_openai_client(self):

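Putting the three steps together for Bedrock, a hedged sketch; the extra name, the requirement list, and the import paths are assumptions patterned on the OpenAI example above:

    from unstructured.ingest.error import EmbeddingEncoderConnectionError
    from unstructured.utils import requires_dependencies

    class BedrockEmbeddingEncoder:
        @EmbeddingEncoderConnectionError.wrap
        @requires_dependencies(
            ["boto3", "numpy", "langchain"],  # assumed contents of embed-aws-bedrock.in
            extras="embed-aws-bedrock",
        )
        def get_bedrock_client(self):
            # Imported lazily so a missing extra surfaces a clear install
            # hint instead of a bare ImportError.
            from langchain.embeddings import BedrockEmbeddings

            return BedrockEmbeddings(model_id="amazon.titan-tg1-large")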
@ahmetmeleq (Contributor) commented:

Let's wrap potential connection errors as in:

@EmbeddingEncoderConnectionError.wrap


@badGarnet requested a review from ahmetmeleq on October 18, 2023 at 17:19.
@ahmetmeleq (Contributor) left a review:

LGTM!

@ahmetmeleq force-pushed the jack/add-bedrock-embedding branch from 71a91cc to 72c62e0 on October 18, 2023 at 20:21.
@badGarnet enabled auto-merge on October 18, 2023 at 23:52.
@badGarnet disabled auto-merge on October 19, 2023 at 00:36.
@badGarnet merged commit b8f24ba into main on Oct 19, 2023 (38 of 39 checks passed).
@badGarnet deleted the jack/add-bedrock-embedding branch on October 19, 2023 at 00:36.