-
Notifications
You must be signed in to change notification settings - Fork 817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added AWS Bedrock embeddings #1738
Conversation
unstructured/embed/bedrock.py
Outdated
return np.array(self.bedrock_client.embed_query(query)) | ||
|
||
def embed_documents(self, elements: List[Element]) -> List[Element]: | ||
embeddings = [np.array(self.bedrock_client.embed_query(str(e))) for e in elements] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
embeddings = [np.array(self.bedrock_client.embed_query(str(e))) for e in elements] | |
embeddings = [np.array(self.embed_query(str(e))) for e in elements] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually langchain has a bulk embed api: https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.bedrock.BedrockEmbeddings.html#langchain.embeddings.bedrock.BedrockEmbeddings.embed_documents
so better to do:
embeddings = self.bedrock_client.embed_documents([str(e) for e in elements])
unstructured/embed/bedrock.py
Outdated
def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]: | ||
assert len(elements) == len(embeddings) | ||
elements_w_embedding = [] | ||
|
||
for i, element in enumerate(elements): | ||
original_method = element.to_dict | ||
|
||
def new_to_dict(self): | ||
d = original_method() | ||
d["embeddings"] = self.embeddings | ||
return d | ||
|
||
element.embeddings = embeddings[i] | ||
elements_w_embedding.append(element) | ||
element.to_dict = types.MethodType(new_to_dict, element) | ||
return elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is using
unstructured/unstructured/embed/openai.py
Lines 37 to 52 in 97f8326
def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]: | |
assert len(elements) == len(embeddings) | |
elements_w_embedding = [] | |
for i, element in enumerate(elements): | |
original_method = element.to_dict | |
def new_to_dict(self): | |
d = original_method() | |
d["embeddings"] = self.embeddings | |
return d | |
element.embeddings = embeddings[i] | |
elements_w_embedding.append(element) | |
element.to_dict = types.MethodType(new_to_dict, element) | |
return elements |
this part of the code essentially depend on
element
data definition so can be shared by different embeddings. So we can consider refactor this either:
- as a method for elements
- as a utils func in
unstructured/embed/utils.py
so both here and openai version can import the same function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also just FYI: that logic is buggy. fix on route over here
unstructured/embed/bedrock.py
Outdated
|
||
def new_to_dict(self): | ||
d = original_method() | ||
d["embeddings"] = self.embeddings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah; given this operation here I mean leaning even more toward moving this function to be part of element (here self
refers to an element... NOT this class itself
unstructured/embed/bedrock.py
Outdated
def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]: | ||
assert len(elements) == len(embeddings) | ||
elements_w_embedding = [] | ||
|
||
for i, element in enumerate(elements): | ||
original_method = element.to_dict | ||
|
||
def new_to_dict(self): | ||
d = original_method() | ||
d["embeddings"] = self.embeddings | ||
return d | ||
|
||
element.embeddings = embeddings[i] | ||
elements_w_embedding.append(element) | ||
element.to_dict = types.MethodType(new_to_dict, element) | ||
return elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is using
unstructured/unstructured/embed/openai.py
Lines 37 to 52 in 97f8326
def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]: | |
assert len(elements) == len(embeddings) | |
elements_w_embedding = [] | |
for i, element in enumerate(elements): | |
original_method = element.to_dict | |
def new_to_dict(self): | |
d = original_method() | |
d["embeddings"] = self.embeddings | |
return d | |
element.embeddings = embeddings[i] | |
elements_w_embedding.append(element) | |
element.to_dict = types.MethodType(new_to_dict, element) | |
return elements |
this part of the code essentially depend on
element
data definition so can be shared by different embeddings. So we can consider refactor this either:
- as a method for elements
- as a utils func in
unstructured/embed/utils.py
so both here and openai version can import the same function
Reminder to update the docs: https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/bricks/embedding.rst |
|
Let's wrap potential connection errors as in: unstructured/unstructured/embed/openai.py Line 44 in 282b8f7
|
Co-authored-by: Ahmet Melek <[email protected]>
Can we add a mock test as in: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/embed/test_openai.py |
Co-authored-by: Ahmet Melek <[email protected]>
Co-authored-by: Ahmet Melek <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
71a91cc
to
72c62e0
Compare
Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" for the embedding model.
Test