
Added AWS Bedrock embeddings #1738

Merged: 19 commits into main from jack/add-bedrock-embedding on Oct 19, 2023
Conversation

@jackretterer (Contributor) commented on Oct 12, 2023:

Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" as the embedding model.
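For context, a minimal usage sketch of the new encoder; the module path (unstructured/embed/bedrock.py), the class name (BedrockEmbeddingEncoder), and the constructor parameters are assumptions patterned on the repo's existing OpenAI encoder, not confirmed API:

    # Hedged sketch: module path, class name, and parameters below are assumptions.
    from unstructured.documents.elements import Text
    from unstructured.embed.bedrock import BedrockEmbeddingEncoder

    encoder = BedrockEmbeddingEncoder(
        aws_access_key_id="<key>",          # assumed parameter names
        aws_secret_access_key="<secret>",
        region_name="us-west-2",
    )
    elements = encoder.embed_documents(elements=[Text("hello"), Text("world")])
    print(elements[0].embeddings[:5])  # first few floats of the Titan embedding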

Test

        return np.array(self.bedrock_client.embed_query(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        embeddings = [np.array(self.bedrock_client.embed_query(str(e))) for e in elements]
Collaborator:

Suggested change:
- embeddings = [np.array(self.bedrock_client.embed_query(str(e))) for e in elements]
+ embeddings = [np.array(self.embed_query(str(e))) for e in elements]

Collaborator:

Actually, langchain has a bulk embed API: https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.bedrock.BedrockEmbeddings.html#langchain.embeddings.bedrock.BedrockEmbeddings.embed_documents
so it's better to do:

embeddings = self.bedrock_client.embed_documents([str(e) for e in elements])
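A sketch of how both methods could look with the bulk API, assuming self.bedrock_client is a langchain BedrockEmbeddings instance and keeping the np.array conversion from the original:

    import numpy as np
    from typing import List

    from unstructured.documents.elements import Element

    def embed_query(self, query: str) -> np.ndarray:
        return np.array(self.bedrock_client.embed_query(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        # One batched request instead of a per-element embed_query round trip.
        embeddings = self.bedrock_client.embed_documents([str(e) for e in elements])
        return self._add_embeddings_to_elements(elements, embeddings)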

Comment on lines 39 to 54
def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]:
    assert len(elements) == len(embeddings)
    elements_w_embedding = []

    for i, element in enumerate(elements):
        original_method = element.to_dict

        def new_to_dict(self):
            d = original_method()
            d["embeddings"] = self.embeddings
            return d

        element.embeddings = embeddings[i]
        elements_w_embedding.append(element)
        element.to_dict = types.MethodType(new_to_dict, element)
    return elements
Collaborator:

I see this is using the _add_embeddings_to_elements implementation quoted above.

This part of the code essentially depends on the element data definition, so it can be shared by different embedding encoders. We could consider refactoring it either:

  • as a method on elements, or
  • as a utils func in unstructured/embed/utils.py, so both this and the OpenAI version can import the same function (see the sketch after this list)
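A minimal sketch of the second option: a shared helper in unstructured/embed/utils.py (hypothetical module). The body mirrors the PR code, with original_method bound as a default argument so each closure keeps its own element's to_dict (see the bug note below):

    import types
    from typing import Any, List

    from unstructured.documents.elements import Element

    def add_embeddings_to_elements(elements: List[Element], embeddings: List[Any]) -> List[Element]:
        assert len(elements) == len(embeddings)
        for element, embedding in zip(elements, embeddings):
            original_method = element.to_dict

            # Default argument binds the current value; a plain closure would
            # late-bind and leave every element calling the last to_dict.
            def new_to_dict(self, original_method=original_method):
                d = original_method()
                d["embeddings"] = self.embeddings
                return d

            element.embeddings = embedding
            element.to_dict = types.MethodType(new_to_dict, element)
        return elements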

Contributor:

Also, just FYI: that logic is buggy; a fix is en route over here.


def new_to_dict(self):
    d = original_method()
    d["embeddings"] = self.embeddings
Collaborator:

Yeah; given this operation here, I'm leaning even more toward moving this function to be part of Element (here self refers to an element... NOT this class itself).
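For reference, the "that logic is buggy" note above most likely refers to Python's late-binding closures (an inference from the code shape, not stated in the thread): new_to_dict closes over original_method, which is rebound on every loop iteration, so after the loop every element's patched to_dict would resolve to the last element's original method. In isolation:

    # Both closures see the loop variable's final value, not the value
    # at definition time.
    funcs = []
    for name in ["a", "b"]:
        def f():
            return name
        funcs.append(f)
    print([fn() for fn in funcs])  # ['b', 'b'], not ['a', 'b']

    # Binding the value as a default argument captures it per iteration.
    funcs = [lambda name=name: name for name in ["a", "b"]]
    print([fn() for fn in funcs])  # ['a', 'b']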


@ahmetmeleq (Contributor) commented:

  1. Let's form an extra named embed-aws-bedrock and add it as an extra key to setup.py, as in:

     "openai": load_requirements("requirements/ingest-openai.in"),

  2. We need an .in requirements file for this to work (we can name it embed-aws-bedrock.in). Example:
     https://github.com/Unstructured-IO/unstructured/blob/282b8f700d9471b9430e8a37af73ca96b980e0f0/requirements/ingest-openai.in

  3. Finally, let's add a @requires_dependencies decorator to the methods that use the extra libraries explicitly (see the sketch after this list). Example:

     @EmbeddingEncoderConnectionError.wrap
     @requires_dependencies(
         ["langchain", "openai", "tiktoken"],
         extras="openai",
     )
     def get_openai_client(self):

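Putting the three steps together for Bedrock, a hedged sketch; the extra name, the requirement list, and the import paths are assumptions patterned on the OpenAI example above:

    from unstructured.ingest.error import EmbeddingEncoderConnectionError
    from unstructured.utils import requires_dependencies

    class BedrockEmbeddingEncoder:
        @EmbeddingEncoderConnectionError.wrap
        @requires_dependencies(
            ["boto3", "numpy", "langchain"],  # assumed contents of embed-aws-bedrock.in
            extras="embed-aws-bedrock",
        )
        def get_bedrock_client(self):
            # Imported lazily so a missing extra surfaces a clear install
            # hint instead of a bare ImportError.
            from langchain.embeddings import BedrockEmbeddings

            return BedrockEmbeddings(model_id="amazon.titan-tg1-large")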
@ahmetmeleq (Contributor) commented:

Let's wrap potential connection errors as in:

@EmbeddingEncoderConnectionError.wrap


@badGarnet requested a review from ahmetmeleq on October 18, 2023 at 17:19.
@ahmetmeleq (Contributor) left a review:

LGTM!

@ahmetmeleq force-pushed the jack/add-bedrock-embedding branch from 71a91cc to 72c62e0 on October 18, 2023 at 20:21.
@badGarnet enabled auto-merge on October 18, 2023 at 23:52.
@badGarnet disabled auto-merge on October 19, 2023 at 00:36.
@badGarnet merged commit b8f24ba into main on Oct 19, 2023 (38 of 39 checks passed).
@badGarnet deleted the jack/add-bedrock-embedding branch on October 19, 2023 at 00:36.