Merge branch 'main' into klaijan/ci-text-extraction
Klaijan authored Oct 19, 2023
2 parents 0f0dec6 + a0b44f7 commit 506cda0
Showing 52 changed files with 362 additions and 89 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,13 +1,17 @@
## 0.10.25-dev0
## 0.10.25-dev2

### Enhancements

* **Add CI evaluation workflow** Adds evaluation metrics to the current ingest workflow to measure the performance of each file extracted as well as aggregated-level performance.

### Features

* **Add AWS bedrock embedding connector** `unstructured.embed.bedrock` now provides a connector to use AWS bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS bedrock setup and an internet connection to run.

### Fixes

* **Import PDFResourceManager more directly** We were importing `PDFResourceManager` from `pdfminer.converter` which was causing an error for some users. We changed to import from the actual location of `PDFResourceManager`, which is `pdfminer.pdfinterp`.

## 0.10.24

### Enhancements
4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/build.in
# pip-compile --constraint=requirements/constraints.in requirements/build.in
#
alabaster==0.7.13
# via sphinx
@@ -116,7 +116,7 @@ sphinxcontrib-serializinghtml==1.1.5
# via
# -r requirements/build.in
# sphinx
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
55 changes: 55 additions & 0 deletions docs/source/bricks/embedding.rst
@@ -45,14 +45,69 @@ To obtain an api key, visit: https://platform.openai.com/account/api-keys
    from unstructured.documents.elements import Text
    from unstructured.embed.openai import OpenAIEmbeddingEncoder

    # Initialize the encoder with OpenAI credentials
    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])

    # Embed a list of Elements
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    # Embed a single query string
    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    # Print embeddings
    [print(e.embeddings, e) for e in elements]
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

``BedrockEmbeddingEncoder``
---------------------------

The ``BedrockEmbeddingEncoder`` class obtains text embeddings from AWS Bedrock through the langchain integration. It connects to the Bedrock Runtime using AWS's boto3 package.

Key methods and attributes include:

``embed_documents``: This function takes a list of Elements as its input and returns the same list with an updated embeddings attribute for each Element.

``embed_query``: This method takes a query as a string and returns the embedding vector for the given query string.

``num_of_dimensions``: A metadata property that signifies the number of dimensions in any embedding vector obtained via this class.

``is_unit_vector``: A metadata property that checks if embedding vectors obtained via this class are unit vectors.
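
The two metadata checks above can be illustrated without calling Bedrock. The snippet below is a minimal sketch, assuming an embedding is a plain sequence of floats; the standalone helpers reuse the names ``num_of_dimensions`` and ``is_unit_vector`` for readability only, and they are illustrative stand-ins rather than the library's implementation (which this diff does not show and which may return values in a different shape).

```python
import math
from typing import Sequence


def num_of_dimensions(embedding: Sequence[float]) -> int:
    """Return the length of one embedding vector."""
    return len(embedding)


def is_unit_vector(embedding: Sequence[float], tol: float = 1e-6) -> bool:
    """Return True when the Euclidean (L2) norm is 1 within `tol`."""
    norm = math.sqrt(sum(x * x for x in embedding))
    return math.isclose(norm, 1.0, abs_tol=tol)


# Hypothetical three-dimensional embedding, used only for illustration.
example = [0.6, 0.8, 0.0]
print(num_of_dimensions(example), is_unit_vector(example))
```

Knowing whether vectors are unit length matters downstream: for unit vectors, cosine similarity reduces to a dot product.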

Initialization:
To create an instance of the ``BedrockEmbeddingEncoder``, AWS credentials and the region name are required.

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.bedrock import BedrockEmbeddingEncoder

    # Initialize the encoder with AWS credentials
    embedding_encoder = BedrockEmbeddingEncoder(
        aws_access_key_id="YOUR_AWS_ACCESS_KEY_ID",
        aws_secret_access_key="YOUR_AWS_SECRET_ACCESS_KEY",
        region_name="us-west-2"
    )

    # Embed a list of Elements
    elements = embedding_encoder.embed_documents(
        elements=[Text("Sentence A"), Text("Sentence B")]
    )

    # Embed a single query string
    query = "Example query"
    query_embedding = embedding_encoder.embed_query(query=query)

    # Print embeddings
    [print(e.embeddings, e) for e in elements]
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

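
The example above passes placeholder credentials as string literals. One way to avoid hard-coding real secrets is to read them from the environment; the following is a sketch under stated assumptions, not part of the library. ``bedrock_kwargs_from_env`` is a hypothetical helper introduced here, the variable names ``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY`` and ``AWS_DEFAULT_REGION`` are the standard AWS ones, and the ``us-west-2`` fallback simply mirrors the example.

```python
import os
from typing import Dict, Mapping, Optional


def bedrock_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the encoder's keyword arguments from environment variables.

    Hypothetical helper: it only assembles a dict and never contacts AWS.
    """
    env = os.environ if env is None else env
    required = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    }
    # Fail early with a clear message when a credential is unset or empty.
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    kwargs = {arg: env[name] for arg, name in required.items()}
    # Assumption: fall back to the region used in the example.
    kwargs["region_name"] = env.get("AWS_DEFAULT_REGION", "us-west-2")
    return kwargs


# Intended use (requires the unstructured package and valid AWS access):
# embedding_encoder = BedrockEmbeddingEncoder(**bedrock_kwargs_from_env())
```

The helper takes an optional mapping so it can be exercised with a plain dictionary instead of the real process environment.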
Dependencies:
This class depends on boto3, numpy, and langchain. Ensure they are installed in the environment where the class is used.
4 changes: 2 additions & 2 deletions requirements/base.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/base.in
# pip-compile --constraint=requirements/constraints.in requirements/base.in
#
backoff==2.2.1
# via -r requirements/base.in
@@ -66,7 +66,7 @@ typing-extensions==4.8.0
# via typing-inspect
typing-inspect==0.9.0
# via dataclasses-json
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/constraints.in
# requests
4 changes: 2 additions & 2 deletions requirements/build.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/build.in
# pip-compile --constraint=requirements/constraints.in requirements/build.in
#
alabaster==0.7.13
# via sphinx
@@ -116,7 +116,7 @@ sphinxcontrib-serializinghtml==1.1.5
# via
# -r requirements/build.in
# sphinx
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
2 changes: 2 additions & 0 deletions requirements/constraints.in
@@ -5,6 +5,8 @@
####################################################################################################
# NOTE(alan): Pinning to avoid conflicts with downstream ingest-s3
urllib3<1.27, >=1.25.4
boto3<1.28.18
botocore<1.31.18
# consistency with local-inference-pin
protobuf<4.24
# NOTE(robinson) - Required pins for security scans
6 changes: 3 additions & 3 deletions requirements/dev.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/dev.in
# pip-compile --constraint=requirements/constraints.in requirements/dev.in
#
anyio==3.7.1
# via
@@ -213,7 +213,7 @@ nest-asyncio==1.5.8
# via ipykernel
nodeenv==1.8.0
# via pre-commit
notebook==7.0.5
notebook==7.0.6
# via jupyter
notebook-shim==0.2.3
# via
@@ -390,7 +390,7 @@ typing-extensions==4.8.0
# ipython
uri-template==1.3.0
# via jsonschema
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
2 changes: 1 addition & 1 deletion requirements/extra-csv.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-csv.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-csv.in
#
numpy==1.24.4
# via
2 changes: 1 addition & 1 deletion requirements/extra-docx.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-docx.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-docx.in
#
lxml==4.9.3
# via
2 changes: 1 addition & 1 deletion requirements/extra-epub.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-epub.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-epub.in
#
ebooklib==0.18
# via -r requirements/extra-epub.in
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-markdown.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-markdown.in
#
importlib-metadata==6.8.0
# via markdown
2 changes: 1 addition & 1 deletion requirements/extra-msg.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-msg.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-msg.in
#
msg-parser==1.2.0
# via -r requirements/extra-msg.in
2 changes: 1 addition & 1 deletion requirements/extra-odt.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-odt.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-odt.in
#
lxml==4.9.3
# via
6 changes: 3 additions & 3 deletions requirements/extra-paddleocr.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-paddleocr.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-paddleocr.in
#
attrdict==2.0.1
# via unstructured-paddleocr
@@ -35,7 +35,7 @@ cssutils==2.9.0
# via premailer
cycler==0.12.1
# via matplotlib
cython==3.0.3
cython==3.0.4
# via unstructured-paddleocr
et-xmlfile==1.1.0
# via openpyxl
@@ -213,7 +213,7 @@ tzdata==2023.3
# via pandas
unstructured-paddleocr==2.6.1.3
# via -r requirements/extra-paddleocr.in
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
2 changes: 1 addition & 1 deletion requirements/extra-pandoc.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-pandoc.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-pandoc.in
#
pypandoc==1.12
# via -r requirements/extra-pandoc.in
8 changes: 4 additions & 4 deletions requirements/extra-pdf-image.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-pdf-image.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-pdf-image.in
#
antlr4-python3-runtime==4.9.3
# via omegaconf
@@ -223,7 +223,7 @@ tqdm==4.66.1
# huggingface-hub
# iopath
# transformers
transformers==4.34.0
transformers==4.34.1
# via unstructured-inference
typing-extensions==4.8.0
# via
@@ -234,13 +234,13 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.7.5
unstructured-inference==0.7.7
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
# -c requirements/constraints.in
# -r requirements/extra-pdf-image.in
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
2 changes: 1 addition & 1 deletion requirements/extra-pptx.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-pptx.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-pptx.in
#
lxml==4.9.3
# via python-pptx
2 changes: 1 addition & 1 deletion requirements/extra-xlsx.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-xlsx.in
# pip-compile --constraint=requirements/constraints.in requirements/extra-xlsx.in
#
et-xmlfile==1.1.0
# via openpyxl
6 changes: 3 additions & 3 deletions requirements/huggingface.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/huggingface.in
# pip-compile --constraint=requirements/constraints.in requirements/huggingface.in
#
certifi==2023.7.22
# via
@@ -102,14 +102,14 @@ tqdm==4.66.1
# huggingface-hub
# sacremoses
# transformers
transformers==4.34.0
transformers==4.34.1
# via -r requirements/huggingface.in
typing-extensions==4.8.0
# via
# -c requirements/base.txt
# huggingface-hub
# torch
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
4 changes: 2 additions & 2 deletions requirements/ingest-airtable.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/ingest-airtable.in
# pip-compile --constraint=requirements/constraints.in requirements/ingest-airtable.in
#
certifi==2023.7.22
# via
@@ -34,7 +34,7 @@ typing-extensions==4.8.0
# -c requirements/base.txt
# pyairtable
# pydantic
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
4 changes: 2 additions & 2 deletions requirements/ingest-azure-cognitive-search.txt
@@ -2,7 +2,7 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/ingest-azure-cognitive-search.in
# pip-compile --constraint=requirements/constraints.in requirements/ingest-azure-cognitive-search.in
#
azure-common==1.1.28
# via azure-search-documents
@@ -50,7 +50,7 @@ typing-extensions==4.8.0
# -c requirements/base.txt
# azure-core
# azure-search-documents
urllib3==1.26.17
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in