chore: bump inference to 0.6.6 #1563

Merged: 19 commits, Sep 29, 2023
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,13 @@
## 0.10.19-dev0
## 0.10.19-dev1

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable so that table extraction performance can be fine-tuned; it also improves element detection by adding a deduplication post-processing step to the `hi_res` partitioning of PDFs and images.
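
Reviewer note: the changelog entry does not say how the deduplication step works. As a rough illustration only, and not the method used by `unstructured-inference`, one common way to deduplicate detected layout elements is to drop any bounding box that overlaps an already kept box above an intersection-over-union (IoU) threshold. The names `iou` and `deduplicate`, and the `0.9` threshold, are hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes: List[Box], threshold: float = 0.9) -> List[Box]:
    """Keep a box only if it does not nearly coincide with a kept box."""
    kept: List[Box] = []
    for box in boxes:
        if all(iou(box, other) < threshold for other in kept):
            kept.append(box)
    return kept


if __name__ == "__main__":
    detections = [(0, 0, 10, 10), (0, 0, 10, 10.5), (20, 20, 30, 30)]
    print(deduplicate(detections))
```

The first box seen wins in this sketch; a real pipeline would more likely rank candidates by detection score or element type before filtering.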

### Features

### Fixes


## 0.10.18

7 changes: 5 additions & 2 deletions requirements/constraints.in
@@ -39,5 +39,8 @@ matplotlib==3.7.2
# NOTE(crag) - pin to available pandas for python 3.8 (at least in CI)
fsspec==2023.9.1
pandas<2.0.4
# langchain limits this to 3.1.7
anyio==3.1.7
# langchain limits anyio to below 4.0
anyio<4.0
# pinned in unstructured paddleocr
opencv-python==4.8.0.76
opencv-contrib-python==4.8.0.76
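
Reviewer note: the switch from an exact `anyio` pin to `anyio<4.0` is what lets the compiled requirement files in this PR resolve to `anyio==3.7.1` while still excluding `4.0.0`. The sketch below shows that comparison in a simplified form. It assumes plain dotted numeric versions and does not implement full PEP 440 (no pre-releases or local versions); `parse_version` and `satisfies_upper_bound` are hypothetical helpers, not part of pip or pip-tools.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted numeric version such as '3.7.1' into integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_upper_bound(version: str, bound: str) -> bool:
    """Return True if `version` is strictly below `bound`.

    Shorter versions are padded with zeros, so '4.0' compares as '4.0.0'.
    """
    v, b = parse_version(version), parse_version(bound)
    width = max(len(v), len(b))
    v += (0,) * (width - len(v))
    b += (0,) * (width - len(b))
    return v < b


if __name__ == "__main__":
    for candidate in ("3.7.1", "4.0.0"):
        print(candidate, satisfies_upper_bound(candidate, "4.0"))
```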
12 changes: 7 additions & 5 deletions requirements/dev.txt
@@ -4,8 +4,10 @@
#
# pip-compile requirements/dev.in
#
anyio==4.0.0
# via jupyter-server
anyio==3.7.1
# via
# -c requirements/constraints.in
# jupyter-server
appnope==0.1.3
# via
# ipykernel
@@ -42,7 +44,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -c requirements/test.txt
# requests
cffi==1.15.1
cffi==1.16.0
# via argon2-cffi-bindings
cfgv==3.4.0
# via pre-commit
@@ -151,7 +153,7 @@ jupyter-client==8.3.1
# qtconsole
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.3.1
jupyter-core==5.3.2
# via
# -c requirements/constraints.in
# ipykernel
@@ -393,7 +395,7 @@ urllib3==1.26.16
# requests
virtualenv==20.24.5
# via pre-commit
wcwidth==0.2.6
wcwidth==0.2.7
# via prompt-toolkit
webcolors==1.13
# via jsonschema
7 changes: 5 additions & 2 deletions requirements/extra-paddleocr.txt
@@ -33,7 +33,7 @@ cssselect==1.2.0
# via premailer
cssutils==2.7.1
# via premailer
cycler==0.11.0
cycler==0.12.0
# via matplotlib
cython==3.0.2
# via unstructured-paddleocr
@@ -112,9 +112,12 @@ numpy==1.24.4
# unstructured-paddleocr
# visualdl
opencv-contrib-python==4.8.0.76
# via unstructured-paddleocr
# via
# -c requirements/constraints.in
# unstructured-paddleocr
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# imgaug
# unstructured-paddleocr
openpyxl==3.1.2
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
pdfminer.six
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
9 changes: 5 additions & 4 deletions requirements/extra-pdf-image.txt
@@ -11,7 +11,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
@@ -24,7 +24,7 @@ contourpy==1.1.1
# via matplotlib
cryptography==41.0.4
# via pdfminer-six
cycler==0.11.0
cycler==0.12.0
# via matplotlib
effdet==0.4.1
# via layoutparser
@@ -95,6 +95,7 @@ onnxruntime==1.16.0
# via unstructured-inference
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# layoutparser
# unstructured-inference
packaging==23.1
@@ -213,7 +214,7 @@ tqdm==4.66.1
# huggingface-hub
# iopath
# transformers
transformers==4.33.2
transformers==4.33.3
# via unstructured-inference
typing-extensions==4.8.0
# via
@@ -224,7 +225,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
2 changes: 1 addition & 1 deletion requirements/huggingface.txt
@@ -97,7 +97,7 @@ tqdm==4.66.1
# huggingface-hub
# sacremoses
# transformers
transformers==4.33.2
transformers==4.33.3
# via -r requirements/huggingface.in
typing-extensions==4.8.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-airtable.txt
@@ -21,7 +21,7 @@ inflection==0.5.1
# via pyairtable
pyairtable==2.1.0.post1
# via -r requirements/ingest-airtable.in
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# pyairtable
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -30,7 +30,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# azure-datalake-store
# cryptography
2 changes: 1 addition & 1 deletion requirements/ingest-box.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
3 changes: 1 addition & 2 deletions requirements/ingest-gcs.txt
@@ -47,7 +47,7 @@ google-api-core==2.12.0
# via
# google-cloud-core
# google-cloud-storage
google-auth==2.23.0
google-auth==2.23.2
# via
# gcsfs
# google-api-core
@@ -107,7 +107,6 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
yarl==1.9.2
# via aiohttp
2 changes: 1 addition & 1 deletion requirements/ingest-github.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# cryptography
# pynacl
3 changes: 1 addition & 2 deletions requirements/ingest-google-drive.txt
@@ -19,7 +19,7 @@ google-api-core==2.12.0
# via google-api-python-client
google-api-python-client==2.101.0
# via -r requirements/ingest-google-drive.in
google-auth==2.23.0
google-auth==2.23.2
# via
# google-api-core
# google-api-python-client
@@ -63,5 +63,4 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
22 changes: 12 additions & 10 deletions requirements/ingest-notion.txt
@@ -4,33 +4,35 @@
#
# pip-compile requirements/ingest-notion.in
#
certifi==2023.7.22
anyio==3.7.1
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpx
charset-normalizer==3.2.0
# httpcore
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpcore
# httpx
h11==0.12.0
exceptiongroup==1.1.3
# via anyio
h11==0.14.0
# via httpcore
htmlbuilder==1.0.0
# via -r requirements/ingest-notion.in
httpcore==0.13.3
httpcore==0.18.0
# via httpx
httpx==0.20.0
httpx==0.25.0
# via notion-client
idna==3.4
# via
# -c requirements/base.txt
# anyio
# httpx
# rfc3986
notion-client==2.0.0
# via -r requirements/ingest-notion.in
rfc3986[idna2008]==1.5.0
# via httpx
sniffio==1.3.0
# via
# anyio
# httpcore
# httpx
2 changes: 1 addition & 1 deletion requirements/ingest-onedrive.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
17 changes: 15 additions & 2 deletions requirements/ingest-openai.txt
@@ -10,6 +10,10 @@ aiohttp==3.8.5
# openai
aiosignal==1.3.1
# via aiohttp
anyio==3.7.1
# via
# -c requirements/constraints.in
# langchain
async-timeout==4.0.3
# via
# aiohttp
@@ -30,16 +34,23 @@ dataclasses-json==0.6.1
# via
# -c requirements/base.txt
# langchain
exceptiongroup==1.1.3
# via anyio
frozenlist==1.4.0
# via
# aiohttp
# aiosignal
idna==3.4
# via
# -c requirements/base.txt
# anyio
# requests
# yarl
langchain==0.0.298
jsonpatch==1.33
# via langchain
jsonpointer==2.4
# via jsonpatch
langchain==0.0.304
# via -r requirements/ingest-openai.in
langsmith==0.0.41
# via langchain
@@ -69,7 +80,7 @@ packaging==23.1
# via
# -c requirements/base.txt
# marshmallow
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# langchain
@@ -87,6 +98,8 @@ requests==2.31.0
# langsmith
# openai
# tiktoken
sniffio==1.3.0
# via anyio
sqlalchemy==2.0.21
# via langchain
tenacity==8.2.3
2 changes: 1 addition & 1 deletion requirements/ingest-outlook.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-salesforce.txt
@@ -11,7 +11,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-sharepoint.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
4 changes: 2 additions & 2 deletions requirements/test.txt
@@ -74,7 +74,7 @@ pluggy==1.3.0
# via pytest
pycodestyle==2.11.0
# via flake8
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# -r requirements/test.in
@@ -113,7 +113,7 @@ types-click==7.1.8
# via -r requirements/test.in
types-markdown==3.4.2.10
# via -r requirements/test.in
types-requests==2.31.0.5
types-requests==2.31.0.6
# via -r requirements/test.in
types-tabulate==0.9.0.3
# via -r requirements/test.in
2 changes: 1 addition & 1 deletion scripts/elasticsearch-test-helpers/create-and-check-es.sh
@@ -9,7 +9,7 @@ docker run -d --rm -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -
echo "Waiting for Elasticsearch container to start..."
sleep 1

url="http://localhost:9200/_cluster/health"
url="http://localhost:9200/_cluster/health?wait_for_status=green&timeout=50s"
status_code=0
retry_count=0
max_retries=6
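
Reviewer note: the new URL asks Elasticsearch to hold the health request until the cluster reports `green` or the `50s` timeout passes, and the script still wraps the call in a bounded retry loop. The rest of the loop is not shown in this hunk, so the following is only a sketch of the same pattern in Python. The function names, the 5-second delay, and the choice to treat HTTP 200 as "ready" are assumptions, not taken from the script.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = (
    "http://localhost:9200/_cluster/health"
    "?wait_for_status=green&timeout=50s"
)


def http_status(url: str) -> int:
    """Return the HTTP status code for `url`, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=60) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Covers URLError, connection refused, and socket timeouts.
        return 0


def wait_for_cluster(
    url: str = HEALTH_URL,
    max_retries: int = 6,
    delay: float = 5.0,  # assumed pause between attempts
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it answers 200 or retries run out."""
    for attempt in range(max_retries):
        if fetch(url) == 200:
            return True
        if attempt < max_retries - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    print("cluster ready:", wait_for_cluster())
```

Passing `fetch` and `sleep` as parameters keeps the retry logic testable without a running container.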
2 changes: 1 addition & 1 deletion test_unstructured/partition/pdf-image/test_image.py
@@ -440,7 +440,7 @@ def test_partition_image_formats_languages_for_tesseract():
ocr_languages="jpn_vert",
ocr_mode="entire_page",
extract_tables=False,
model_name=None,
model_name="detectron2_onnx",
)

