chore: bump inference to 0.6.6 (#1563)
- bump `unstructured-inference` to `0.6.6`
- set the default model name for element detection to `detectron2_onnx` to keep the current behavior
- NOTE: the updated inference package uses yolox as the element detection model by default; this will be evaluated and enabled in a separate PR

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
3 people authored Sep 29, 2023
1 parent af7639e commit ad59a87
Showing 34 changed files with 333 additions and 3,809 deletions.
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,13 @@
## 0.10.19-dev0
## 0.10.19-dev1

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.

### Features

### Fixes


## 0.10.18

7 changes: 5 additions & 2 deletions requirements/constraints.in
@@ -39,5 +39,8 @@ matplotlib==3.7.2
# NOTE(crag) - pin to available pandas for python 3.8 (at least in CI)
fsspec==2023.9.1
pandas<2.0.4
# langchain limits this to 3.1.7
anyio==3.1.7
# langchain limits anyio to below 4.0
anyio<4.0
# pinned in unstructured paddleocr
opencv-python==4.8.0.76
opencv-contrib-python==4.8.0.76
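The relaxed `anyio<4.0` constraint above is an upper bound rather than an exact pin, which is consistent with the compiled files in this commit resolving to `anyio==3.7.1`. A minimal sketch of that comparison follows, assuming plain dotted numeric versions; the helper names are hypothetical, and real resolvers follow PEP 440, which also covers pre-releases and other suffixes that this sketch ignores.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted numeric version such as '3.7.1' into a tuple of ints.

    Assumption: only plain numeric segments are handled (no 'rc1', '.post1').
    """
    return tuple(int(part) for part in version.split("."))


def satisfies_upper_bound(version: str, bound: str) -> bool:
    """Return True when `version` is strictly below `bound` (the `<` operator)."""
    return parse_version(version) < parse_version(bound)


if __name__ == "__main__":
    # 3.7.1 is allowed by anyio<4.0, while 4.0.0 is not.
    print(satisfies_upper_bound("3.7.1", "4.0"))
    print(satisfies_upper_bound("4.0.0", "4.0"))
```

Tuple comparison keeps the sketch short: `(4, 0, 0)` is not less than `(4, 0)`, so `4.0.0` is rejected as expected.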
12 changes: 7 additions & 5 deletions requirements/dev.txt
@@ -4,8 +4,10 @@
#
# pip-compile requirements/dev.in
#
anyio==4.0.0
# via jupyter-server
anyio==3.7.1
# via
# -c requirements/constraints.in
# jupyter-server
appnope==0.1.3
# via
# ipykernel
@@ -42,7 +44,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -c requirements/test.txt
# requests
cffi==1.15.1
cffi==1.16.0
# via argon2-cffi-bindings
cfgv==3.4.0
# via pre-commit
@@ -151,7 +153,7 @@ jupyter-client==8.3.1
# qtconsole
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.3.1
jupyter-core==5.3.2
# via
# -c requirements/constraints.in
# ipykernel
@@ -393,7 +395,7 @@ urllib3==1.26.16
# requests
virtualenv==20.24.5
# via pre-commit
wcwidth==0.2.6
wcwidth==0.2.7
# via prompt-toolkit
webcolors==1.13
# via jsonschema
7 changes: 5 additions & 2 deletions requirements/extra-paddleocr.txt
@@ -33,7 +33,7 @@ cssselect==1.2.0
# via premailer
cssutils==2.7.1
# via premailer
cycler==0.11.0
cycler==0.12.0
# via matplotlib
cython==3.0.2
# via unstructured-paddleocr
@@ -112,9 +112,12 @@ numpy==1.24.4
# unstructured-paddleocr
# visualdl
opencv-contrib-python==4.8.0.76
# via unstructured-paddleocr
# via
# -c requirements/constraints.in
# unstructured-paddleocr
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# imgaug
# unstructured-paddleocr
openpyxl==3.1.2
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
pdfminer.six
# Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
9 changes: 5 additions & 4 deletions requirements/extra-pdf-image.txt
@@ -11,7 +11,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
@@ -24,7 +24,7 @@ contourpy==1.1.1
# via matplotlib
cryptography==41.0.4
# via pdfminer-six
cycler==0.11.0
cycler==0.12.0
# via matplotlib
effdet==0.4.1
# via layoutparser
@@ -95,6 +95,7 @@ onnxruntime==1.16.0
# via unstructured-inference
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# layoutparser
# unstructured-inference
packaging==23.1
@@ -213,7 +214,7 @@ tqdm==4.66.1
# huggingface-hub
# iopath
# transformers
transformers==4.33.2
transformers==4.33.3
# via unstructured-inference
typing-extensions==4.8.0
# via
@@ -224,7 +225,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
2 changes: 1 addition & 1 deletion requirements/huggingface.txt
@@ -97,7 +97,7 @@ tqdm==4.66.1
# huggingface-hub
# sacremoses
# transformers
transformers==4.33.2
transformers==4.33.3
# via -r requirements/huggingface.in
typing-extensions==4.8.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-airtable.txt
@@ -21,7 +21,7 @@ inflection==0.5.1
# via pyairtable
pyairtable==2.1.0.post1
# via -r requirements/ingest-airtable.in
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# pyairtable
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -30,7 +30,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# azure-datalake-store
# cryptography
2 changes: 1 addition & 1 deletion requirements/ingest-box.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
3 changes: 1 addition & 2 deletions requirements/ingest-gcs.txt
@@ -47,7 +47,7 @@ google-api-core==2.12.0
# via
# google-cloud-core
# google-cloud-storage
google-auth==2.23.0
google-auth==2.23.2
# via
# gcsfs
# google-api-core
@@ -107,7 +107,6 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
yarl==1.9.2
# via aiohttp
2 changes: 1 addition & 1 deletion requirements/ingest-github.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# cryptography
# pynacl
3 changes: 1 addition & 2 deletions requirements/ingest-google-drive.txt
@@ -19,7 +19,7 @@ google-api-core==2.12.0
# via google-api-python-client
google-api-python-client==2.101.0
# via -r requirements/ingest-google-drive.in
google-auth==2.23.0
google-auth==2.23.2
# via
# google-api-core
# google-api-python-client
@@ -63,5 +63,4 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
22 changes: 12 additions & 10 deletions requirements/ingest-notion.txt
@@ -4,33 +4,35 @@
#
# pip-compile requirements/ingest-notion.in
#
certifi==2023.7.22
anyio==3.7.1
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpx
charset-normalizer==3.2.0
# httpcore
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpcore
# httpx
h11==0.12.0
exceptiongroup==1.1.3
# via anyio
h11==0.14.0
# via httpcore
htmlbuilder==1.0.0
# via -r requirements/ingest-notion.in
httpcore==0.13.3
httpcore==0.18.0
# via httpx
httpx==0.20.0
httpx==0.25.0
# via notion-client
idna==3.4
# via
# -c requirements/base.txt
# anyio
# httpx
# rfc3986
notion-client==2.0.0
# via -r requirements/ingest-notion.in
rfc3986[idna2008]==1.5.0
# via httpx
sniffio==1.3.0
# via
# anyio
# httpcore
# httpx
2 changes: 1 addition & 1 deletion requirements/ingest-onedrive.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
17 changes: 15 additions & 2 deletions requirements/ingest-openai.txt
@@ -10,6 +10,10 @@ aiohttp==3.8.5
# openai
aiosignal==1.3.1
# via aiohttp
anyio==3.7.1
# via
# -c requirements/constraints.in
# langchain
async-timeout==4.0.3
# via
# aiohttp
@@ -30,16 +34,23 @@ dataclasses-json==0.6.1
# via
# -c requirements/base.txt
# langchain
exceptiongroup==1.1.3
# via anyio
frozenlist==1.4.0
# via
# aiohttp
# aiosignal
idna==3.4
# via
# -c requirements/base.txt
# anyio
# requests
# yarl
langchain==0.0.298
jsonpatch==1.33
# via langchain
jsonpointer==2.4
# via jsonpatch
langchain==0.0.304
# via -r requirements/ingest-openai.in
langsmith==0.0.41
# via langchain
@@ -69,7 +80,7 @@ packaging==23.1
# via
# -c requirements/base.txt
# marshmallow
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# langchain
@@ -87,6 +98,8 @@ requests==2.31.0
# langsmith
# openai
# tiktoken
sniffio==1.3.0
# via anyio
sqlalchemy==2.0.21
# via langchain
tenacity==8.2.3
2 changes: 1 addition & 1 deletion requirements/ingest-outlook.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-salesforce.txt
@@ -11,7 +11,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-sharepoint.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
4 changes: 2 additions & 2 deletions requirements/test.txt
@@ -74,7 +74,7 @@ pluggy==1.3.0
# via pytest
pycodestyle==2.11.0
# via flake8
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# -r requirements/test.in
@@ -113,7 +113,7 @@ types-click==7.1.8
# via -r requirements/test.in
types-markdown==3.4.2.10
# via -r requirements/test.in
types-requests==2.31.0.5
types-requests==2.31.0.6
# via -r requirements/test.in
types-tabulate==0.9.0.3
# via -r requirements/test.in
2 changes: 1 addition & 1 deletion scripts/elasticsearch-test-helpers/create-and-check-es.sh
@@ -9,7 +9,7 @@ docker run -d --rm -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -
echo "Waiting for Elasticsearch container to start..."
sleep 1

url="http://localhost:9200/_cluster/health"
url="http://localhost:9200/_cluster/health?wait_for_status=green&timeout=50s"
status_code=0
retry_count=0
max_retries=6
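The new health-check URL adds the `wait_for_status=green` and `timeout=50s` query parameters, so the Elasticsearch cluster health request blocks until the cluster reports green or the timeout elapses instead of returning immediately. A minimal sketch of how such a URL can be assembled is shown here; the function name and its defaults are assumptions for illustration and are not part of the script in this commit.

```python
from urllib.parse import urlencode


def cluster_health_url(
    base_url: str = "http://localhost:9200",
    wait_for_status: str = "green",
    timeout: str = "50s",
) -> str:
    """Build a cluster health URL that waits for the given status.

    Hypothetical helper: it mirrors the URL string used in the shell script
    rather than any API of the script itself.
    """
    query = urlencode({"wait_for_status": wait_for_status, "timeout": timeout})
    return f"{base_url}/_cluster/health?{query}"


if __name__ == "__main__":
    print(cluster_health_url())
```

Letting the server wait for the status is one way to cut down on client-side polling; the retry loop in the script can still cover the case where the container is not yet accepting connections.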
2 changes: 1 addition & 1 deletion test_unstructured/partition/pdf-image/test_image.py
@@ -440,7 +440,7 @@ def test_partition_image_formats_languages_for_tesseract():
ocr_languages="jpn_vert",
ocr_mode="entire_page",
extract_tables=False,
model_name=None,
model_name="detectron2_onnx",
)
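The hunk above replaces `model_name=None` with an explicit `"detectron2_onnx"` in the test's expected call, matching the commit note about keeping the current element detection behavior while the upstream default moves to yolox. One way to express that kind of pinned default is sketched below; the constant and helper are hypothetical and do not come from `unstructured` or `unstructured-inference`.

```python
from typing import Optional

# Assumption: illustrative constant only, named here for the sketch.
DEFAULT_ELEMENT_DETECTION_MODEL = "detectron2_onnx"


def resolve_model_name(model_name: Optional[str] = None) -> str:
    """Return the caller's model name, or the pinned default when none is given."""
    return model_name if model_name else DEFAULT_ELEMENT_DETECTION_MODEL


if __name__ == "__main__":
    print(resolve_model_name())         # falls back to the pinned default
    print(resolve_model_name("yolox"))  # an explicit choice is passed through
```

Resolving the name in one place means a later switch of the default would touch a single constant rather than every call site.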

