Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan authored Sep 29, 2023
2 parents 1c6f199 + ad59a87 commit 91c2ac7
Showing 117 changed files with 4,507 additions and 6,165 deletions.
26 changes: 22 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,33 @@
## 0.10.17-dev10
## 0.10.19-dev1

### Enhancements

* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.

### Features

### Fixes


## 0.10.18

### Enhancements

* **Better detection of natural reading order in images and PDFs** The elements returned by partition better reflect natural reading order in some cases, particularly in complicated multi-column layouts, leading to better chunking and retrieval for downstream applications. Achieved by improving the `xy-cut` sorting to preprocess bboxes, shrinking all bounding boxes by 90% along the x and y axes (still centered around the same center point), which allows projection lines to be drawn where layout bboxes previously overlapped.
* **Improves `partition_xml` to be faster and more memory efficient when partitioning large XML files** The new behavior is to partition iteratively to prevent loading the entire XML tree into memory at once in most use cases.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, Slack, and DeltaTable connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Add functionality to save embedded images in PDFs separately as images** This allows users to save embedded images in PDFs as separate image files, given a directory path. The saved image path is written to the metadata for the Image element. Downstream applications may benefit by providing users with image links from relevant "hits."
* **Azure Cognitive Search destination connector** New Azure Cognitive Search destination connector added to ingest CLI. Users may now use `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to an Azure Cognitive Search index.
* **Improves Salesforce partitioning** Partitions Salesforce data as XML instead of text for improved detail and flexibility. Partitions `htmlbody` instead of `textbody` for Salesforce emails. Importance: Allows all Salesforce fields to be ingested and gives Salesforce emails more detailed partitioning.
* **Add document level language detection functionality.** Introduces the "auto" default for the `languages` param, which then detects the languages present in the document using the `langdetect` package. Adds the document languages as ISO 639-3 codes to the element metadata. Implemented only for the `partition_text` function to start.
* **PPTX partitioner refactored in preparation for enhancement.** Behavior should be unchanged except that shapes enclosed in a group-shape are now included, as many levels deep as required (a group-shape can itself contain a group-shape).
* **Embeddings support for the SharePoint SourceConnector via unstructured-ingest CLI** The SharePoint connector can now optionally create embeddings from the elements it pulls out during partition and upload those embeddings to an Azure Cognitive Search index.
* **Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents** Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number).
* **Chunking support for the SharePoint SourceConnector via unstructured-ingest CLI** The SharePoint connector can now optionally chunk the elements pulled out during partition via the chunking unstructured brick. This can be used as a stage before creating embeddings.
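The bounding-box preprocessing behind the improved `xy-cut` sorting can be sketched as follows. This is an illustrative helper, not the library's actual implementation; it assumes "shrinking by 90%" means scaling each box to 90% of its width and height around its own center:

```python
def shrink_bbox(x1, y1, x2, y2, factor=0.9):
    """Scale a bounding box to `factor` of its size around its center.

    Hypothetical helper illustrating the xy-cut preprocessing step.
    """
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Two boxes that overlap slightly in x no longer overlap after
# shrinking, so a vertical projection cut can pass between them.
a = shrink_bbox(0, 0, 10, 10)   # -> (0.5, 0.5, 9.5, 9.5)
b = shrink_bbox(9, 0, 20, 10)
print(a[2] < b[0])  # True: the shrunken boxes are separated
```

Because each box keeps its center, relative ordering is preserved while slight overlaps between neighboring layout boxes disappear, letting the sort draw projection lines between them.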

### Features

* **Adds `links` metadata in `partition_pdf` for `fast` strategy.** Problem: PDF files contain rich information and hyperlinks that Unstructured did not capture earlier. Feature: `partition_pdf` can now capture embedded links within the file along with their associated text and page numbers. Importance: Providing depth in extracted elements gives users a better understanding and richer context of documents. It also enables users to map to other elements within the document when a hyperlink refers to an internal location.
* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embeddings-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings. Importance: The ability to embed documents or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval augmented generation.
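The embedding flow described above can be sketched roughly as follows, using simplified stand-in classes (the real `unstructured` element and encoder APIs differ; `FakeEncoder` is a hypothetical toy that returns deterministic dummy vectors instead of calling an embedding model):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Simplified stand-in for unstructured's Element class."""
    text: str
    embeddings: Optional[List[float]] = None


class FakeEncoder:
    """Hypothetical toy encoder; a real one would call an embedding model."""

    def embed_query(self, text: str) -> List[float]:
        # Deterministic dummy vector derived from the text.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    def embed_elements(self, elements: List[Element]) -> List[Element]:
        # Attach an embeddings vector to each element, mirroring the
        # "updated list of Elements with the embeddings property" idea.
        for el in elements:
            el.embeddings = self.embed_query(el.text)
        return elements


docs = FakeEncoder().embed_elements([Element("hello"), Element("world")])
print(all(el.embeddings is not None for el in docs))  # True
```

Once elements carry embeddings, downstream applications can compare them against an embedded query string for search or retrieval.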

### Fixes
@@ -22,9 +37,12 @@
* **Fixes SharePoint connector failures if any document has an unsupported filetype** Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's `__post_init__`. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs such that the error is logged but all processable documents are still instantiated as IngestDocs and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
* **Fixes Sharepoint connector server_path issue** Problem: Server path for the Sharepoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: changes formatting of remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from Sharepoint Sites.
* **Fixes badly initialized Formula** Problem: YoloX can detect new types of elements; when loading a document that contains formulas, a new element of the Formula class should be generated, but the Formula class inherited from Element instead of Text. Fix: Change the parent class of Formula to Text so the element is created with the correct class, allowing the document to be loaded. Importance: Crucial to be able to load documents that contain formulas.
* **Fixes Sphinx errors.** Fixes errors when running Sphinx `make html` and installs library to suppress warnings.
* **Fixes a metadata backwards compatibility error** Problem: When calling `partition_via_api`, the hosted api may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: Crucial to avoid breaking changes when adding fields.
* **Fixes issue with Discord connector when a channel returns `None`** Problem: Getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: property `jump_url` is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel.
* **Fixes occasional SIGABRT when writing a table with `deltalake` on Linux** Problem: occasionally on Linux, ingest can throw a SIGABRT when writing a `deltalake` table even though the table was written correctly. Fix: run the writing function in a `Process` to ensure it executes fully before returning to the main process. Importance: Improves stability of connectors using `deltalake`
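The metadata backwards-compatibility fix can be sketched like this, with a simplified stand-in dataclass (not the real `ElementMetadata`): unknown keys coming from a newer API schema are dropped before instantiation instead of causing `__init__()` to raise.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ElementMetadata:
    """Simplified stand-in for unstructured's ElementMetadata."""
    filename: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_json(cls, data: dict) -> "ElementMetadata":
        known = {f.name for f in fields(cls)}
        # Drop keys a newer hosted-API schema may include that this
        # older dataclass does not define, instead of raising TypeError.
        return cls(**{k: v for k, v in data.items() if k in known})


payload = {"filename": "a.pdf", "page_number": 3, "new_api_field": "x"}
meta = ElementMetadata.from_json(payload)
print(meta)  # ElementMetadata(filename='a.pdf', page_number=3)
```

Filtering against `dataclasses.fields()` rather than a hard-coded list keeps the guard correct as fields are added locally.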


## 0.10.16
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:experimental
FROM quay.io/unstructured-io/base-images:rocky9.2-4@sha256:b1063ffbf08c3037ee211620f011dd05bd2da9287c6e6a3473b15c1597724e4b as base
FROM quay.io/unstructured-io/base-images:rocky9.2-5@sha256:1721c3b0711e4e90587e3b4917f1b616e4603ddf5b4986bfaa68d02d82a13aba as base

# NOTE(crag): NB_USER ARG for mybinder.org compat:
# https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
Binary file added example-docs/embedded-link.pdf
Binary file added example-docs/emphasis-text.pdf
4 changes: 1 addition & 3 deletions requirements/base.in
@@ -11,6 +11,4 @@ emoji
dataclasses-json
python-iso639
langdetect
# (Trevor): This is a simple hello world package that is used to track
# download count for this package using scarf.
https://packages.unstructured.io/scarf.tgz
numpy
6 changes: 4 additions & 2 deletions requirements/base.txt
@@ -36,6 +36,10 @@ mypy-extensions==1.0.0
# via typing-inspect
nltk==3.8.1
# via -r requirements/base.in
numpy==1.24.4
# via
# -c requirements/constraints.in
# -r requirements/base.in
packaging==23.1
# via marshmallow
python-iso639==2023.6.15
@@ -46,8 +50,6 @@ regex==2023.8.8
# via nltk
requests==2.31.0
# via -r requirements/base.in
scarf @ https://packages.unstructured.io/scarf.tgz
# via -r requirements/base.in
six==1.16.0
# via langdetect
soupsieve==2.5
7 changes: 5 additions & 2 deletions requirements/constraints.in
@@ -39,5 +39,8 @@ matplotlib==3.7.2
# NOTE(crag) - pin to available pandas for python 3.8 (at least in CI)
fsspec==2023.9.1
pandas<2.0.4
# langchain limits this to 3.1.7
anyio==3.1.7
# langchain limits anyio to below 4.0
anyio<4.0
# pinned in unstructured paddleocr
opencv-python==4.8.0.76
opencv-contrib-python==4.8.0.76
12 changes: 7 additions & 5 deletions requirements/dev.txt
@@ -4,8 +4,10 @@
#
# pip-compile requirements/dev.in
#
anyio==4.0.0
# via jupyter-server
anyio==3.7.1
# via
# -c requirements/constraints.in
# jupyter-server
appnope==0.1.3
# via
# ipykernel
@@ -42,7 +44,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -c requirements/test.txt
# requests
cffi==1.15.1
cffi==1.16.0
# via argon2-cffi-bindings
cfgv==3.4.0
# via pre-commit
@@ -151,7 +153,7 @@ jupyter-client==8.3.1
# qtconsole
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.3.1
jupyter-core==5.3.2
# via
# -c requirements/constraints.in
# ipykernel
@@ -393,7 +395,7 @@ urllib3==1.26.16
# requests
virtualenv==20.24.5
# via pre-commit
wcwidth==0.2.6
wcwidth==0.2.7
# via prompt-toolkit
webcolors==1.13
# via jsonschema
1 change: 1 addition & 0 deletions requirements/extra-csv.txt
@@ -6,6 +6,7 @@
#
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# pandas
pandas==2.0.3
8 changes: 6 additions & 2 deletions requirements/extra-paddleocr.txt
@@ -33,7 +33,7 @@ cssselect==1.2.0
# via premailer
cssutils==2.7.1
# via premailer
cycler==0.11.0
cycler==0.12.0
# via matplotlib
cython==3.0.2
# via unstructured-paddleocr
@@ -95,6 +95,7 @@ networkx==3.1
# via scikit-image
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# contourpy
# imageio
@@ -111,9 +112,12 @@ numpy==1.24.4
# unstructured-paddleocr
# visualdl
opencv-contrib-python==4.8.0.76
# via unstructured-paddleocr
# via
# -c requirements/constraints.in
# unstructured-paddleocr
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# imgaug
# unstructured-paddleocr
openpyxl==3.1.2
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
pdfminer.six
# Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
10 changes: 6 additions & 4 deletions requirements/extra-pdf-image.txt
@@ -11,7 +11,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
@@ -24,7 +24,7 @@ contourpy==1.1.1
# via matplotlib
cryptography==41.0.4
# via pdfminer-six
cycler==0.11.0
cycler==0.12.0
# via matplotlib
effdet==0.4.1
# via layoutparser
@@ -74,6 +74,7 @@ networkx==3.1
# via torch
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# contourpy
# layoutparser
@@ -94,6 +95,7 @@ onnxruntime==1.16.0
# via unstructured-inference
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# layoutparser
# unstructured-inference
packaging==23.1
@@ -212,7 +214,7 @@ tqdm==4.66.1
# huggingface-hub
# iopath
# transformers
transformers==4.33.2
transformers==4.33.3
# via unstructured-inference
typing-extensions==4.8.0
# via
@@ -223,7 +225,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
1 change: 1 addition & 0 deletions requirements/extra-xlsx.txt
@@ -8,6 +8,7 @@ et-xmlfile==1.1.0
# via openpyxl
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# pandas
openpyxl==3.1.2
3 changes: 2 additions & 1 deletion requirements/huggingface.txt
@@ -50,6 +50,7 @@ networkx==3.1
# via torch
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# transformers
packaging==23.1
@@ -96,7 +97,7 @@ tqdm==4.66.1
# huggingface-hub
# sacremoses
# transformers
transformers==4.33.2
transformers==4.33.3
# via -r requirements/huggingface.in
typing-extensions==4.8.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-airtable.txt
@@ -21,7 +21,7 @@ inflection==0.5.1
# via pyairtable
pyairtable==2.1.0.post1
# via -r requirements/ingest-airtable.in
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# pyairtable
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -30,7 +30,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# azure-datalake-store
# cryptography
2 changes: 1 addition & 1 deletion requirements/ingest-box.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
1 change: 1 addition & 0 deletions requirements/ingest-delta-table.txt
@@ -12,6 +12,7 @@ fsspec==2023.9.1
# -r requirements/ingest-delta-table.in
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# pyarrow
pyarrow==12.0.0
3 changes: 1 addition & 2 deletions requirements/ingest-gcs.txt
@@ -47,7 +47,7 @@ google-api-core==2.12.0
# via
# google-cloud-core
# google-cloud-storage
google-auth==2.23.0
google-auth==2.23.2
# via
# gcsfs
# google-api-core
@@ -107,7 +107,6 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
yarl==1.9.2
# via aiohttp
2 changes: 1 addition & 1 deletion requirements/ingest-github.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# cryptography
# pynacl
3 changes: 1 addition & 2 deletions requirements/ingest-google-drive.txt
@@ -19,7 +19,7 @@ google-api-core==2.12.0
# via google-api-python-client
google-api-python-client==2.101.0
# via -r requirements/ingest-google-drive.in
google-auth==2.23.0
google-auth==2.23.2
# via
# google-api-core
# google-api-python-client
@@ -63,5 +63,4 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
22 changes: 12 additions & 10 deletions requirements/ingest-notion.txt
@@ -4,33 +4,35 @@
#
# pip-compile requirements/ingest-notion.in
#
certifi==2023.7.22
anyio==3.7.1
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpx
charset-normalizer==3.2.0
# httpcore
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpcore
# httpx
h11==0.12.0
exceptiongroup==1.1.3
# via anyio
h11==0.14.0
# via httpcore
htmlbuilder==1.0.0
# via -r requirements/ingest-notion.in
httpcore==0.13.3
httpcore==0.18.0
# via httpx
httpx==0.20.0
httpx==0.25.0
# via notion-client
idna==3.4
# via
# -c requirements/base.txt
# anyio
# httpx
# rfc3986
notion-client==2.0.0
# via -r requirements/ingest-notion.in
rfc3986[idna2008]==1.5.0
# via httpx
sniffio==1.3.0
# via
# anyio
# httpcore
# httpx
2 changes: 1 addition & 1 deletion requirements/ingest-onedrive.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
