Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan authored Sep 29, 2023
2 parents 1c6f199 + ad59a87 commit 91c2ac7
Showing 117 changed files with 4,507 additions and 6,165 deletions.
26 changes: 22 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,33 @@
## 0.10.17-dev10
## 0.10.19-dev1

### Enhancements

* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.

### Features

### Fixes


## 0.10.18

### Enhancements

* **Better detection of natural reading order in images and PDFs** The elements returned by partition better reflect natural reading order in some cases, particularly in complicated multi-column layouts, leading to better chunking and retrieval for downstream applications. Achieved by improving the `xy-cut` sorting to preprocess bboxes, shrinking all bounding boxes by 90% along the x and y axes (still centered around the same center point), which allows projection lines to be drawn where layout bboxes previously overlapped.
* **Improves `partition_xml` to be faster and more memory efficient when partitioning large XML files** The new behavior is to partition iteratively to prevent loading the entire XML tree into memory at once in most use cases.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, Slack, and DeltaTable connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Add functionality to save embedded images in PDFs separately as images** This allows users to save embedded images in PDFs as separate image files, given a directory path. The saved image path is written to the metadata for the Image element. Downstream applications may benefit by providing users with image links from relevant "hits."
* **Azure Cognitive Search destination connector** New Azure Cognitive Search destination connector added to ingest CLI. Users may now use `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to an Azure Cognitive Search index.
* **Improves Salesforce partitioning** Partitions Salesforce data as XML instead of text for improved detail and flexibility. Partitions `htmlbody` instead of `textbody` for Salesforce emails. Importance: Allows all Salesforce fields to be ingested and gives Salesforce emails more detailed partitioning.
* **Add document level language detection functionality.** Introduces the "auto" default for the `languages` param, which then detects the languages present in the document using the `langdetect` package. Adds the document languages as ISO 639-3 codes to the element metadata. Implemented only for the `partition_text` function to start.
* **PPTX partitioner refactored in preparation for enhancement.** Behavior should be unchanged except that shapes enclosed in a group-shape are now included, as many levels deep as required (a group-shape can itself contain a group-shape).
* **Embeddings support for the SharePoint SourceConnector via unstructured-ingest CLI** The SharePoint connector can now optionally create embeddings from the elements it pulls out during partition and upload those embeddings to an Azure Cognitive Search index.
* **Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents** Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number).
* **Chunking support for the SharePoint SourceConnector via unstructured-ingest CLI** The SharePoint connector can now optionally chunk the elements pulled out during partition via the chunking unstructured brick. This can be used as a stage before creating embeddings.
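The bounding-box preprocessing behind the improved `xy-cut` sorting can be sketched as follows. This is an illustrative helper, not the library's actual implementation; it assumes "shrinking by 90%" means scaling each box to 90% of its width and height around its own center:

```python
def shrink_bbox(x1, y1, x2, y2, factor=0.9):
    """Scale a bounding box to `factor` of its size around its center.

    Hypothetical helper illustrating the xy-cut preprocessing step.
    """
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Two boxes that overlap slightly in x no longer overlap after
# shrinking, so a vertical projection cut can pass between them.
a = shrink_bbox(0, 0, 10, 10)   # -> (0.5, 0.5, 9.5, 9.5)
b = shrink_bbox(9, 0, 20, 10)
print(a[2] < b[0])  # True: the shrunken boxes are separated
```

Because each box keeps its center, relative ordering is preserved while slight overlaps between neighboring layout boxes disappear, letting the sort draw projection lines between them.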

### Features

* **Adds `links` metadata in `partition_pdf` for `fast` strategy.** Problem: PDF files contain rich information and hyperlinks that Unstructured did not capture earlier. Feature: `partition_pdf` can now capture embedded links within the file along with their associated text and page numbers. Importance: Providing depth in extracted elements gives users a better understanding and richer context of documents. It also enables users to map to other elements within the document when a hyperlink refers to an internal location.
* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embeddings-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings. Importance: The ability to embed documents or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval augmented generation.
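The embedding flow described above can be sketched roughly as follows, using simplified stand-in classes (the real `unstructured` element and encoder APIs differ; `FakeEncoder` is a hypothetical toy that returns deterministic dummy vectors instead of calling an embedding model):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Simplified stand-in for unstructured's Element class."""
    text: str
    embeddings: Optional[List[float]] = None


class FakeEncoder:
    """Hypothetical toy encoder; a real one would call an embedding model."""

    def embed_query(self, text: str) -> List[float]:
        # Deterministic dummy vector derived from the text.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    def embed_elements(self, elements: List[Element]) -> List[Element]:
        # Attach an embeddings vector to each element, mirroring the
        # "updated list of Elements with the embeddings property" idea.
        for el in elements:
            el.embeddings = self.embed_query(el.text)
        return elements


docs = FakeEncoder().embed_elements([Element("hello"), Element("world")])
print(all(el.embeddings is not None for el in docs))  # True
```

Once elements carry embeddings, downstream applications can compare them against an embedded query string for search or retrieval.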

### Fixes
@@ -22,9 +37,12 @@
* **Fixes SharePoint connector failures if any document has an unsupported filetype** Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's `__post_init__`. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs such that the error is logged but all processable documents are still instantiated as IngestDocs and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
* **Fixes Sharepoint connector server_path issue** Problem: Server path for the Sharepoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: changes formatting of remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from Sharepoint Sites.
* **Fixes badly initialized Formula** Problem: YoloX can detect new types of elements; when loading a document that contains formulas, a new element of the Formula class should be generated, but the Formula class inherited from Element instead of Text. Fix: Change the parent class of Formula to Text so the element is created with the correct class, allowing the document to be loaded. Importance: Crucial to be able to load documents that contain formulas.
* **Fixes Sphinx errors.** Fixes errors when running Sphinx `make html` and installs library to suppress warnings.
* **Fixes a metadata backwards compatibility error** Problem: When calling `partition_via_api`, the hosted api may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: Crucial to avoid breaking changes when adding fields.
* **Fixes issue with Discord connector when a channel returns `None`** Problem: Getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: property `jump_url` is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel.
* **Fixes occasional SIGABRT when writing a table with `deltalake` on Linux** Problem: occasionally on Linux, ingest can throw a SIGABRT when writing a `deltalake` table even though the table was written correctly. Fix: run the writing function in a `Process` to ensure it executes fully before returning to the main process. Importance: Improves stability of connectors using `deltalake`
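The metadata backwards-compatibility fix can be sketched like this, with a simplified stand-in dataclass (not the real `ElementMetadata`): unknown keys coming from a newer API schema are dropped before instantiation instead of causing `__init__()` to raise.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ElementMetadata:
    """Simplified stand-in for unstructured's ElementMetadata."""
    filename: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_json(cls, data: dict) -> "ElementMetadata":
        known = {f.name for f in fields(cls)}
        # Drop keys a newer hosted-API schema may include that this
        # older dataclass does not define, instead of raising TypeError.
        return cls(**{k: v for k, v in data.items() if k in known})


payload = {"filename": "a.pdf", "page_number": 3, "new_api_field": "x"}
meta = ElementMetadata.from_json(payload)
print(meta)  # ElementMetadata(filename='a.pdf', page_number=3)
```

Filtering against `dataclasses.fields()` rather than a hard-coded list keeps the guard correct as fields are added locally.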


## 0.10.16
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:experimental
FROM quay.io/unstructured-io/base-images:rocky9.2-4@sha256:b1063ffbf08c3037ee211620f011dd05bd2da9287c6e6a3473b15c1597724e4b as base
FROM quay.io/unstructured-io/base-images:rocky9.2-5@sha256:1721c3b0711e4e90587e3b4917f1b616e4603ddf5b4986bfaa68d02d82a13aba as base

# NOTE(crag): NB_USER ARG for mybinder.org compat:
# https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
Binary file added example-docs/embedded-link.pdf
Binary file added example-docs/emphasis-text.pdf
4 changes: 1 addition & 3 deletions requirements/base.in
@@ -11,6 +11,4 @@ emoji
dataclasses-json
python-iso639
langdetect
# (Trevor): This is a simple hello world package that is used to track
# download count for this package using scarf.
https://packages.unstructured.io/scarf.tgz
numpy
6 changes: 4 additions & 2 deletions requirements/base.txt
@@ -36,6 +36,10 @@ mypy-extensions==1.0.0
# via typing-inspect
nltk==3.8.1
# via -r requirements/base.in
numpy==1.24.4
# via
# -c requirements/constraints.in
# -r requirements/base.in
packaging==23.1
# via marshmallow
python-iso639==2023.6.15
@@ -46,8 +50,6 @@ regex==2023.8.8
# via nltk
requests==2.31.0
# via -r requirements/base.in
scarf @ https://packages.unstructured.io/scarf.tgz
# via -r requirements/base.in
six==1.16.0
# via langdetect
soupsieve==2.5
7 changes: 5 additions & 2 deletions requirements/constraints.in
@@ -39,5 +39,8 @@ matplotlib==3.7.2
# NOTE(crag) - pin to available pandas for python 3.8 (at least in CI)
fsspec==2023.9.1
pandas<2.0.4
# langchain limits this to 3.1.7
anyio==3.1.7
# langchain limits anyio to below 4.0
anyio<4.0
# pinned in unstructured paddleocr
opencv-python==4.8.0.76
opencv-contrib-python==4.8.0.76
12 changes: 7 additions & 5 deletions requirements/dev.txt
@@ -4,8 +4,10 @@
#
# pip-compile requirements/dev.in
#
anyio==4.0.0
# via jupyter-server
anyio==3.7.1
# via
# -c requirements/constraints.in
# jupyter-server
appnope==0.1.3
# via
# ipykernel
@@ -42,7 +44,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -c requirements/test.txt
# requests
cffi==1.15.1
cffi==1.16.0
# via argon2-cffi-bindings
cfgv==3.4.0
# via pre-commit
@@ -151,7 +153,7 @@ jupyter-client==8.3.1
# qtconsole
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.3.1
jupyter-core==5.3.2
# via
# -c requirements/constraints.in
# ipykernel
@@ -393,7 +395,7 @@ urllib3==1.26.16
# requests
virtualenv==20.24.5
# via pre-commit
wcwidth==0.2.6
wcwidth==0.2.7
# via prompt-toolkit
webcolors==1.13
# via jsonschema
1 change: 1 addition & 0 deletions requirements/extra-csv.txt
@@ -6,6 +6,7 @@
#
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# pandas
pandas==2.0.3
8 changes: 6 additions & 2 deletions requirements/extra-paddleocr.txt
@@ -33,7 +33,7 @@ cssselect==1.2.0
# via premailer
cssutils==2.7.1
# via premailer
cycler==0.11.0
cycler==0.12.0
# via matplotlib
cython==3.0.2
# via unstructured-paddleocr
@@ -95,6 +95,7 @@ networkx==3.1
# via scikit-image
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# contourpy
# imageio
@@ -111,9 +112,12 @@ numpy==1.24.4
# unstructured-paddleocr
# visualdl
opencv-contrib-python==4.8.0.76
# via unstructured-paddleocr
# via
# -c requirements/constraints.in
# unstructured-paddleocr
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# imgaug
# unstructured-paddleocr
openpyxl==3.1.2
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
pdfminer.six
# Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
10 changes: 6 additions & 4 deletions requirements/extra-pdf-image.txt
@@ -11,7 +11,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
@@ -24,7 +24,7 @@ contourpy==1.1.1
# via matplotlib
cryptography==41.0.4
# via pdfminer-six
cycler==0.11.0
cycler==0.12.0
# via matplotlib
effdet==0.4.1
# via layoutparser
@@ -74,6 +74,7 @@ networkx==3.1
# via torch
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# contourpy
# layoutparser
@@ -94,6 +95,7 @@ onnxruntime==1.16.0
# via unstructured-inference
opencv-python==4.8.0.76
# via
# -c requirements/constraints.in
# layoutparser
# unstructured-inference
packaging==23.1
@@ -212,7 +214,7 @@ tqdm==4.66.1
# huggingface-hub
# iopath
# transformers
transformers==4.33.2
transformers==4.33.3
# via unstructured-inference
typing-extensions==4.8.0
# via
@@ -223,7 +225,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.5.31
unstructured-inference==0.6.6
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
1 change: 1 addition & 0 deletions requirements/extra-xlsx.txt
@@ -8,6 +8,7 @@ et-xmlfile==1.1.0
# via openpyxl
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# pandas
openpyxl==3.1.2
3 changes: 2 additions & 1 deletion requirements/huggingface.txt
@@ -50,6 +50,7 @@ networkx==3.1
# via torch
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# transformers
packaging==23.1
@@ -96,7 +97,7 @@ tqdm==4.66.1
# huggingface-hub
# sacremoses
# transformers
transformers==4.33.2
transformers==4.33.3
# via -r requirements/huggingface.in
typing-extensions==4.8.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-airtable.txt
@@ -21,7 +21,7 @@ inflection==0.5.1
# via pyairtable
pyairtable==2.1.0.post1
# via -r requirements/ingest-airtable.in
pydantic==1.10.12
pydantic==1.10.13
# via
# -c requirements/constraints.in
# pyairtable
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -30,7 +30,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# azure-datalake-store
# cryptography
2 changes: 1 addition & 1 deletion requirements/ingest-box.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
1 change: 1 addition & 0 deletions requirements/ingest-delta-table.txt
@@ -12,6 +12,7 @@ fsspec==2023.9.1
# -r requirements/ingest-delta-table.in
numpy==1.24.4
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# pyarrow
pyarrow==12.0.0
3 changes: 1 addition & 2 deletions requirements/ingest-gcs.txt
@@ -47,7 +47,7 @@ google-api-core==2.12.0
# via
# google-cloud-core
# google-cloud-storage
google-auth==2.23.0
google-auth==2.23.2
# via
# gcsfs
# google-api-core
@@ -107,7 +107,6 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
yarl==1.9.2
# via aiohttp
2 changes: 1 addition & 1 deletion requirements/ingest-github.txt
@@ -9,7 +9,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via
# cryptography
# pynacl
3 changes: 1 addition & 2 deletions requirements/ingest-google-drive.txt
@@ -19,7 +19,7 @@ google-api-core==2.12.0
# via google-api-python-client
google-api-python-client==2.101.0
# via -r requirements/ingest-google-drive.in
google-auth==2.23.0
google-auth==2.23.2
# via
# google-api-core
# google-api-python-client
@@ -63,5 +63,4 @@ urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# google-auth
# requests
22 changes: 12 additions & 10 deletions requirements/ingest-notion.txt
@@ -4,33 +4,35 @@
#
# pip-compile requirements/ingest-notion.in
#
certifi==2023.7.22
anyio==3.7.1
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpx
charset-normalizer==3.2.0
# httpcore
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# httpcore
# httpx
h11==0.12.0
exceptiongroup==1.1.3
# via anyio
h11==0.14.0
# via httpcore
htmlbuilder==1.0.0
# via -r requirements/ingest-notion.in
httpcore==0.13.3
httpcore==0.18.0
# via httpx
httpx==0.20.0
httpx==0.25.0
# via notion-client
idna==3.4
# via
# -c requirements/base.txt
# anyio
# httpx
# rfc3986
notion-client==2.0.0
# via -r requirements/ingest-notion.in
rfc3986[idna2008]==1.5.0
# via httpx
sniffio==1.3.0
# via
# anyio
# httpcore
# httpx
2 changes: 1 addition & 1 deletion requirements/ingest-onedrive.txt
@@ -15,7 +15,7 @@ certifi==2023.7.22
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
cffi==1.15.1
cffi==1.16.0
# via cryptography
charset-normalizer==3.2.0
# via
