Merge branch 'main' into jj/1534-doc-lvl-lang

Unstructured-IO · Oct 6, 2023 · c9bfd65
2 parents 6ce99bf + 2e1404e
commit c9bfd65

Showing 172 changed files with 2,722 additions and 2,663 deletions.
diff --git a/.gitignore b/.gitignore
@@ -137,6 +137,8 @@ dmypy.json
 
 # ingest outputs
 /structured-output
+test_unstructured_ingest/workdir/
+test_unstructured_ingest/delta-table-dest/
 
 # suggested ingest mirror directory
 /mirror

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,21 +1,23 @@
-## 0.10.20-dev4
+## 0.10.20-dev5
 
 ### Enhancements
 
 * **Add document level language detection functionality.** Adds the "auto" default for the languages param to all partitioners. The primary language present in the document is detected using the `langdetect` package. Additional param `detect_language_per_element` is also added for partitioners that return multiple elements. Defaults to `False`.
 * **Align to top left when shrinking bounding boxes for `xy-curt` sorting:** Update `shrink_bbox()` to keep top left rather than center
+* **Align to top left when shrinking bounding boxes for `xy-cut` sorting:** Update `shrink_bbox()` to keep top left rather than center.
 * **Add visualization script to annotate elements** This script is often used to analyze/visualize elements with coordinates (e.g. partition_pdf()).
 * **Adds data source properties to the Jira connector** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
 * **Improve title detection in pptx documents** The default title textboxes on a pptx slide are now categorized as titles.
 * **Improve hierarchy detection in pptx documents** List items, and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents.
-
+* **Refactor of the ingest cli workflow** The refactored approach uses a dynamically set pipeline with a snapshot along each step to save progress and accommodate continuation from a snapshot if an error occurs. This also allows the pipeline to dynamically assign any number of steps to modify the partitioned content before it gets written to a destination.
 ### Features
 
 * **Adds detection_origin field to metadata** Problem: Currently isn't an easy way to find out how an element was created. With this change that information is added. Importance: With this information the developers and users are now able to know how an element was created to make decisions on how to use it. In order tu use this feature
-setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed. 
+setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
 
 ### Fixes
 
+* **Fix prevent metadata module from importing dependencies from unnecessary modules** Problem: The `metadata` module had several top level imports that were only used in and applicable to code related to specific document types, while there were many general-purpose functions. As a result, general-purpose functions couldn't be used without unnecessary dependencies being installed. Fix: moved 3rd party dependency top level imports to inside the functions in which they are used and applied a decorator to check that the dependency is installed and emit a helpful error message if not.
 * **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth`= None even when `Headline` and/or `Subheadline` elements are present in the same page. Fix: all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead iff there are `Headline` and/or `Subheadline` element-types present. Importance: `Title` elements should be equivalent html `H1` when nested headings are present; otherwise, `category_depth` metadata can result ambiguous within elements in a page.
 * **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
 * **Fixes badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
@@ -79,7 +81,7 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
 
 
 * **Fix badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
-should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class 
+should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
 allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
 
 ## 0.10.16

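Note on the CHANGELOG hunk above: the `metadata` fix entry describes moving third-party imports inside the functions that use them and guarding those functions with a decorator that checks the dependency is installed. The following is a minimal sketch of that pattern, not the code from this commit. The names `requires_dependency`, `dumps_sorted`, and `needs_missing_package` are hypothetical, and the stdlib `json` module stands in for a third-party package so the example runs anywhere.

```python
import functools
import importlib.util


def requires_dependency(module_name, install_hint=None):
    """Hypothetical decorator: raise a helpful ImportError if `module_name` is absent.

    `module_name` is assumed to be a top-level module name (no dots).
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            if importlib.util.find_spec(module_name) is None:
                hint = install_hint or f"pip install {module_name}"
                raise ImportError(
                    f"{func.__name__}() requires '{module_name}'. "
                    f"Install it with: {hint}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependency("json")
def dumps_sorted(data):
    # The import is deferred to call time, so importing this module never
    # pulls in the dependency for callers that only need other functions.
    import json

    return json.dumps(data, sort_keys=True)


@requires_dependency("not_a_real_package_xyz")
def needs_missing_package():
    # Never reached when the package is missing; the decorator raises first.
    import not_a_real_package_xyz  # noqa: F401

    return True
```

The check runs at call time rather than import time, which is what lets general-purpose functions in the same module be used without the optional packages installed.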
diff --git a/Dockerfile b/Dockerfile
@@ -67,6 +67,6 @@ USER ${NB_USER}
 COPY example-docs example-docs
 COPY unstructured unstructured
 
-RUN python3.10 -c "from unstructured.ingest.doc_processor.generalized import initialize; initialize()"
+RUN python3.10 -c "from unstructured.ingest.pipeline.initialize import initialize; initialize()"
 
 CMD ["/bin/bash"]
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -6,7 +6,7 @@
 #
 alabaster==0.7.13
     # via sphinx
-babel==2.12.1
+babel==2.13.0
     # via sphinx
 beautifulsoup4==4.12.2
     # via
@@ -116,7 +116,7 @@ sphinxcontrib-serializinghtml==1.1.5
     # via
     #   -r requirements/build.in
     #   sphinx
-urllib3==1.26.16
+urllib3==1.26.17
     # via
     #   -c requirements/base.txt
     #   -c requirements/constraints.in

diff --git a/requirements/base.txt b/requirements/base.txt
@@ -46,7 +46,7 @@ python-iso639==2023.6.15
     # via -r requirements/base.in
 python-magic==0.4.27
     # via -r requirements/base.in
-regex==2023.8.8
+regex==2023.10.3
     # via nltk
 requests==2.31.0
     # via -r requirements/base.in
@@ -62,7 +62,7 @@ typing-extensions==4.8.0
     # via typing-inspect
 typing-inspect==0.9.0
     # via dataclasses-json
-urllib3==1.26.16
+urllib3==1.26.17
     # via
     #   -c requirements/constraints.in
     #   requests
diff --git a/requirements/build.txt b/requirements/build.txt
@@ -6,7 +6,7 @@
 #
 alabaster==0.7.13
     # via sphinx
-babel==2.12.1
+babel==2.13.0
     # via sphinx
 beautifulsoup4==4.12.2
     # via
@@ -116,7 +116,7 @@ sphinxcontrib-serializinghtml==1.1.5
     # via
     #   -r requirements/build.in
     #   sphinx
-urllib3==1.26.16
+urllib3==1.26.17
     # via
     #   -c requirements/base.txt
     #   -c requirements/constraints.in

diff --git a/requirements/constraints.in b/requirements/constraints.in
@@ -44,3 +44,4 @@ anyio<4.0
 # pinned in unstructured paddleocr
 opencv-python==4.8.0.76
 opencv-contrib-python==4.8.0.76
+onnxruntime==1.15.1