Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish authored Oct 6, 2023
2 parents 6ce99bf + 2e1404e commit c9bfd65
Showing 172 changed files with 2,722 additions and 2,663 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -137,6 +137,8 @@ dmypy.json

# ingest outputs
/structured-output
test_unstructured_ingest/workdir/
test_unstructured_ingest/delta-table-dest/

# suggested ingest mirror directory
/mirror
10 changes: 6 additions & 4 deletions CHANGELOG.md
@@ -1,21 +1,23 @@
## 0.10.20-dev4
## 0.10.20-dev5

### Enhancements

* **Add document level language detection functionality.** Adds the "auto" default for the languages param to all partitioners. The primary language present in the document is detected using the `langdetect` package. Additional param `detect_language_per_element` is also added for partitioners that return multiple elements. Defaults to `False`.
* **Align to top left when shrinking bounding boxes for `xy-curt` sorting:** Update `shrink_bbox()` to keep top left rather than center
* **Align to top left when shrinking bounding boxes for `xy-cut` sorting:** Update `shrink_bbox()` to keep top left rather than center.
* **Add visualization script to annotate elements** This script is often used to analyze/visualize elements with coordinates (e.g. partition_pdf()).
* **Adds data source properties to the Jira connector** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Improve title detection in pptx documents** The default title textboxes on a pptx slide are now categorized as titles.
* **Improve hierarchy detection in pptx documents** List items, and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents.

* **Refactor of the ingest cli workflow** The refactored approach uses a dynamically set pipeline with a snapshot along each step to save progress and accommodate continuation from a snapshot if an error occurs. This also allows the pipeline to dynamically assign any number of steps to modify the partitioned content before it gets written to a destination.
### Features

* **Adds detection_origin field to metadata** Problem: Currently there isn't an easy way to find out how an element was created. With this change that information is added. Importance: With this information the developers and users are now able to know how an element was created to make decisions on how to use it. In order to use this feature
setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.

### Fixes

* **Fix prevent metadata module from importing dependencies from unnecessary modules** Problem: The `metadata` module had several top level imports that were only used in and applicable to code related to specific document types, while there were many general-purpose functions. As a result, general-purpose functions couldn't be used without unnecessary dependencies being installed. Fix: moved 3rd party dependency top level imports to inside the functions in which they are used and applied a decorator to check that the dependency is installed and emit a helpful error message if not.
* **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth`= None even when `Headline` and/or `Subheadline` elements are present in the same page. Fix: all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead iff there are `Headline` and/or `Subheadline` element-types present. Importance: `Title` elements should be equivalent html `H1` when nested headings are present; otherwise, `category_depth` metadata can result ambiguous within elements in a page.
* **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
* **Fixes badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
@@ -79,7 +81,7 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text


* **Fix badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.

## 0.10.16
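
Note on the two `xy-cut` entries in the 0.10.20 changelog hunk: the sketch below illustrates the two ideas they describe, shrinking a bounding box while keeping its top-left corner fixed, and ordering boxes by X coordinate first and then Y. It is a minimal illustration, not the code from this commit. The function names, the `(x1, y1, x2, y2)` tuple layout, and the `shrink_factor` parameter are assumptions made for the example; the actual `shrink_bbox()` signature is not shown in this diff, and the real `xy-cut` sort does more than a plain key sort.

```python
from typing import List, Tuple

# Assumed layout for this sketch: (x1, y1, x2, y2) with the origin at the top left.
BBox = Tuple[float, float, float, float]


def shrink_bbox_top_left(bbox: BBox, shrink_factor: float) -> BBox:
    """Scale width and height by shrink_factor, keeping (x1, y1) fixed.

    Hypothetical helper; a center-anchored shrink would move all four edges,
    whereas this version only moves the right and bottom edges.
    """
    x1, y1, x2, y2 = bbox
    new_width = (x2 - x1) * shrink_factor
    new_height = (y2 - y1) * shrink_factor
    return (x1, y1, x1 + new_width, y1 + new_height)


def order_by_x_then_y(bboxes: List[BBox]) -> List[int]:
    """Return box indices sorted by left edge first, then by top edge.

    Hypothetical helper showing the "X first, then Y" key only.
    """
    return sorted(range(len(bboxes)), key=lambda i: (bboxes[i][0], bboxes[i][1]))


# Two boxes in a left column and one in a right column.
boxes = [
    (300.0, 50.0, 500.0, 100.0),  # right column, top
    (50.0, 200.0, 250.0, 250.0),  # left column, lower
    (50.0, 50.0, 250.0, 100.0),   # left column, top
]
shrunk = [shrink_bbox_top_left(b, 0.9) for b in boxes]
reading_order = order_by_x_then_y(shrunk)  # left column first: [2, 1, 0]
```

Because the top-left corner is unchanged by the shrink, the sort keys of the shrunk boxes match those of the originals, so boxes in the same column keep a shared left edge.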
2 changes: 1 addition & 1 deletion Dockerfile
@@ -67,6 +67,6 @@ USER ${NB_USER}
COPY example-docs example-docs
COPY unstructured unstructured

RUN python3.10 -c "from unstructured.ingest.doc_processor.generalized import initialize; initialize()"
RUN python3.10 -c "from unstructured.ingest.pipeline.initialize import initialize; initialize()"

CMD ["/bin/bash"]
4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -6,7 +6,7 @@
#
alabaster==0.7.13
# via sphinx
babel==2.12.1
babel==2.13.0
# via sphinx
beautifulsoup4==4.12.2
# via
@@ -116,7 +116,7 @@ sphinxcontrib-serializinghtml==1.1.5
# via
# -r requirements/build.in
# sphinx
urllib3==1.26.16
urllib3==1.26.17
# via
# -c requirements/base.txt
# -c requirements/constraints.in
4 changes: 2 additions & 2 deletions requirements/base.txt
@@ -46,7 +46,7 @@ python-iso639==2023.6.15
# via -r requirements/base.in
python-magic==0.4.27
# via -r requirements/base.in
regex==2023.8.8
regex==2023.10.3
# via nltk
requests==2.31.0
# via -r requirements/base.in
@@ -62,7 +62,7 @@ typing-extensions==4.8.0
# via typing-inspect
typing-inspect==0.9.0
# via dataclasses-json
urllib3==1.26.16
urllib3==1.26.17
# via
# -c requirements/constraints.in
# requests
4 changes: 2 additions & 2 deletions requirements/build.txt
@@ -6,7 +6,7 @@
#
alabaster==0.7.13
# via sphinx
babel==2.12.1
babel==2.13.0
# via sphinx
beautifulsoup4==4.12.2
# via
@@ -116,7 +116,7 @@ sphinxcontrib-serializinghtml==1.1.5
# via
# -r requirements/build.in
# sphinx
urllib3==1.26.16
urllib3==1.26.17
# via
# -c requirements/base.txt
# -c requirements/constraints.in
1 change: 1 addition & 0 deletions requirements/constraints.in
@@ -44,3 +44,4 @@ anyio<4.0
# pinned in unstructured paddleocr
opencv-python==4.8.0.76
opencv-contrib-python==4.8.0.76
onnxruntime==1.15.1