Merge remote-tracking branch 'origin/main' into feat/weighted-average-table-metrics
Unstructured-IO · Jul 9, 2024 · d5b233d
2 parents 0381edf + 176875b
commit d5b233d
Showing 58 changed files with 9,123 additions and 6,191 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -256,6 +256,8 @@ jobs:
       matrix:
         python-version: [ "3.9","3.10" ]
     runs-on: ubuntu-latest
+    env:
+      NLTK_DATA: ${{ github.workspace }}/nltk_data
     needs: [ setup_ingest, lint ]
     steps:
       # actions/checkout MUST come before auth

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,20 +1,32 @@
-## 0.14.10-dev13
+## 0.14.11-dev3
 
 ### Enhancements
 
-* **Update unstructured-client dependency** Change unstructured-client dependency pin back to
-  greater than min version and updated tests that were failing given the update.
-* **`.doc` files are now supported in the `arm64` image.**. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow on work planned to investigate adding `.ppt` support for `arm64` as well.
-* Add table detection metrics: recall, precision and f1
-* Remove unused `_with_spans` metrics
+* **Refine HTML parser to accommodate block element nested in phrasing.** HTML parser no longer raises on a block element (e.g. `<p>`, `<div>`) nested inside a phrasing element (e.g. `<strong>` or `<cite>`). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
 * **Use (number of actual table) weighted average for table metrics** In evaluating table metrics the mean aggregation now uses the actual number of tables in a document to weight the metric scores
 
 ### Features
 
 ### Fixes
 
+## 0.14.10
+
+### Enhancements
+
+* **Update unstructured-client dependency** Change unstructured-client dependency pin back to greater than min version and updated tests that were failing given the update.
+* **`.doc` files are now supported in the `arm64` image.**. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow on work planned to investigate adding `.ppt` support for `arm64` as well.
+* **Add table detection metrics: recall, precision and f1.**
+* **Remove unused _with_spans metrics.**
+
+### Features
+
+**Add Object Detection Metrics to CI** Add object detection metrics (average precision, precision, recall and f1-score) implementations.
+
+### Fixes
+
 * **Fix counting false negatives and false positives in table structure evaluation**
 * **Fix Slack CI test** Change channel that Slack test is pointing to because previous test bot expired
+* **Remove NLTK download** Removes `nltk.download` in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705
 
 ## 0.14.9
 

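Note on the weighted-average entry in the CHANGELOG hunk above: the idea of weighting each document's table score by its number of actual tables can be illustrated with a minimal sketch. The function name `weighted_mean` and its parameters are hypothetical and are not taken from this commit; the real aggregation code may differ.

```python
from typing import Optional, Sequence


def weighted_mean(scores: Sequence[float], table_counts: Sequence[int]) -> Optional[float]:
    """Average per-document scores, weighting each by its number of actual tables.

    Hypothetical helper for illustration only; not the project's implementation.
    """
    if len(scores) != len(table_counts):
        raise ValueError("scores and table_counts must have the same length")
    total_tables = sum(table_counts)
    if total_tables == 0:
        # No tables anywhere, so there is nothing to weight by.
        return None
    weighted_sum = sum(score * count for score, count in zip(scores, table_counts))
    return weighted_sum / total_tables


# A document with 3 tables pulls the aggregate toward its score
# more strongly than a document with 1 table.
aggregate = weighted_mean([1.0, 0.5], [1, 3])
```

Compared with a plain mean (0.75 for the two scores above), the weighted version lets documents with more tables contribute proportionally more to the aggregate.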
diff --git a/Dockerfile b/Dockerfile
@@ -1,4 +1,4 @@
-FROM quay.io/unstructured-io/base-images:wolfi-base-d46498e@sha256:3db0544df1d8d9989cd3c3b28670d8b81351dfdc1d9129004c71ff05996fd51e as base
+FROM quay.io/unstructured-io/base-images:wolfi-base-e48da6b@sha256:8ad3479e5dc87a86e4794350cca6385c01c6d110902c5b292d1a62e231be711b as base
 
 USER root
 
@@ -18,8 +18,7 @@ USER notebook-user
 
 RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';' && \
   pip3.11 install unstructured.paddlepaddle && \
-  python3.11 -c "import nltk; nltk.download('punkt')" && \
-  python3.11 -c "import nltk; nltk.download('averaged_perceptron_tagger')" && \
+  python3.11 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()" && \
   python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
   python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
 

diff --git a/test_unstructured/nlp/test_tokenize.py b/test_unstructured/nlp/test_tokenize.py
@@ -2,22 +2,28 @@
 from unittest.mock import patch
 
 import nltk
+import pytest
 
 from test_unstructured.nlp.mock_nltk import mock_sent_tokenize, mock_word_tokenize
 from unstructured.nlp import tokenize
 
 
+def test_error_raised_on_nltk_download():
+    with pytest.raises(ValueError):
+        tokenize.nltk.download("tokenizers/punkt")
+
+
 def test_nltk_packages_download_if_not_present():
     with patch.object(nltk, "find", side_effect=LookupError):
-        with patch.object(nltk, "download") as mock_download:
-            tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+        with patch.object(tokenize, "download_nltk_packages") as mock_download:
+            tokenize._download_nltk_packages_if_not_present()
 
-    mock_download.assert_called_with("fake_package")
+    mock_download.assert_called_once()
 
 
 def test_nltk_packages_do_not_download_if():
     with patch.object(nltk, "find"), patch.object(nltk, "download") as mock_download:
-        tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+        tokenize._download_nltk_packages_if_not_present()
 
     mock_download.assert_not_called()
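
Note on the tests above: they exercise two behaviours, a download call that raises `ValueError` and a helper that fetches packages only when a lookup fails. The sketch below shows one way such behaviour could be structured. All names here (`disabled_download`, `ensure_packages`, and the `find` and `fetch_all` parameters) are hypothetical stand-ins, not the contents of `unstructured.nlp.tokenize`, and the lookup and fetch functions are injected so the example runs without NLTK installed.

```python
from typing import Callable, Iterable


def disabled_download(*args, **kwargs):
    """Hypothetical replacement for a download function that always refuses to run."""
    raise ValueError("direct download is disabled; fetch packages from a trusted mirror")


def ensure_packages(
    required: Iterable[str],
    find: Callable[[str], object],
    fetch_all: Callable[[], None],
) -> bool:
    """Call fetch_all once if any required resource is missing.

    Returns True if a fetch was triggered, False if everything was already present.
    """
    for resource in required:
        try:
            find(resource)
        except LookupError:
            fetch_all()
            return True
    return False


# Usage with stand-in callables: a lookup that always fails triggers one fetch.
calls = []


def always_missing(resource: str):
    raise LookupError(resource)


ran = ensure_packages(["tokenizers/punkt"], always_missing, lambda: calls.append("fetched"))
```

Injecting `find` and `fetch_all` keeps the sketch testable without patching; the tests in this diff reach the same two branches with `patch.object` instead.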