Merge remote-tracking branch 'origin/main' into feat/weighted-average-table-metrics
badGarnet committed Jul 9, 2024
2 parents 0381edf + 176875b commit d5b233d
Showing 58 changed files with 9,123 additions and 6,191 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -256,6 +256,8 @@ jobs:
       matrix:
         python-version: [ "3.9","3.10" ]
     runs-on: ubuntu-latest
+    env:
+      NLTK_DATA: ${{ github.workspace }}/nltk_data
     needs: [ setup_ingest, lint ]
     steps:
     # actions/checkout MUST come before auth
24 changes: 18 additions & 6 deletions CHANGELOG.md
@@ -1,20 +1,32 @@
## 0.14.10-dev13
## 0.14.11-dev3

### Enhancements

* **Update unstructured-client dependency** Change unstructured-client dependency pin back to
greater than min version and updated tests that were failing given the update.
* **`.doc` files are now supported in the `arm64` image.**. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow on work planned to investigate adding `.ppt` support for `arm64` as well.
* Add table detection metrics: recall, precision and f1
* Remove unused `_with_spans` metrics
* **Refine HTML parser to accommodate block element nested in phrasing.** HTML parser no longer raises on a block element (e.g. `<p>`, `<div>`) nested inside a phrasing element (e.g. `<strong>` or `<cite>`). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
* **Use (number of actual table) weighted average for table metrics** In evaluating table metrics the mean aggregation now uses the actual number of tables in a document to weight the metric scores

### Features

### Fixes

## 0.14.10

### Enhancements

* **Update unstructured-client dependency** Change unstructured-client dependency pin back to greater than min version and updated tests that were failing given the update.
* **`.doc` files are now supported in the `arm64` image.**. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow on work planned to investigate adding `.ppt` support for `arm64` as well.
* **Add table detection metrics: recall, precision and f1.**
* **Remove unused _with_spans metrics.**

### Features

**Add Object Detection Metrics to CI** Add object detection metrics (average precision, precision, recall and f1-score) implementations.

### Fixes

* **Fix counting false negatives and false positives in table structure evaluation**
* **Fix Slack CI test** Change channel that Slack test is pointing to because previous test bot expired
* **Remove NLTK download** Removes `nltk.download` in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705

## 0.14.9
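
The weighted-average entry in the `0.14.11-dev3` changelog above describes weighting each document's table metric score by the number of actual tables in that document. A minimal sketch of that aggregation follows. It is an illustration only, not the project's implementation: the function name `weighted_mean`, its parameters, and the sample values are all hypothetical, and the zero-table fallback is an assumption.

```python
from typing import Sequence


def weighted_mean(scores: Sequence[float], table_counts: Sequence[int]) -> float:
    """Average per-document scores, weighting each by its number of actual tables.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if len(scores) != len(table_counts):
        raise ValueError("scores and table_counts must have the same length")
    total_tables = sum(table_counts)
    if total_tables == 0:
        # Assumption: with no tables at all there is nothing to weight, so return 0.0.
        return 0.0
    return sum(s * n for s, n in zip(scores, table_counts)) / total_tables


# Hypothetical per-document scores and actual table counts.
doc_scores = [1.0, 0.5, 0.0]
doc_tables = [1, 3, 0]
print(weighted_mean(doc_scores, doc_tables))
```

With these sample values a plain mean would treat all three documents equally, whereas the weighted version lets the three-table document dominate and ignores the document that contains no tables.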

5 changes: 2 additions & 3 deletions Dockerfile
@@ -1,4 +1,4 @@
-FROM quay.io/unstructured-io/base-images:wolfi-base-d46498e@sha256:3db0544df1d8d9989cd3c3b28670d8b81351dfdc1d9129004c71ff05996fd51e as base
+FROM quay.io/unstructured-io/base-images:wolfi-base-e48da6b@sha256:8ad3479e5dc87a86e4794350cca6385c01c6d110902c5b292d1a62e231be711b as base

USER root

@@ -18,8 +18,7 @@ USER notebook-user

 RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';' && \
   pip3.11 install unstructured.paddlepaddle && \
-  python3.11 -c "import nltk; nltk.download('punkt')" && \
-  python3.11 -c "import nltk; nltk.download('averaged_perceptron_tagger')" && \
+  python3.11 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()" && \
   python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
   python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"

14 changes: 10 additions & 4 deletions test_unstructured/nlp/test_tokenize.py
@@ -2,22 +2,28 @@
 from unittest.mock import patch

 import nltk
+import pytest

 from test_unstructured.nlp.mock_nltk import mock_sent_tokenize, mock_word_tokenize
 from unstructured.nlp import tokenize


+def test_error_raised_on_nltk_download():
+    with pytest.raises(ValueError):
+        tokenize.nltk.download("tokenizers/punkt")
+
+
 def test_nltk_packages_download_if_not_present():
     with patch.object(nltk, "find", side_effect=LookupError):
-        with patch.object(nltk, "download") as mock_download:
-            tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+        with patch.object(tokenize, "download_nltk_packages") as mock_download:
+            tokenize._download_nltk_packages_if_not_present()

-    mock_download.assert_called_with("fake_package")
+    mock_download.assert_called_once()


 def test_nltk_packages_do_not_download_if():
     with patch.object(nltk, "find"), patch.object(nltk, "download") as mock_download:
-        tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+        tokenize._download_nltk_packages_if_not_present()

     mock_download.assert_not_called()
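
The tests above exercise a "download only if the lookup fails" pattern: a `LookupError` from the finder should trigger the download, and a successful lookup should not. The sketch below shows that pattern in isolation under stated assumptions. It does not reproduce the code in `unstructured.nlp.tokenize`; the names `download_if_not_present`, `missing`, and `present` are hypothetical, and the finder and downloader are passed in as plain callables so that the example runs without NLTK installed.

```python
from typing import Callable, List


def download_if_not_present(
    find: Callable[[str], object],
    download: Callable[[], None],
    resource: str,
) -> bool:
    """Call `download` only when `find` raises LookupError for `resource`.

    Returns True if a download was triggered. Hypothetical helper for illustration.
    """
    try:
        find(resource)
    except LookupError:
        download()
        return True
    return False


def missing(resource: str) -> object:
    # Stand-in for a finder that cannot locate the resource.
    raise LookupError(resource)


def present(resource: str) -> object:
    # Stand-in for a finder that locates the resource.
    return resource


calls: List[str] = []
triggered = download_if_not_present(missing, lambda: calls.append("downloaded"), "tokenizers/punkt")
skipped = download_if_not_present(present, lambda: calls.append("downloaded"), "tokenizers/punkt")
```

Injecting the finder and downloader keeps the check testable without network access, which mirrors the way the tests above replace both with mocks.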
