feat: method to catch and classify overlapping bounding boxes (#1803)

Overlapping bounding boxes do not have a one-size-fits-all solution, so different cases need to be handled differently to avoid information loss. We have manually identified the categories of overlap. Now we need a method to programmatically classify overlapping-bbox cases among the detected elements in a document and return a report about them (a list of cases with metadata). This serves three purposes:

- **Evaluation**: We can have a pipeline using the DVC data registry that assesses the performance of a detection model against a set of documents (PDFs/images) by analysing the overlapping-bbox cases it produces. The metadata in the output can be used to generate metrics for this.
- **Scope overlapping cases**: Manual inspection gives us a clue about the overlapping-bbox cases currently present. We need to propose code fixes for them. This method generates a report by analysing several aspects of two overlapping regions. This data can be used to profile each case and specify the changes needed to fix it.
- **Fix overlapping cases**: We could introduce this functionality into the flow of a partition method (such as `partition_pdf`) to handle the calls to post-processing methods that fix overlapping.

Tested on ~331 documents, the worst time per page is around 5 ms. For a document such as `layout-parser-paper.pdf` it takes 4.46 ms.

Introduces functionality to take a list of unstructured elements (which contain bounding boxes), identify the pairs of bounding boxes that overlap, and determine which case applies to each pair.

This PR includes the following methods in `utils.py`:

- **`ngrams(s, n)`**: Generate n-grams from a string.
- **`calculate_shared_ngram_percentage(string_A, string_B, n)`**: Calculate the percentage of `common_ngrams` between `string_A` and `string_B` with reference to the total number of n-grams in `string_A`.
- **`calculate_largest_ngram_percentage(string_A, string_B)`**: Iteratively call `calculate_shared_ngram_percentage`, starting from the biggest n-gram possible, until the shared percentage is >0.0%.
- **`is_parent_box(parent_target, child_target, add=0)`**: True if the `child_target` bounding box is nested in the `parent_target`. Box format: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The parameter `add` is the error tolerance, in pixels, for the child extending outside the parent region.
- **`calculate_overlap_percentage(box1, box2, intersection_ratio_method="total")`**: Box format: [`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the percentage of the overlapped region with reference to the biggest element region (`intersection_ratio_method="parent"`), the smallest element region (`intersection_ratio_method="partial"`), or the union of the two regions (`intersection_ratio_method="total"`).
- **`identify_overlapping_or_nesting_case`**: Identify whether there are nested or overlapping elements. If overlapping is present, it identifies the case by calling `identify_overlapping_case`.
- **`identify_overlapping_case`**: Classifies the overlapping case for an input element pair into one of 5 categories of overlapping.
- **`catch_overlapping_and_nested_bboxes`**: Catch overlapping and nested bounding-box cases across a list of elements. The params `nested_error_tolerance_px` and `sm_overlap_threshold` help control the separation of the cases.

The overlapping/nested element cases being caught are:

1. **Nested elements**
2. **Small partial overlap**
3. **Partial overlap with empty content**
4. **Partial overlap with duplicate text (sharing 100% of the text)**
5. **Partial overlap without sharing text**
6. **Partial overlap sharing** {`calculate_largest_ngram_percentage(...)`}% **of the text**

Here is a snippet to test it:

```
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"

# Partition the document, then report every overlapping or nested pair of elements.
elements = partition(filename=target, strategy="hi_res", model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```

Here is a screenshot of a JSON built with the output list `overlapping_cases`:

<img width="377" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">
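
The n-gram comparison described above can be illustrated with a minimal sketch. It is not the code added to `utils.py`: the helper names are hypothetical, and whitespace tokenization into word n-grams is an assumption (it is consistent with the "2-gram" result expected in the tests for "Some lovely text" vs. "Some lovely title").

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous word n-grams of `text` (assumes whitespace tokenization)."""
    words = text.split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]


def shared_ngram_percentage(string_a: str, string_b: str, n: int) -> float:
    """Percentage of string_a's distinct n-grams that also occur in string_b."""
    grams_a = set(word_ngrams(string_a, n))
    grams_b = set(word_ngrams(string_b, n))
    if not grams_a:
        return 0.0
    return 100.0 * len(grams_a & grams_b) / len(grams_a)


def largest_shared_ngram_percentage(string_a: str, string_b: str) -> Tuple[float, int]:
    """Start from the largest possible n and stop at the first non-zero share."""
    max_n = min(len(string_a.split()), len(string_b.split()))
    for n in range(max_n, 0, -1):
        pct = shared_ngram_percentage(string_a, string_b, n)
        if pct > 0.0:
            return pct, n
    return 0.0, 0  # no n-gram of any size is shared


pct, n = largest_shared_ngram_percentage("Some lovely text", "Some lovely title")
print(f"{pct}% shared at n={n}")  # 50.0% shared at n=2
```

Searching from the largest n downward means the reported percentage refers to the longest run of words the two texts have in common, which is more informative than a single-word match.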
Unstructured-IO · Oct 25, 2023 · c11a2ff
1 parent d8241cb
Showing 3 changed files with 622 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,8 @@
 
 ### Features
 
+* **Functionality to catch and classify overlapping/nested elements** Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlap_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
+* **Add Local connector source metadata** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
 * **Add Local connector source metadata.** python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
 
 ### Fixes

diff --git a/test_unstructured/test_utils.py b/test_unstructured/test_utils.py
@@ -4,6 +4,8 @@
 import pytest
 
 from unstructured import utils
+from unstructured.documents.coordinates import PixelSpace
+from unstructured.documents.elements import ElementMetadata, NarrativeText, Title
 
 
 @pytest.fixture()
@@ -110,3 +112,218 @@ def test_only_raises_when_len_more_than_1(iterator):
 def test_only_raises_if_empty(iterator):
     with pytest.raises(ValueError):
         utils.only(iterator)
+
+
+@pytest.mark.parametrize(
+    ("elements", "nested_error_tolerance_px", "sm_overlap_threshold", "expectation"),
+    [
+        (
+            [
+                Title(
+                    text="Some lovely title",
+                    coordinates=((4, 5), (4, 8), (7, 8), (7, 5)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+                NarrativeText(
+                    text="Some lovely text",
+                    coordinates=((2, 3), (2, 6), (5, 6), (5, 3)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+            ],
+            5,
+            10.0,
+            (
+                True,
+                [
+                    {
+                        "overlapping_elements": ["Title(ix=0)", "NarrativeText(ix=1)"],
+                        "overlapping_case": "nested NarrativeText in Title",
+                        "overlap_percentage": "100%",
+                        "metadata": {
+                            "largest_ngram_percentage": None,
+                            "overlap_percentage_total": "5.88%",
+                            "max_area": "9pxˆ2",
+                            "min_area": "9pxˆ2",
+                            "total_area": "18pxˆ2",
+                        },
+                    },
+                ],
+            ),
+        ),
+        (
+            [
+                Title(
+                    text="Some lovely title",
+                    coordinates=((4, 5), (4, 8), (7, 8), (7, 5)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+                NarrativeText(
+                    text="Some lovely text",
+                    coordinates=((2, 3), (2, 6), (5, 6), (5, 3)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+            ],
+            1,
+            10.0,
+            (
+                True,
+                [
+                    {
+                        "overlapping_elements": ["0. Title(ix=0)", "1. NarrativeText(ix=1)"],
+                        "overlapping_case": "partial overlap sharing 50.0% of the text from1. "
+                        "NarrativeText(2-gram)",
+                        "overlap_percentage": "11.11%",
+                        "metadata": {
+                            "largest_ngram_percentage": 50.0,
+                            "overlap_percentage_total": "5.88%",
+                            "max_area": "9pxˆ2",
+                            "min_area": "9pxˆ2",
+                            "total_area": "18pxˆ2",
+                        },
+                    },
+                ],
+            ),
+        ),
+        (
+            [
+                Title(
+                    text="Some lovely title",
+                    coordinates=((4, 5), (4, 8), (7, 8), (7, 5)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+                NarrativeText(
+                    text="Some lovely title",
+                    coordinates=((2, 3), (2, 6), (5, 6), (5, 3)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+            ],
+            1,
+            10.0,
+            (
+                True,
+                [
+                    {
+                        "overlapping_elements": ["0. Title(ix=0)", "1. NarrativeText(ix=1)"],
+                        "overlapping_case": "partial overlap with duplicate text",
+                        "overlap_percentage": "11.11%",
+                        "metadata": {
+                            "largest_ngram_percentage": None,
+                            "overlap_percentage_total": "5.88%",
+                            "max_area": "9pxˆ2",
+                            "min_area": "9pxˆ2",
+                            "total_area": "18pxˆ2",
+                        },
+                    },
+                ],
+            ),
+        ),
+        (
+            [
+                Title(
+                    text="Some lovely title",
+                    coordinates=((4, 5), (4, 8), (7, 8), (7, 5)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+                NarrativeText(
+                    text="Something totally different here",
+                    coordinates=((2, 3), (2, 6), (5, 6), (5, 3)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+            ],
+            1,
+            10.0,
+            (
+                True,
+                [
+                    {
+                        "overlapping_elements": ["0. Title(ix=0)", "1. NarrativeText(ix=1)"],
+                        "overlapping_case": "partial overlap without sharing text",
+                        "overlap_percentage": "11.11%",
+                        "metadata": {
+                            "largest_ngram_percentage": 0,
+                            "overlap_percentage_total": "5.88%",
+                            "max_area": "9pxˆ2",
+                            "min_area": "9pxˆ2",
+                            "total_area": "18pxˆ2",
+                        },
+                    },
+                ],
+            ),
+        ),
+        (
+            [
+                Title(
+                    text="Some lovely title",
+                    coordinates=((5, 6), (5, 10), (8, 10), (8, 6)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+                NarrativeText(
+                    text="Some lovely text",
+                    coordinates=((1, 3), (2, 7), (6, 7), (5, 3)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+            ],
+            1,
+            10.0,
+            (
+                True,
+                [
+                    {
+                        "overlapping_elements": ["0. Title(ix=0)", "1. NarrativeText(ix=1)"],
+                        "overlapping_case": "Small partial overlap",
+                        "overlap_percentage": "8.33%",
+                        "metadata": {
+                            "largest_ngram_percentage": None,
+                            "overlap_percentage_total": "3.23%",
+                            "max_area": "20pxˆ2",
+                            "min_area": "12pxˆ2",
+                            "total_area": "32pxˆ2",
+                        },
+                    },
+                ],
+            ),
+        ),
+        (
+            [
+                Title(
+                    text="Some lovely title",
+                    coordinates=((4, 6), (4, 7), (7, 7), (7, 6)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+                NarrativeText(
+                    text="Some lovely text",
+                    coordinates=((6, 8), (6, 9), (9, 9), (9, 8)),
+                    coordinate_system=PixelSpace(width=20, height=20),
+                    metadata=ElementMetadata(page_number=1),
+                ),
+            ],
+            1,
+            10.0,
+            (False, []),
+        ),
+    ],
+)
+def test_catch_overlapping_and_nested_bboxes(
+    elements,
+    expectation,
+    nested_error_tolerance_px,
+    sm_overlap_threshold,
+):
+    overlapping_flag, overlapping_cases = utils.catch_overlapping_and_nested_bboxes(
+        elements,
+        nested_error_tolerance_px,
+        sm_overlap_threshold,
+    )
+    assert overlapping_flag == expectation[0]
+    assert overlapping_cases == expectation[1]
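
The percentages expected in the parametrized cases above can be reproduced by hand. The following is a minimal sketch, not the implementation in `utils.py`: the helper names are hypothetical, boxes are assumed to be axis-aligned in the `[x_bottom_left, y_bottom_left, x_top_right, y_top_right]` format, and the `"total"` method is assumed to divide by the union area, which is what the expected `5.88%` implies.

```python
from typing import Sequence

Box = Sequence[float]  # [x_bottom_left, y_bottom_left, x_top_right, y_top_right]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(box1: Box, box2: Box) -> float:
    """Area of the region shared by both boxes (0 if they do not overlap)."""
    x1, y1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    x2, y2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    return box_area((x1, y1, x2, y2))


def overlap_percentage(box1: Box, box2: Box, method: str = "total") -> float:
    """Intersection area as a percentage of the chosen reference area."""
    inter = intersection_area(box1, box2)
    a1, a2 = box_area(box1), box_area(box2)
    if method == "parent":
        reference = max(a1, a2)  # biggest region
    elif method == "partial":
        reference = min(a1, a2)  # smallest region
    elif method == "total":
        reference = a1 + a2 - inter  # union of both regions (assumption)
    else:
        raise ValueError(f"unknown method: {method}")
    return round(100.0 * inter / reference, 2) if reference else 0.0


def is_nested(parent: Box, child: Box, tolerance: float = 0) -> bool:
    """True if `child` lies inside `parent`, allowing `tolerance` pixels of spill."""
    return (
        child[0] >= parent[0] - tolerance
        and child[1] >= parent[1] - tolerance
        and child[2] <= parent[2] + tolerance
        and child[3] <= parent[3] + tolerance
    )


title = (4, 5, 7, 8)      # corners ((4, 5), (4, 8), (7, 8), (7, 5))
narrative = (2, 3, 5, 6)  # corners ((2, 3), (2, 6), (5, 6), (5, 3))

print(overlap_percentage(title, narrative, "partial"))  # 11.11
print(overlap_percentage(title, narrative, "total"))    # 5.88
print(is_nested(title, narrative, tolerance=5))         # True
print(is_nested(title, narrative, tolerance=1))         # False
```

Under these assumptions, the same pair of boxes counts as nested with a 5 px tolerance and as a partial overlap with a 1 px tolerance, which matches the first two parametrized cases.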