fix: sectioner dissociated titles from their chunk (#1861)

### disassociated-titles

**Executive Summary.** Section titles are often combined with the prior section and are then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks are combined into a single chunk that better fills the chunk window. Chunking can be, and by default is, configured to combine sequential small chunks that together fit within the full chunk window (default 500 chars).

Combination is only valid for "whole" chunks. The current implementation attempts to combine at the element level (in the sectioner), meaning a small initial element (such as a `Title`) is combined with the prior section without considering the remaining length of the section that title belongs to. This frequently causes a title element to be removed from the chunk it belongs to and added to the prior, otherwise unrelated, chunk.

Example:

```python
from typing import List

from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import CompositeElement, Element, Text, Title

elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at the element level, at least not without complicating things with arbitrary look-ahead into future elements. It is much more straightforward to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing `_Section` objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length` property, which allow a combining client to produce a combined section (only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of combination, acting as a "filter", accepting a stream of sections and emitting the same type, just with some resulting from two or more combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes responsibility for repeatedly accumulating sections, characterizing their length and doing the actual combining (calling `_Section.combine(other_section)`) when instructed. Very similar in concept to `_TextSectionBuilder`, just at the section level instead of element level.
- Remove attempts to combine sections at the element level from `_split_elements_by_title_and_table()` and install `_SectionCombiner` as filter between sectioner and chunker.
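
The section-level "filter" described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TextSection` and `iter_combined_sections` are hypothetical stand-ins that hold plain strings rather than elements, and the exact threshold comparison (`<` versus `<=`) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

SEPARATOR_LEN = 2  # -- length of the "\n\n" blank-line separator between text segments


@dataclass
class TextSection:
    """Hypothetical stand-in for `_TextSection`, holding plain strings, not elements."""

    texts: List[str]

    @property
    def text_length(self) -> int:
        # -- segment lengths plus one separator between each adjacent pair
        separators = SEPARATOR_LEN * max(len(self.texts) - 1, 0)
        return sum(len(t) for t in self.texts) + separators

    def combine(self, other: "TextSection") -> "TextSection":
        return TextSection(self.texts + other.texts)


def iter_combined_sections(
    sections: Iterable[object], maxlen: int, combine_text_under_n_chars: int
) -> Iterator[object]:
    """Emit sections, merging a small text section with the whole sections after it."""
    pending: Optional[TextSection] = None
    for section in sections:
        if not isinstance(section, TextSection):
            # -- non-text sections are never combined; they also end any pending run
            if pending is not None:
                yield pending
                pending = None
            yield section
            continue
        if pending is None:
            pending = section
            continue
        combined_length = pending.text_length + SEPARATOR_LEN + section.text_length
        if pending.text_length < combine_text_under_n_chars and combined_length <= maxlen:
            pending = pending.combine(section)
        else:
            yield pending
            pending = section
    if pending is not None:
        yield pending


sections = [
    TextSection(["Lorem Ipsum", "Lorem ipsum dolor sit amet consectetur adipiscing elit."]),
    TextSection(["Rhoncus", "In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."]),
]
combined = list(iter_combined_sections(sections, maxlen=80, combine_text_under_n_chars=80))
```

Because the decision is made on whole sections, the "Rhoncus" title can only move together with the text that follows it. Here the two sections do not fit in one 80-character window, so both are emitted unchanged.
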
Unstructured-IO · Oct 30, 2023 · 7373391
1 parent 76213ec
commit 7373391
Showing 3 changed files with 488 additions and 20 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,6 +13,7 @@
 * **Fix wrong logger for paddle info** Replace the logger from unstructured-inference with the logger from unstructured for paddle_ocr.py module.
 * **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
 * **Fix unnecessary mid-text chunk-splitting.** The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated producing a over-sized chunk that required mid-text splitting.
+* **Fix frequent dissociation of title from chunk.** The sectioning algorithm included the title of the next section with the prior section whenever it would fit, frequently producing association of a section title with the prior section and dissociating it from its actual section. Fix this by performing combination of whole sections only.
 
 ## 0.10.27
 

diff --git a/test_unstructured/chunking/test_title.py b/test_unstructured/chunking/test_title.py
@@ -6,9 +6,11 @@
 
 from unstructured.chunking.title import (
     _NonTextSection,
+    _SectionCombiner,
     _split_elements_by_title_and_table,
     _TableSection,
     _TextSection,
+    _TextSectionAccumulator,
     _TextSectionBuilder,
     chunk_by_title,
 )
@@ -199,7 +201,6 @@ def test_split_elements_by_title_and_table():
     sections = _split_elements_by_title_and_table(
         elements,
         multipage_sections=True,
-        combine_text_under_n_chars=0,
         new_after_n_chars=500,
         max_characters=500,
     )
@@ -734,7 +735,7 @@ def it_provides_access_to_its_elements(self):
 
 
 class Describe_TextSectionBuilder:
-    """Unit-test suite for `unstructured.chunking.title._TextSection objects."""
+    """Unit-test suite for `unstructured.chunking.title._TextSectionBuilder`."""
 
     def it_is_empty_on_construction(self):
         builder = _TextSectionBuilder(maxlen=50)
@@ -802,3 +803,347 @@ def it_considers_separator_length_when_computing_text_length_and_remaining_space
         # -- between the current text and that of the next element if one was added.
         # -- So 50 - 12 - 2 = 36 here, not 50 - 12 = 38
         assert builder.remaining_space == 36
+
+
+# == SectionCombiner =============================================================================
+
+
+class Describe_SectionCombiner:
+    """Unit-test suite for `unstructured.chunking.title._SectionCombiner`."""
+
+    def it_combines_sequential_small_text_sections(self):
+        sections = [
+            _TextSection(
+                [
+                    Title("Lorem Ipsum"),  # 11
+                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
+                ]
+            ),
+            _TextSection(
+                [
+                    Title("Mauris Nec"),  # 10
+                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),  # 59
+                ]
+            ),
+            _TextSection(
+                [
+                    Title("Sed Orci"),  # 8
+                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),  # 63
+                ]
+            ),
+        ]
+
+        section_iter = _SectionCombiner(
+            sections, maxlen=250, combine_text_under_n_chars=250
+        ).iter_combined_sections()
+
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Lorem Ipsum"),
+            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+            Title("Mauris Nec"),
+            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+            Title("Sed Orci"),
+            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
+        ]
+        with pytest.raises(StopIteration):
+            next(section_iter)
+
+    def but_it_does_not_combine_table_or_non_text_sections(self):
+        sections = [
+            _TextSection(
+                [
+                    Title("Lorem Ipsum"),
+                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+                ]
+            ),
+            _TableSection(Table("<table></table>")),
+            _TextSection(
+                [
+                    Title("Mauris Nec"),
+                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+                ]
+            ),
+            _NonTextSection(CheckBox()),
+            _TextSection(
+                [
+                    Title("Sed Orci"),
+                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
+                ]
+            ),
+        ]
+
+        section_iter = _SectionCombiner(
+            sections, maxlen=250, combine_text_under_n_chars=250
+        ).iter_combined_sections()
+
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Lorem Ipsum"),
+            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+        ]
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TableSection)
+        assert section.table == Table("<table></table>")
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Mauris Nec"),
+            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+        ]
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _NonTextSection)
+        assert section.element == CheckBox()
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Sed Orci"),
+            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
+        ]
+        # --
+        with pytest.raises(StopIteration):
+            next(section_iter)
+
+    def it_respects_the_specified_combination_threshold(self):
+        sections = [
+            _TextSection(  # 68
+                [
+                    Title("Lorem Ipsum"),  # 11
+                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
+                ]
+            ),
+            _TextSection(  # 71
+                [
+                    Title("Mauris Nec"),  # 10
+                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),  # 59
+                ]
+            ),
+            # -- len == 141 (68 + 2 + 71, counting the "\n\n" separator)
+            _TextSection(
+                [
+                    Title("Sed Orci"),  # 8
+                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),  # 63
+                ]
+            ),
+        ]
+
+        section_iter = _SectionCombiner(
+            sections, maxlen=250, combine_text_under_n_chars=80
+        ).iter_combined_sections()
+
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Lorem Ipsum"),
+            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+            Title("Mauris Nec"),
+            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+        ]
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Sed Orci"),
+            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
+        ]
+        # --
+        with pytest.raises(StopIteration):
+            next(section_iter)
+
+    def it_respects_the_hard_maximum_window_length(self):
+        sections = [
+            _TextSection(  # 68
+                [
+                    Title("Lorem Ipsum"),  # 11
+                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
+                ]
+            ),
+            _TextSection(  # 71
+                [
+                    Title("Mauris Nec"),  # 10
+                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),  # 59
+                ]
+            ),
+            # -- len == 141 (68 + 2 + 71, counting the "\n\n" separator)
+            _TextSection(
+                [
+                    Title("Sed Orci"),  # 8
+                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),  # 63
+                ]
+            ),
+            # -- len == 216 (141 + 2 + 73, counting the "\n\n" separator)
+        ]
+
+        section_iter = _SectionCombiner(
+            sections, maxlen=200, combine_text_under_n_chars=200
+        ).iter_combined_sections()
+
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Lorem Ipsum"),
+            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+            Title("Mauris Nec"),
+            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+        ]
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Sed Orci"),
+            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
+        ]
+        # --
+        with pytest.raises(StopIteration):
+            next(section_iter)
+
+    def it_accommodates_and_isolates_an_oversized_section(self):
+        """Such as occurs when a single element exceeds the window size."""
+
+        sections = [
+            _TextSection([Title("Lorem Ipsum")]),
+            _TextSection(  # 179
+                [
+                    Text(
+                        "Lorem ipsum dolor sit amet consectetur adipiscing elit."  # 55
+                        " Mauris nec urna non augue vulputate consequat eget et nisi."  # 60
+                        " Sed orci quam, eleifend sit amet vehicula, elementum ultricies."  # 64
+                    )
+                ]
+            ),
+            _TextSection([Title("Vulputate Consequat")]),
+        ]
+
+        section_iter = _SectionCombiner(
+            sections, maxlen=150, combine_text_under_n_chars=150
+        ).iter_combined_sections()
+
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [Title("Lorem Ipsum")]
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Text(
+                "Lorem ipsum dolor sit amet consectetur adipiscing elit."
+                " Mauris nec urna non augue vulputate consequat eget et nisi."
+                " Sed orci quam, eleifend sit amet vehicula, elementum ultricies."
+            )
+        ]
+        # --
+        section = next(section_iter)
+        assert isinstance(section, _TextSection)
+        assert section._elements == [Title("Vulputate Consequat")]
+        # --
+        with pytest.raises(StopIteration):
+            next(section_iter)
+
+
+class Describe_TextSectionAccumulator:
+    """Unit-test suite for `unstructured.chunking.title._TextSectionAccumulator`."""
+
+    def it_is_empty_on_construction(self):
+        accum = _TextSectionAccumulator(maxlen=100)
+
+        assert accum.text_length == 0
+        assert accum.remaining_space == 100
+
+    def it_accumulates_sections_added_to_it(self):
+        accum = _TextSectionAccumulator(maxlen=500)
+
+        accum.add_section(
+            _TextSection(
+                [
+                    Title("Lorem Ipsum"),
+                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+                ]
+            )
+        )
+        assert accum.text_length == 68
+        assert accum.remaining_space == 430
+
+        accum.add_section(
+            _TextSection(
+                [
+                    Title("Mauris Nec"),
+                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+                ]
+            )
+        )
+        assert accum.text_length == 141
+        assert accum.remaining_space == 357
+
+    def it_generates_a_TextSection_when_flushed_and_resets_itself_to_empty(self):
+        accum = _TextSectionAccumulator(maxlen=150)
+        accum.add_section(
+            _TextSection(
+                [
+                    Title("Lorem Ipsum"),
+                    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+                ]
+            )
+        )
+        accum.add_section(
+            _TextSection(
+                [
+                    Title("Mauris Nec"),
+                    Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+                ]
+            )
+        )
+        accum.add_section(
+            _TextSection(
+                [
+                    Title("Sed Orci"),
+                    Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies quam."),
+                ]
+            )
+        )
+
+        section_iter = accum.flush()
+
+        # -- iterator generates exactly one section --
+        section = next(section_iter)
+        with pytest.raises(StopIteration):
+            next(section_iter)
+        # -- and it is a _TextSection containing all the elements --
+        assert isinstance(section, _TextSection)
+        assert section._elements == [
+            Title("Lorem Ipsum"),
+            Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
+            Title("Mauris Nec"),
+            Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
+            Title("Sed Orci"),
+            Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies quam."),
+        ]
+        assert accum.text_length == 0
+        assert accum.remaining_space == 150
+
+    def but_it_does_not_generate_a_TextSection_on_flush_when_empty(self):
+        accum = _TextSectionAccumulator(maxlen=150)
+
+        sections = list(accum.flush())
+
+        assert sections == []
+        assert accum.text_length == 0
+        assert accum.remaining_space == 150
+
+    def it_considers_separator_length_when_computing_text_length_and_remaining_space(self):
+        accum = _TextSectionAccumulator(maxlen=100)
+        accum.add_section(_TextSection([Text("abcde")]))
+        accum.add_section(_TextSection([Text("fghij")]))
+
+        # -- .text_length includes a separator ("\n\n", len==2) between each text-segment,
+        # -- so 5 + 2 + 5 = 12 here, not 5 + 5 = 10
+        assert accum.text_length == 12
+        # -- .remaining_space is reduced by the length (2) of the trailing separator which would
+        # -- go between the current text and that of the next section if one was added.
+        # -- So 100 - 12 - 2 = 86 here, not 100 - 12 = 88
+        assert accum.remaining_space == 86
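
The separator arithmetic asserted in these tests can be reproduced with a small standalone sketch. The helper names `text_length` and `remaining_space` below are hypothetical free functions over plain strings, written only to mirror the numbers in the assertions; they are not the accumulator's actual implementation.

```python
from typing import Sequence

SEPARATOR = "\n\n"  # -- blank-line separator placed between text segments


def text_length(segments: Sequence[str]) -> int:
    """Length of the segments once joined with the separator."""
    return len(SEPARATOR.join(segments))


def remaining_space(segments: Sequence[str], maxlen: int) -> int:
    """Room left for a further segment, reserving its leading separator."""
    # -- an empty accumulation needs no separator before its first segment
    reserved = len(SEPARATOR) if segments else 0
    return maxlen - text_length(segments) - reserved


segments = ["abcde", "fghij"]
length = text_length(segments)  # -- 5 + 2 + 5
space = remaining_space(segments, maxlen=100)  # -- 100 - 12 - 2
```
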