fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles

**Executive Summary.** Section titles are often combined with the prior
section and are then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
By default, chunking is configured to combine sequential small chunks
that together fit within the full chunk window (default 500 chars).

Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.

Example:
```python
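from typing import List

from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import CompositeElement, Element, Text, Title

elements: List[Element] = [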
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```
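
The want/got difference above can be reproduced with a toy model that
uses plain strings instead of `unstructured` elements. This is only a
sketch: `ElementSketch`, `joined_len()`, `pack_elements_greedily()` and
`split_on_titles()` are hypothetical names invented for illustration and
are not the library's code. The greedy packer stands in for the net
effect of element-level combining, not for the actual sectioner logic.

```python
from typing import List, Tuple

SEPARATOR_LEN = 2  # length of the "\n\n" that joins element texts within a chunk

# -- hypothetical stand-in for an element: (kind, text) --
ElementSketch = Tuple[str, str]


def joined_len(texts: List[str]) -> int:
    """Length of `texts` once joined by blank-line separators."""
    return sum(len(t) for t in texts) + SEPARATOR_LEN * max(len(texts) - 1, 0)


def pack_elements_greedily(elements: List[ElementSketch], maxlen: int) -> List[List[str]]:
    """Toy element-level combining: keep adding elements while the next one fits."""
    sections: List[List[str]] = []
    current: List[str] = []
    for _kind, text in elements:
        if current and joined_len(current + [text]) > maxlen:
            sections.append(current)
            current = []
        current.append(text)
    if current:
        sections.append(current)
    return sections


def split_on_titles(elements: List[ElementSketch]) -> List[List[str]]:
    """Form whole sections first: each Title starts a new section."""
    sections: List[List[str]] = []
    for kind, text in elements:
        if kind == "Title" or not sections:
            sections.append([])
        sections[-1].append(text)
    return sections


elements = [
    ("Title", "Lorem Ipsum"),
    ("Text", "Lorem ipsum dolor sit amet consectetur adipiscing elit."),
    ("Title", "Rhoncus"),
    ("Text", "In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),
]

greedy_sections = pack_elements_greedily(elements, maxlen=80)
whole_sections = split_on_titles(elements)
```

In this model the greedy packer appends "Rhoncus" to the first section
because 68 + 2 + 7 = 77 still fits in 80. Splitting on titles first
yields whole sections of 68 and 66 characters, and since
68 + 2 + 66 = 136 exceeds 80, a section-level combiner would leave them
separate.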

**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.
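
As a rough illustration of such a stream processor, the sketch below
combines whole sections in a generator. It assumes a section can be
modeled as a list of element texts. `SectionSketch`, `section_len()` and
`iter_combined()` are hypothetical names, and the two stopping rules
(soft threshold reached, hard maximum exceeded) are inferred from the
unit tests in this commit rather than copied from the implementation.

```python
from typing import Iterable, Iterator, List

SEPARATOR_LEN = 2  # "\n\n" between adjacent texts

# -- hypothetical stand-in: a section is the list of its element texts --
SectionSketch = List[str]


def section_len(section: SectionSketch) -> int:
    """Text length of a section, counting separators between its texts."""
    return sum(len(t) for t in section) + SEPARATOR_LEN * max(len(section) - 1, 0)


def iter_combined(
    sections: Iterable[SectionSketch], maxlen: int, combine_under: int
) -> Iterator[SectionSketch]:
    """Emit sections, merging successive small ones that fit in the window."""
    pending: SectionSketch = []
    for section in sections:
        if pending:
            # -- stop combining once the pending text reaches the soft threshold --
            threshold_reached = section_len(pending) >= combine_under
            # -- or when adding the whole next section would overflow the window --
            combined_len = section_len(pending) + SEPARATOR_LEN + section_len(section)
            if threshold_reached or combined_len > maxlen:
                yield pending
                pending = []
        pending = pending + section
    if pending:
        yield pending


# -- three sections with the same lengths used in the tests: 68, 71 and 73 --
small_sections = [["a" * 68], ["b" * 71], ["c" * 73]]
by_threshold = list(iter_combined(small_sections, maxlen=250, combine_under=80))
by_hard_max = list(iter_combined(small_sections, maxlen=200, combine_under=200))
all_in_one = list(iter_combined(small_sections, maxlen=250, combine_under=250))
```

Because only whole sections are ever appended to `pending`, a title can
never be separated from the section it introduces.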

The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length`
property, which allow a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length, and doing the actual combining (calling
`_TextSection.combine(other_section)`) when instructed. It is very
similar in concept to `_TextSectionBuilder`, just at the section level
instead of the element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
a filter between the sectioner and the chunker.
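
The accumulator's bookkeeping can be sketched as follows. This is a
simplified model, not the code added by this commit:
`TextSectionSketch` and `AccumulatorSketch` are hypothetical classes
that hold plain strings, and their behavior is inferred from what the
unit tests below assert (separator-aware `text_length`, a
`remaining_space` that reserves room for one more separator, and a
`flush()` that yields at most one combined section).

```python
from typing import Iterator, List

SEPARATOR_LEN = 2  # "\n\n" between adjacent texts


class TextSectionSketch:
    """Hypothetical stand-in for a text section: only the texts of its elements."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)

    @property
    def text_length(self) -> int:
        n = len(self.texts)
        return sum(len(t) for t in self.texts) + SEPARATOR_LEN * max(n - 1, 0)

    def combine(self, other: "TextSectionSketch") -> "TextSectionSketch":
        """Return a new section holding this section's texts, then `other`'s."""
        return TextSectionSketch(self.texts + other.texts)


class AccumulatorSketch:
    """Collects whole text sections until the caller decides to flush them."""

    def __init__(self, maxlen: int) -> None:
        self._maxlen = maxlen
        self._sections: List[TextSectionSketch] = []

    def add_section(self, section: TextSectionSketch) -> None:
        self._sections.append(section)

    @property
    def text_length(self) -> int:
        n = len(self._sections)
        return sum(s.text_length for s in self._sections) + SEPARATOR_LEN * max(n - 1, 0)

    @property
    def remaining_space(self) -> int:
        # -- an empty accumulator needs no separator before the next section --
        if not self._sections:
            return self._maxlen
        return self._maxlen - self.text_length - SEPARATOR_LEN

    def flush(self) -> Iterator[TextSectionSketch]:
        """Yield one combined section (none when empty) and reset to empty."""
        if not self._sections:
            return
        combined = self._sections[0]
        for section in self._sections[1:]:
            combined = combined.combine(section)
        self._sections = []
        yield combined


accum = AccumulatorSketch(maxlen=100)
accum.add_section(TextSectionSketch(["abcde"]))
accum.add_section(TextSectionSketch(["fghij"]))
length_before_flush = accum.text_length  # 5 + 2 + 5
space_before_flush = accum.remaining_space  # 100 - 12 - 2
flushed = list(accum.flush())
```

Keeping the length arithmetic in the accumulator lets the combiner
decide when to flush using only `text_length` and `remaining_space`.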
scanny authored Oct 30, 2023
1 parent 76213ec commit 7373391
Showing 3 changed files with 488 additions and 20 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@
* **Fix wrong logger for paddle info** Replace the logger from unstructured-inference with the logger from unstructured for paddle_ocr.py module.
* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
* **Fix unnecessary mid-text chunk-splitting.** The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated producing an over-sized chunk that required mid-text splitting.
* **Fix frequent dissociation of title from chunk.** The sectioning algorithm included the title of the next section with the prior section whenever it would fit, frequently producing association of a section title with the prior section and dissociating it from its actual section. Fix this by performing combination of whole sections only.

## 0.10.27

349 changes: 347 additions & 2 deletions test_unstructured/chunking/test_title.py
@@ -6,9 +6,11 @@

from unstructured.chunking.title import (
_NonTextSection,
_SectionCombiner,
_split_elements_by_title_and_table,
_TableSection,
_TextSection,
_TextSectionAccumulator,
_TextSectionBuilder,
chunk_by_title,
)
@@ -199,7 +201,6 @@ def test_split_elements_by_title_and_table():
sections = _split_elements_by_title_and_table(
elements,
multipage_sections=True,
combine_text_under_n_chars=0,
new_after_n_chars=500,
max_characters=500,
)
@@ -734,7 +735,7 @@ def it_provides_access_to_its_elements(self):


class Describe_TextSectionBuilder:
"""Unit-test suite for `unstructured.chunking.title._TextSection objects."""
"""Unit-test suite for `unstructured.chunking.title._TextSectionBuilder`."""

def it_is_empty_on_construction(self):
builder = _TextSectionBuilder(maxlen=50)
@@ -802,3 +803,347 @@ def it_considers_separator_length_when_computing_text_length_and_remaining_space
# -- between the current text and that of the next element if one was added.
# -- So 50 - 12 - 2 = 36 here, not 50 - 12 = 38
assert builder.remaining_space == 36


# == SectionCombiner =============================================================================


class Describe_SectionCombiner:
"""Unit-test suite for `unstructured.chunking.title._SectionCombiner`."""

def it_combines_sequential_small_text_sections(self):
sections = [
_TextSection(
[
Title("Lorem Ipsum"), # 11
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."), # 55
]
),
_TextSection(
[
Title("Mauris Nec"), # 10
Text("Mauris nec urna non augue vulputate consequat eget et nisi."), # 59
]
),
_TextSection(
[
Title("Sed Orci"), # 8
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."), # 63
]
),
]

section_iter = _SectionCombiner(
sections, maxlen=250, combine_text_under_n_chars=250
).iter_combined_sections()

section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
]
with pytest.raises(StopIteration):
next(section_iter)

def but_it_does_not_combine_table_or_non_text_sections(self):
sections = [
_TextSection(
[
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
]
),
_TableSection(Table("<table></table>")),
_TextSection(
[
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
]
),
_NonTextSection(CheckBox()),
_TextSection(
[
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
]
),
]

section_iter = _SectionCombiner(
sections, maxlen=250, combine_text_under_n_chars=250
).iter_combined_sections()

section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
]
# --
section = next(section_iter)
assert isinstance(section, _TableSection)
assert section.table == Table("<table></table>")
# --
section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
]
# --
section = next(section_iter)
assert isinstance(section, _NonTextSection)
assert section.element == CheckBox()
# --
section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
]
# --
with pytest.raises(StopIteration):
next(section_iter)

def it_respects_the_specified_combination_threshold(self):
sections = [
_TextSection( # 68
[
Title("Lorem Ipsum"), # 11
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."), # 55
]
),
_TextSection( # 71
[
Title("Mauris Nec"), # 10
Text("Mauris nec urna non augue vulputate consequat eget et nisi."), # 59
]
),
# -- len == 141 (68 + 2 + 71, separator included)
_TextSection(
[
Title("Sed Orci"), # 8
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."), # 63
]
),
]

section_iter = _SectionCombiner(
sections, maxlen=250, combine_text_under_n_chars=80
).iter_combined_sections()

section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
]
# --
section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
]
# --
with pytest.raises(StopIteration):
next(section_iter)

def it_respects_the_hard_maximum_window_length(self):
sections = [
_TextSection( # 68
[
Title("Lorem Ipsum"), # 11
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."), # 55
]
),
_TextSection( # 71
[
Title("Mauris Nec"), # 10
Text("Mauris nec urna non augue vulputate consequat eget et nisi."), # 59
]
),
# -- len == 141 (68 + 2 + 71, separator included)
_TextSection(
[
Title("Sed Orci"), # 8
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."), # 63
]
),
# -- len == 216 (141 + 2 + 73, separator included)
]

section_iter = _SectionCombiner(
sections, maxlen=200, combine_text_under_n_chars=200
).iter_combined_sections()

section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
]
# --
section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies."),
]
# --
with pytest.raises(StopIteration):
next(section_iter)

def it_accommodates_and_isolates_an_oversized_section(self):
"""Such as occurs when a single element exceeds the window size."""

sections = [
_TextSection([Title("Lorem Ipsum")]),
_TextSection( # 179
[
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit." # 55
" Mauris nec urna non augue vulputate consequat eget et nisi." # 60
" Sed orci quam, eleifend sit amet vehicula, elementum ultricies." # 64
)
]
),
_TextSection([Title("Vulputate Consequat")]),
]

section_iter = _SectionCombiner(
sections, maxlen=150, combine_text_under_n_chars=150
).iter_combined_sections()

section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [Title("Lorem Ipsum")]
# --
section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit."
" Mauris nec urna non augue vulputate consequat eget et nisi."
" Sed orci quam, eleifend sit amet vehicula, elementum ultricies."
)
]
# --
section = next(section_iter)
assert isinstance(section, _TextSection)
assert section._elements == [Title("Vulputate Consequat")]
# --
with pytest.raises(StopIteration):
next(section_iter)


class Describe_TextSectionAccumulator:
"""Unit-test suite for `unstructured.chunking.title._TextSectionAccumulator`."""

def it_is_empty_on_construction(self):
accum = _TextSectionAccumulator(maxlen=100)

assert accum.text_length == 0
assert accum.remaining_space == 100

def it_accumulates_sections_added_to_it(self):
accum = _TextSectionAccumulator(maxlen=500)

accum.add_section(
_TextSection(
[
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
]
)
)
assert accum.text_length == 68
assert accum.remaining_space == 430

accum.add_section(
_TextSection(
[
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
]
)
)
assert accum.text_length == 141
assert accum.remaining_space == 357

def it_generates_a_TextSection_when_flushed_and_resets_itself_to_empty(self):
accum = _TextSectionAccumulator(maxlen=150)
accum.add_section(
_TextSection(
[
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
]
)
)
accum.add_section(
_TextSection(
[
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
]
)
)
accum.add_section(
_TextSection(
[
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies quam."),
]
)
)

section_iter = accum.flush()

# -- iterator generates exactly one section --
section = next(section_iter)
with pytest.raises(StopIteration):
next(section_iter)
# -- and it is a _TextSection containing all the elements --
assert isinstance(section, _TextSection)
assert section._elements == [
Title("Lorem Ipsum"),
Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),
Title("Mauris Nec"),
Text("Mauris nec urna non augue vulputate consequat eget et nisi."),
Title("Sed Orci"),
Text("Sed orci quam, eleifend sit amet vehicula, elementum ultricies quam."),
]
assert accum.text_length == 0
assert accum.remaining_space == 150

def but_it_does_not_generate_a_TextSection_on_flush_when_empty(self):
accum = _TextSectionAccumulator(maxlen=150)

sections = list(accum.flush())

assert sections == []
assert accum.text_length == 0
assert accum.remaining_space == 150

def it_considers_separator_length_when_computing_text_length_and_remaining_space(self):
accum = _TextSectionAccumulator(maxlen=100)
accum.add_section(_TextSection([Text("abcde")]))
accum.add_section(_TextSection([Text("fghij")]))

# -- .text_length includes a separator ("\n\n", len==2) between each text-segment,
# -- so 5 + 2 + 5 = 12 here, not 5 + 5 = 10
assert accum.text_length == 12
# -- .remaining_space is reduced by the length (2) of the trailing separator which would
# -- go between the current text and that of the next section if one was added.
# -- So 100 - 12 - 2 = 86 here, not 100 - 12 = 88
assert accum.remaining_space == 86