Skip to content

Commit

Permalink
fix: split-chunks appear out-of-order (#1824)
Browse files Browse the repository at this point in the history
**Executive Summary.** Code inspection in preparation for adding the
chunk-overlap feature revealed a bug causing split-chunks to be inserted
out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

Should produce chunks:
```
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("Four")              # (400 chars)
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
```

but produced this instead:
```
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Four")              # (400 chars)
```

This PR fixes that behavior that was introduced on Oct 9 this year in
commit: f98d5e6 when adding chunk splitting.


**Technical Summary**

The essential transformation of chunking is:

```
elements          sections              chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_
semantically-related elements into _sections_ (`List[Element]`), in the
best case, that would be a title (heading) and the text that follows it
(until the next title). A heading and its text is often referred to as a
_section_ in publishing parlance, hence the name.

2. The _chunker_ (`chunk_by_title()` currently) does two things:
1. first it _consolidates_ the elements of each section into a single
`ConsolidatedElement` object (a "chunk"). This includes both joining the
element text into a single string as well as consolidating the metadata
of the section elements.
2. then if necessary it _splits_ the chunk into two or more
`ConsolidatedElement` objects when the consolidated text is too long to
fit in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big
paragraph) has text longer than the specified window. Otherwise a
section and the chunk that derives from it reflects an even element
boundary.

`chunk_by_title()` was elaborated in commit f98d5e6 to add this
"chunk-splitting" behavior.

At the time there was some notion of wanting to "split from the end
backward" such that any small remainder chunk would appear first, and
could possibly be combined with a small prior chunk. To accomplish this,
split chunks were _inserted_ at the beginning of the list instead of
_appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the
sequence of chunks that result from the chunking operation and is the
returned value for `chunk_by_title()`. This was the list
"split-from-the-end" chunks were inserted at the beginning of and that
unfortunately produces this out-of-order behavior because the insertion
was at the beginning of this "all-chunks-in-document" list, not a
sublist just for this chunk.

Further, the "split-from-the-end" behavior can produce no benefit
because chunks are never combined, only _elements_ are combined (across
semantic boundaries into a single section when a section is small) and
sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage to a straighforward
iterative algorithm that works both when a chunk must be split and when
it doesn't. This algorithm is also very easily extended to implement
split-chunk-overlap which is coming up in an immediately following PR.

```python
# -- split chunk into CompositeElements objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```

*Forensic analysis*
The out-of-order-chunks behavior was introduced in commit 4ea7168 on
10/09/2023 in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <[email protected]>
Co-authored-by: shreyanid <[email protected]>
  • Loading branch information
3 people authored Oct 21, 2023
1 parent ce40cdc commit 82c8adb
Show file tree
Hide file tree
Showing 4 changed files with 32 additions and 23 deletions.
6 changes: 3 additions & 3 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## 0.10.25-dev9
## 0.10.25

### Enhancements

Expand All @@ -19,10 +19,10 @@ ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the en
* **Fix chunks breaking on regex-metadata matches.** Fixes "over-chunking" when `regex_metadata` was used, where every element that contained a regex-match would start a new chunk.
* **Fix regex-metadata match offsets not adjusted within chunk.** Fixes incorrect regex-metadata match start/stop offset in chunks where multiple elements are combined.
* **Map source cli command configs when destination set** Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectoy, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.
* **Fix metrics folder not discoverable** Fixes issue where unstructured/metrics folder is not discoverable on PyPI by adding
an `__init__.py` file under the folder.
* **Fix metrics folder not discoverable** Fixes issue where unstructured/metrics folder is not discoverable on PyPI by adding an `__init__.py` file under the folder.
* **Fix a bug when `parition_pdf` get `model_name=None`** In API usage the `model_name` value is `None` and the `cast` function in `partition_pdf` would return `None` and lead to attribution error. Now we use `str` function to explicit convert the content to string so it is garanteed to have `starts_with` and other string functions as attributes
* **Fix html partition fail on tables without `tbody` tag** HTML tables may sometimes just contain headers without body (`tbody` tag)
* **Fix out-of-order sequencing of split chunks.** Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded `max_characters`.

## 0.10.24

Expand Down
18 changes: 18 additions & 0 deletions test_unstructured/chunking/test_title.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,24 @@
from unstructured.partition.html import partition_html


def test_it_splits_a_large_section_into_multiple_chunks():
elements: List[Element] = [
Title("Introduction"),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
" porta volutpat."
),
]

chunks = chunk_by_title(elements, combine_text_under_n_chars=50, max_characters=50)

assert chunks == [
CompositeElement("Introduction"),
CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing "),
CompositeElement("elit. In rhoncus ipsum sed lectus porta volutpat."),
]


def test_split_elements_by_title_and_table():
elements: List[Element] = [
Title("A Great Day"),
Expand Down
2 changes: 1 addition & 1 deletion unstructured/__version__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "0.10.25-dev9" # pragma: no cover
__version__ = "0.10.25" # pragma: no cover
29 changes: 10 additions & 19 deletions unstructured/chunking/title.py
Original file line number Diff line number Diff line change
Expand Up @@ -152,25 +152,16 @@ def chunk_by_title(
chunk_matches.extend(matches)
chunk_regex_metadata[regex_name] = chunk_matches

# Check if text exceeds max_characters
if len(text) > max_characters:
# Chunk the text from the end to the beginning
while len(text) > 0:
if len(text) <= max_characters:
# If the remaining text is shorter than max_characters
# create a chunk from the beginning
chunk_text = text
text = ""
else:
# Otherwise, create a chunk from the end
chunk_text = text[-max_characters:]
text = text[:-max_characters]

# Prepend the chunk to the beginning of the list
chunked_elements.insert(0, CompositeElement(text=chunk_text, metadata=metadata))
else:
# If it doesn't exceed, create a single CompositeElement
chunked_elements.append(CompositeElement(text=text, metadata=metadata))
# -- split chunk into CompositeElements objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
end = min(start + max_characters, text_len)
chunked_elements.append(CompositeElement(text=text[start:end], metadata=metadata))
start = end
remaining = text_len - end

return chunked_elements

Expand Down

0 comments on commit 82c8adb

Please sign in to comment.