fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted #1779

scanny · 2023-10-17T22:34:16Z

Executive Summary. Introducing strict type-checking in preparation for the chunk-overlap feature revealed a type mismatch for regex-metadata between the chunking tests and the (authoritative) ElementMetadata definition. The regex-metadata aspects of chunking passed the tests but did not behave correctly in production, where the actual data structure was different. This PR fixes the two resulting bugs:

Over-chunking. The presence of regex-metadata in an element was incorrectly being interpreted as a semantic boundary, leading to such elements being isolated in their own chunks.
Discarded regex-metadata. regex-metadata present on the second or later elements in a section (chunk) was discarded.

Technical Summary

The type of ElementMetadata.regex_metadata is Dict[str, List[RegexMetadata]]. RegexMetadata is a TypedDict like {"text": "this matched", "start": 7, "end": 19}.

Multiple regexes can be specified, each with a name like "mail-stop", "version", etc. Each of those may produce its own set of matches, like:

>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
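For reference, a minimal sketch of how a structure of this shape could be produced. The RegexMetadata fields follow the description above; `collect_regex_metadata` is a hypothetical helper written for illustration and is not the library's API.

```python
import re
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    """One regex match: the matched text and its character offsets."""

    text: str
    start: int
    end: int


def collect_regex_metadata(
    text: str, patterns: Dict[str, str]
) -> Dict[str, List[RegexMetadata]]:
    """Hypothetical helper: run each named pattern over `text`.

    Names whose pattern produces no match are left out of the result.
    """
    results: Dict[str, List[RegexMetadata]] = {}
    for name, pattern in patterns.items():
        matches = [
            RegexMetadata(text=m.group(), start=m.start(), end=m.end())
            for m in re.finditer(pattern, text)
        ]
        if matches:
            results[name] = matches
    return results


regex_metadata = collect_regex_metadata(
    "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
    {"dolor": r"dolor", "ipsum": r"ipsum", "absent": r"xyzzy"},
)
```

Each regex name maps to its own list of matches, which is why a bare `List[RegexMetadata]` cannot represent more than one named regex.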

Forensic analysis

The regex-metadata feature was added by Matt Robinson on 06/16/2023 (commit 4ea7168). The regex_metadata data structure is the same as when it was added.
The chunk-by-title feature was added by Matt Robinson on 08/29/2023 (commit f6a745a). The mistaken regex-metadata data structure in the tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure and insufficient type-checking rigor (type-checker strictness level set too low) to warn of the mistake.

Over-chunking Behavior

Chunking three elements that carry regex metadata should combine them into a single chunk (a CompositeElement object), subject to the maximum-size rules (default 500 chars):

elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]

Observed behavior looked like this:

chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]

The fix changed the approach from breaking on any metadata field not in a specified group (regex_metadata was missing from this group) to breaking only on specified fields (whitelisting instead of blacklisting). This avoids over-chunking every time we add a new metadata field and is also simpler and easier to understand. This change in approach is discussed in more detail in #1790.
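To illustrate the difference between the two approaches, here is a minimal sketch that uses plain dicts in place of ElementMetadata. The field names in `BOUNDARY_FIELDS` and `IGNORED_FIELDS`, and both function names, are hypothetical and chosen only for illustration; they are not the actual lists or helpers in title.py.

```python
from typing import Any, Dict

# Hypothetical field lists, for illustration only.
BOUNDARY_FIELDS = ("section", "page_number")
# Note that regex_metadata is missing from the ignore list, as in the bug.
IGNORED_FIELDS = ("coordinates", "emphasized_text_contents")


def differs_blacklist(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """Old approach: any field not explicitly ignored can trigger a boundary."""
    keys = (set(a) | set(b)) - set(IGNORED_FIELDS)
    return any(a.get(k) != b.get(k) for k in keys)


def differs_whitelist(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """New approach: only explicitly listed fields can trigger a boundary."""
    return any(a.get(k) != b.get(k) for k in BOUNDARY_FIELDS)


title_meta = {
    "page_number": 1,
    "regex_metadata": {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]},
}
text_meta = {
    "page_number": 1,
    "regex_metadata": {"dolor": [{"text": "dolor", "start": 12, "end": 17}]},
}

# Differing regex matches split the section under the blacklist approach...
blacklist_breaks = differs_blacklist(title_meta, text_meta)
# ...but not under the whitelist approach.
whitelist_breaks = differs_whitelist(title_meta, text_meta)
```

With the whitelist, a newly added metadata field defaults to not triggering a semantic boundary unless someone deliberately adds it to the list.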

Dropping regex-metadata Behavior

Chunking this section:

elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

...should produce this regex_metadata on the single produced chunk:

assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}

but instead produced this:

regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}

Which is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside the "list-attribute-processing" loop (because regex-metadata is not a list) and process regex metadata separately.
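A rough sketch of what such a separate regex-metadata pass could look like, assuming element texts are joined with a blank line ("\n\n") as in the expected chunk text above. `consolidate_regex_metadata` and its (text, regex_metadata) tuple input are hypothetical simplifications, not the actual implementation in title.py.

```python
from typing import Dict, List, Tuple, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


TEXT_SEPARATOR = "\n\n"  # assumed separator between element texts in a chunk


def consolidate_regex_metadata(
    elements: List[Tuple[str, Dict[str, List[RegexMetadata]]]],
) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element matches by regex name, shifting offsets into chunk text."""
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0  # position of the current element's text within the chunk text
    for text, regex_meta in elements:
        for name, matches in regex_meta.items():
            for m in matches:
                consolidated.setdefault(name, []).append(
                    RegexMetadata(
                        text=m["text"],
                        start=m["start"] + offset,
                        end=m["end"] + offset,
                    )
                )
        offset += len(text) + len(TEXT_SEPARATOR)
    return consolidated


section = [
    ("Lorem Ipsum", {"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}),
    (
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        {
            "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
            "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
        },
    ),
    (
        "In rhoncus ipsum sed lectus porta volutpat.",
        {"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]},
    ),
]

chunk_text = TEXT_SEPARATOR.join(text for text, _ in section)
chunk_regex_metadata = consolidate_regex_metadata(section)
```

A handy sanity check for any implementation is that `chunk_text[m["start"]:m["end"]] == m["text"]` holds for every consolidated match.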

cragwolfe · 2023-10-17T22:53:19Z

test_unstructured/chunking/test_title.py

@@ -91,12 +99,12 @@ def test_chunk_by_title():

    assert chunks[0].metadata == ElementMetadata(emphasized_text_contents=["Day", "day"])
    assert chunks[3].metadata == ElementMetadata(
-        regex_metadata=[{"text": "A", "start": 11, "end": 12}],
+        regex_metadata={"a": [RegexMetadata(text="A", start=11, end=12)]}


why did the data structure of the value change here?

would this get json serialized / deserialized correctly?

The type of ElementMetadata.regex_metadata is Dict[str, List[RegexMetadata]]. RegexMetadata is a TypedDict like {"text": "this matched", "start": 7, "end": 19}.

My assumption is you can specify multiple regexes, each with a name like "mail-stop", "version", etc., and each of those may produce its own set of matches, like:

>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}

My guess is that this "multiple-regexes" behavior was originally "single-regex" and when it was changed the tests and chunking code weren't updated. A "single-regex" behavior would work fine with List[RegexMetadata] and that matches what the tests were before.

But this account is just my guess. Not sure who might remember about that.

On the serializing/deserializing question, I don't see why it wouldn't work fine, these are all primitive data structures and JSON serializable, but let me take a closer look.

Okay, looking at the serde code, it looks to me like this should round-trip fine. However, there are no tests confirming this that I can find, at least not in test_unstructured/documents/test_elements.py. If you want I can ticket this and write a more thorough test. I'll whip up an informal one here just to satisfy myself it actually works, but I'm thinking that's probably outside the scope of this PR to actually add it in, do you agree?

Note that the changes in this PR don't affect the type of ElementMetadata.regex_metadata or its serde in any way, they just turned up the type error when I cleaned up typing in this neighborhood and that's what led to these two changes.

They would however (indirectly) affect the serde of the chunks (CompositeElement objects mostly) in that regex meta would have been stripped from second and later section elements, so the CompositeElement would only contain matches present in the first element. Not exactly serde-related I suppose; I'm sure it round-tripped fine, just missing things to begin with :)

Yep, round-trips fine in my test.

I added a test to confirm JSON round-trip of ElementMetadata.regex_metadata.
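The informal check was along these lines. This is only a sketch on plain dicts using the standard-library json module; it does not exercise the actual ElementMetadata serde code, and the sample values are taken from the example earlier in this thread.

```python
import json

# Same shape as ElementMetadata.regex_metadata: Dict[str, List[RegexMetadata]]
regex_metadata = {
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}

# Serialize to a JSON string and parse it back.
serialized = json.dumps(regex_metadata)
restored = json.loads(serialized)
```

Because the structure contains only string keys, lists, strings and ints, the restored value compares equal to the original.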

> But this account is just my guess. Not sure who might remember about that.

git blame can provide that info :)

@cragwolfe okay, I did the forensics, turns out it was just a mistake from the start. Good advertisement for strict data typing :)

06/16/2023 4ea7168 - regex_metadata feature was added by Matt Robinson in substantially its current form: (Dict[str, List[RegexMetadata]]).

08/29/2023 f6a745a - chunk by titles was added by Matt with the wrong type for regex_metadata in the tests, as we found it.

scanny · 2023-10-18T02:32:50Z

Ok Crag, all the fixes are in for this. Let me know if there's anything else.

cragwolfe

Great description and added tests. Thank you!

The sectioner (`_split_elements_by_title_and_table()`) detects semantic unit boundaries and places those elements in separate sections. One aspect of boundary detection is `_metadata_matches()` which returns False when a difference in `metadata.regex_metadata` is detected between elements. The failure of this test demonstrates this over-chunking behavior. The undetected introduction of this behavior can be attributed to the blacklist vs. whitelist approach discussed in issue #1790. Testing for this is not reliable because new metadata fields are of course not known in advance.

Remove overly complex and risky "black-listing" approach to metadata change detection and just check the metadata fields that indicate a semantic boundary. In a way this is making the default for new metadata fields that they do not trigger a semantic boundary.

Closely related to the same mis-typing of `regex_metadata` in the test, the code that passed the old test does not consolidate regex-metadata across section elements and all regex-metadata except that in the first element of the section is dropped.

The implementation of adjusting regex-metadata match-offsets assumed the wrong data-type, so while it passed the tests, in production it dropped all regex_metadata except that in the first element of the section. In fairness, this never actually happened because the over-chunking fixed in the previous commit made any element that had regex matches show up in its own single-element chunk. Reimplement for regex-metadata of type `Dict[str, List[RegexMetadata]]` rather than `List[RegexMetadata]`.

scanny marked this pull request as ready for review October 17, 2023 22:34

scanny requested a review from amanda103 October 17, 2023 22:43

cragwolfe reviewed Oct 17, 2023

cragwolfe reviewed Oct 17, 2023

cragwolfe reviewed Oct 17, 2023

scanny force-pushed the scanny/fix-chunks-break-on-regex branch 2 times, most recently from 6860acc to 4bb57bf Compare October 18, 2023 22:03

scanny mentioned this pull request Oct 18, 2023

detect semantic unit on metadata-change with whitelist vs blacklist #1790

Closed

scanny force-pushed the scanny/fix-chunks-break-on-regex branch from cd1fab9 to 0e3e806 Compare October 19, 2023 16:58

cragwolfe approved these changes Oct 19, 2023

qued added this pull request to the merge queue Oct 19, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 19, 2023

scanny added 8 commits October 19, 2023 15:16

rfctr: chunking type-checks clean

0886c5d

changelog: add CHANGELOG entry for this PR

ba02937

fix: address PR comments

e378ce8

ingest: disable failing tests while Roman is working on them

1fc6a87

scanny force-pushed the scanny/fix-chunks-break-on-regex branch from c8476ed to 1fc6a87 Compare October 19, 2023 22:19

qued added this pull request to the merge queue Oct 19, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 20, 2023

qued merged commit d9c2516 into main Oct 20, 2023
39 checks passed

qued deleted the scanny/fix-chunks-break-on-regex branch October 20, 2023 03:16
