fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted #1779

Merged
merged 8 commits into from
Oct 20, 2023

Collaborator

@scanny scanny commented Oct 17, 2023

Executive Summary. Introducing strict type-checking in preparation for the chunk-overlap feature revealed a type mismatch for regex-metadata between the chunking tests and the (authoritative) ElementMetadata definition. The regex-metadata aspects of chunking passed the tests but did not behave correctly in production, where the actual data structure was different. This PR fixes the two resulting bugs:

  1. Over-chunking. The presence of regex-metadata in an element was incorrectly being interpreted as a semantic boundary, leading to such elements being isolated in their own chunks.

  2. Discarded regex-metadata. regex-metadata present on the second or later elements in a section (chunk) was discarded.

Technical Summary

The type of ElementMetadata.regex_metadata is Dict[str, List[RegexMetadata]]. RegexMetadata is a TypedDict like {"text": "this matched", "start": 7, "end": 19}.

Multiple regexes can be specified, each with a name like "mail-stop", "version", etc. Each of those may produce its own set of matches, like:

>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
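
As a rough illustration of how a structure like this could arise, the sketch below runs several named patterns over a string and collects the matches per name. The helper `collect_regex_metadata`, the sample text, and the patterns are hypothetical and are not the library's implementation; only the `RegexMetadata` shape and the `Dict[str, List[RegexMetadata]]` type come from the description above.

```python
import re
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    """One regex match: the matched text and its character offsets."""

    text: str
    start: int
    end: int


def collect_regex_metadata(
    text: str, patterns: Dict[str, str]
) -> Dict[str, List[RegexMetadata]]:
    """Hypothetical helper: map each pattern name to the matches it produced."""
    results: Dict[str, List[RegexMetadata]] = {}
    for name, pattern in patterns.items():
        matches: List[RegexMetadata] = [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
        if matches:  # omit names whose pattern matched nothing
            results[name] = matches
    return results


sample = "Deliver to MS-107 before noon."
regex_metadata = collect_regex_metadata(
    sample, {"mail-stop": r"MS-\d+", "zip": r"\d{5}"}
)
```

Each key is a pattern name and each value is a list, so a consumer has to iterate over the dict's values before it reaches individual matches. Code written for a flat `List[RegexMetadata]` would not do that.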

Forensic analysis

  • The regex-metadata feature was added by Matt Robinson on 06/16/2023 (commit 4ea7168). The regex_metadata data structure is the same as when it was added.

  • The chunk-by-title feature was added by Matt Robinson on 08/29/2023 (commit f6a745a). The mistaken regex-metadata data structure in the tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure and insufficient type-checking rigor (type-checker strictness level set too low) to warn of the mistake.

Over-chunking Behavior

The expected behavior looks like this:

Chunking three elements with regex metadata should combine them into a single chunk (a CompositeElement object), subject to maximum-size rules (default 500 chars).

elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]

Observed behavior looked like this:

chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]

The fix changes the approach from breaking on any metadata field not in a specified group (regex_metadata was missing from this group) to breaking only on specified fields (whitelisting instead of blacklisting). This avoids over-chunking every time a new metadata field is added, and is also simpler and easier to understand. The change in approach is discussed in more detail in #1790.
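
A minimal sketch of the whitelisting idea, using plain dicts in place of ElementMetadata. The function name `metadata_differs` and the contents of `BOUNDARY_FIELDS` are assumptions made for illustration; the description above does not list which fields the real implementation checks.

```python
from typing import Any, Dict, Tuple

# Hypothetical whitelist. The real set of boundary-defining fields is not
# given in this description; "section" is only a placeholder field name.
BOUNDARY_FIELDS: Tuple[str, ...] = ("section",)


def metadata_differs(prior: Dict[str, Any], current: Dict[str, Any]) -> bool:
    """Return True only when a whitelisted field differs between two elements."""
    return any(prior.get(f) != current.get(f) for f in BOUNDARY_FIELDS)


title_meta = {
    "section": "Intro",
    "regex_metadata": {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]},
}
text_meta = {
    "section": "Intro",
    "regex_metadata": {"dolor": [{"text": "dolor", "start": 12, "end": 17}]},
}
next_meta = {"section": "Methods"}

same_chunk = not metadata_differs(title_meta, text_meta)  # regex_metadata ignored
new_chunk = metadata_differs(text_meta, next_meta)  # whitelisted field changed
```

Under this scheme a metadata field added later does not trigger a boundary unless someone deliberately adds it to the whitelist.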

Dropping regex-metadata Behavior

Chunking this section:

elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

...should produce this regex_metadata on the single resulting chunk:

assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}

but instead produced this:

regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}

This is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside the "list-attribute-processing" loop (because regex-metadata is not a list) and process regex metadata separately.
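
The consolidation and offset adjustment can be sketched roughly as follows. This is not the PR's code: `consolidate_regex_metadata` and its `(text, regex_metadata)` tuple input are hypothetical, and the `"\n\n"` separator is taken from the chunk text in the example above. Each match is shifted by the position at which its element's text begins in the combined chunk text.

```python
from typing import Dict, List, Tuple, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


RegexMeta = Dict[str, List[RegexMetadata]]


def consolidate_regex_metadata(
    parts: List[Tuple[str, RegexMeta]], separator: str = "\n\n"
) -> RegexMeta:
    """Merge per-element matches, shifting offsets to chunk-text positions."""
    consolidated: RegexMeta = {}
    offset = 0  # where the current element's text starts in the chunk text
    for text, regex_meta in parts:
        for name, matches in regex_meta.items():
            for m in matches:
                consolidated.setdefault(name, []).append(
                    {
                        "text": m["text"],
                        "start": m["start"] + offset,
                        "end": m["end"] + offset,
                    }
                )
        offset += len(text) + len(separator)
    return consolidated


parts: List[Tuple[str, RegexMeta]] = [
    ("Lorem Ipsum", {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}),
    (
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        {
            "dolor": [{"text": "dolor", "start": 12, "end": 17}],
            "ipsum": [{"text": "ipsum", "start": 6, "end": 11}],
        },
    ),
    (
        "In rhoncus ipsum sed lectus porta volutpat.",
        {"ipsum": [{"text": "ipsum", "start": 11, "end": 16}]},
    ),
]
chunk_text = "\n\n".join(text for text, _ in parts)
merged = consolidate_regex_metadata(parts)
```

With the three elements from the example, the second element starts at offset 13 and the third at offset 70, which gives the adjusted positions shown in the expected regex_metadata above.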

@scanny scanny marked this pull request as ready for review October 17, 2023 22:34
@scanny scanny requested a review from amanda103 October 17, 2023 22:43
@@ -91,12 +99,12 @@ def test_chunk_by_title():

  assert chunks[0].metadata == ElementMetadata(emphasized_text_contents=["Day", "day"])
  assert chunks[3].metadata == ElementMetadata(
-     regex_metadata=[{"text": "A", "start": 11, "end": 12}],
+     regex_metadata={"a": [RegexMetadata(text="A", start=11, end=12)]}
Contributor

why did the data structure of the value change here?

would this get json serialized / deserialized correctly?

Collaborator Author

@scanny scanny Oct 17, 2023

The type of ElementMetadata.regex_metadata is Dict[str, List[RegexMetadata]]. RegexMetadata is a TypedDict like {"text": "this matched", "start": 7, "end": 19}.

My assumption is you can specify multiple regexes, each with a name like "mail-stop", "version", etc., and each of those may produce its own set of matches, like:

>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}

My guess is that this "multiple-regexes" behavior was originally "single-regex" and when it was changed the tests and chunking code weren't updated. A "single-regex" behavior would work fine with List[RegexMetadata] and that matches what the tests were before.

But this account is just my guess. Not sure who might remember about that.

Collaborator Author

On the serializing/deserializing question, I don't see why it wouldn't work fine, these are all primitive data structures and JSON serializable, but let me take a closer look.

Collaborator Author

@scanny scanny Oct 18, 2023

Okay, looking at the serde code, it looks to me like this should round-trip fine. However, there are no tests confirming this that I can find, at least not in test_unstructured/documents/test_elements.py. If you want I can ticket this and write a more thorough test. I'll whip up an informal one here just to satisfy myself it actually works, but I'm thinking it's probably outside the scope of this PR to actually add it. Do you agree?

Note that the changes in this PR don't affect the type of ElementMetadata.regex_metadata or its serde in any way, they just turned up the type error when I cleaned up typing in this neighborhood and that's what led to these two changes.

They would however (indirectly) affect the serde of the chunks (CompositeElement objects mostly) in that regex meta would have been stripped from second and later section elements, so the CompositeElement would only contain matches present in the first element. Not exactly serde-related I suppose; I'm sure it round-tripped fine, just missing things to begin with :)

Collaborator Author

Yep, round-trips fine in my test.

Collaborator Author

I added a test to confirm JSON round-trip of ElementMetadata.regex_metadata.
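
For reference, an informal round-trip check along these lines might look like the sketch below. It uses only the standard-library `json` module on the plain data structure, not the library's own serde code, and the wrapping `{"regex_metadata": ...}` payload is an assumption made for illustration rather than the test that was added.

```python
import json
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    text: str
    start: int
    end: int


regex_metadata: Dict[str, List[RegexMetadata]] = {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
    ],
}

# A TypedDict instance is a plain dict at runtime, so it serializes as-is.
serialized = json.dumps({"regex_metadata": regex_metadata})
restored = json.loads(serialized)["regex_metadata"]
```

Because the structure contains only dicts, lists, strings, and ints, the restored value compares equal to the original.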

Contributor

> But this account is just my guess. Not sure who might remember about that.

git blame can provide that info :)

Collaborator Author

@cragwolfe okay, I did the forensics, turns out it was just a mistake from the start. Good advertisement for strict data typing :)

  • 06/16/2023 4ea7168 - regex_metadata feature was added by Matt Robinson in substantially its current form: (Dict[str, List[RegexMetadata]]).
  • 08/29/2023 f6a745a - chunk by titles was added by Matt with the wrong type for regex_metadata in the tests, as we found it.

Collaborator Author

scanny commented Oct 18, 2023

Ok Crag, all the fixes are in for this. Let me know if there's anything else.

@scanny scanny force-pushed the scanny/fix-chunks-break-on-regex branch 2 times, most recently from 6860acc to 4bb57bf Compare October 18, 2023 22:03
@scanny scanny force-pushed the scanny/fix-chunks-break-on-regex branch from cd1fab9 to 0e3e806 Compare October 19, 2023 16:58
Contributor

@cragwolfe cragwolfe left a comment

Great description and added tests. Thank you!

@qued qued added this pull request to the merge queue Oct 19, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 19, 2023

The sectioner (`_split_elements_by_title_and_table()`) detects semantic
unit boundaries and places those elements in separate sections. One
aspect of boundary detection is `_metadata_matches()` which returns
False when a difference in `metadata.regex_metadata` is detected between
elements.

The failure of this test demonstrates this over-chunking behavior.

The undetected introduction of this behavior can be attributed to the
blacklist vs. whitelist approach discussed in issue #1790. Testing for
this is not reliable because new metadata fields are of course not known
in advance.

Remove overly complex and risky "black-listing" approach to metadata
change detection and just check the metadata fields that indicate a
semantic boundary.

In a way this is making the default for new metadata fields that they do
not trigger a semantic boundary.

Closely related to the same mis-typing of `regex_metadata` in the test,
the code that passed the old test does not consolidate regex-metadata
across section elements and all regex-metadata except that in the first
element of the section is dropped.

The implementation of adjusting regex-metadata match-offsets assumed
the wrong data-type so while it passed the tests, in production it
dropped all regex_metadata except that in the first element of the section.

In fairness, this never actually happened because the overchunking fixed
in the previous commit made any element that had regex matches show up
in its own single-element chunk.

Reimplement for regex-metadata of type `Dict[str, List[RegexMetadata]]`
rather than `List[RegexMetadata]`.

@scanny scanny force-pushed the scanny/fix-chunks-break-on-regex branch from c8476ed to 1fc6a87 Compare October 19, 2023 22:19
@qued qued added this pull request to the merge queue Oct 19, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 20, 2023
@qued qued merged commit d9c2516 into main Oct 20, 2023
39 checks passed
@qued qued deleted the scanny/fix-chunks-break-on-regex branch October 20, 2023 03:16