fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted #1779
Conversation
@@ -91,12 +99,12 @@ def test_chunk_by_title():
     assert chunks[0].metadata == ElementMetadata(emphasized_text_contents=["Day", "day"])
     assert chunks[3].metadata == ElementMetadata(
-        regex_metadata=[{"text": "A", "start": 11, "end": 12}],
+        regex_metadata={"a": [RegexMetadata(text="A", start=11, end=12)]}
why did the data structure of the value change here?
would this get json serialized / deserialized correctly?
The type of `ElementMetadata.regex_metadata` is `Dict[str, List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text": "this matched", "start": 7, "end": 19}`.
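For concreteness, a rough sketch of those types as described above (not copied from `unstructured.documents.elements`, so details may differ):

```python
from typing import Dict, List, TypedDict


class RegexMetadata(TypedDict):
    """One regex match: the matched text and its character offsets in the element text."""
    text: str
    start: int
    end: int


# ElementMetadata.regex_metadata maps a regex name (e.g. "mail-stop") to its matches.
RegexMetadataField = Dict[str, List[RegexMetadata]]
```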
My assumption is you can specify multiple regexes, each with a name like "mail-stop", "version", etc., and each of those may produce its own set of matches, like:
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
My guess is that this "multiple-regexes" behavior was originally "single-regex", and when it was changed, the tests and chunking code weren't updated. A "single-regex" behavior would work fine with `List[RegexMetadata]`, and that matches what the tests were before.
But this account is just my guess. Not sure who might remember about that.
On the serializing/deserializing question, I don't see why it wouldn't work fine: these are all primitive data structures and JSON-serializable. But let me take a closer look.
Okay, looking at the serde code, it looks to me like this should round-trip fine. However, there are no tests confirming this that I can find, at least not in `test_unstructured/documents/test_elements.py`. If you want, I can ticket this and write a more thorough test. I'll whip up an informal one here just to satisfy myself it actually works, but I'm thinking that actually adding it is probably outside the scope of this PR; do you agree?

Note that the changes in this PR don't affect the type of `ElementMetadata.regex_metadata` or its serde in any way; they just turned up the type error when I cleaned up typing in this neighborhood, and that's what led to these two changes.

They would, however, (indirectly) affect the serde of the chunks (mostly `CompositeElement` objects) in that regex metadata would have been stripped from second and later section elements, so the `CompositeElement` would only contain matches present in the first element. Not exactly serde-related, I suppose; I'm sure it round-tripped fine, it was just missing things to begin with :)
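A minimal sketch of that informal check, using plain `json` on the illustrative regex-metadata values from earlier in the thread:

```python
import json

# regex_metadata as described above: regex name -> list of matches (illustrative values).
regex_metadata = {
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}

# Round-trip through JSON; everything here is a primitive type, so it compares equal.
assert json.loads(json.dumps(regex_metadata)) == regex_metadata
```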
Yep, round-trips fine in my test.
I added a test to confirm JSON round-trip of `ElementMetadata.regex_metadata`.
> But this account is just my guess. Not sure who might remember about that.

`git blame` can provide that info :)
@cragwolfe okay, I did the forensics; it turns out it was just a mistake from the start. Good advertisement for strict data typing :)
Ok Crag, all the fixes are in for this. Let me know if there's anything else.
force-pushed from 6860acc to 4bb57bf
force-pushed from cd1fab9 to 0e3e806
Great description and added tests. Thank you!
The sectioner (`_split_elements_by_title_and_table()`) detects semantic unit boundaries and places those elements in separate sections. One aspect of boundary detection is `_metadata_matches()` which returns False when a difference in `metadata.regex_metadata` is detected between elements. The failure of this test demonstrates this over-chunking behavior. The undetected introduction of this behavior can be attributed to the blacklist vs. whitelist approach discussed in issue #1790. Testing for this is not reliable because new metadata fields are of course not known in advance.
Remove the overly complex and risky "black-listing" approach to metadata change detection and just check the metadata fields that indicate a semantic boundary. In effect, this makes not triggering a semantic boundary the default behavior for new metadata fields.
Closely related to the same mis-typing of `regex_metadata` in the test, the code that passed the old test does not consolidate regex-metadata across section elements; all regex-metadata except that in the first element of the section is dropped.

The implementation of adjusting regex-metadata match offsets assumed the wrong data type, so while it passed the tests, in production it dropped all regex_metadata except that in the first element of the section. In fairness, this never actually happened, because the over-chunking fixed in the previous commit made any element that had regex matches show up in its own single-element chunk. Reimplement for regex-metadata of type `Dict[str, List[RegexMetadata]]` rather than `List[RegexMetadata]`.
force-pushed from c8476ed to 1fc6a87
Executive Summary. Introducing strict type-checking as preparation for adding the chunk-overlap feature revealed a type mismatch for regex-metadata between chunking tests and the (authoritative) ElementMetadata definition. The implementation of regex-metadata aspects of chunking passed the tests but did not produce the appropriate behaviors in production where the actual data-structure was different. This PR fixes these two bugs.
1. Over-chunking. The presence of regex-metadata in an element was incorrectly being interpreted as a semantic boundary, leading to such elements being isolated in their own chunks.
2. Discarded regex-metadata. Regex-metadata present on the second or later elements in a section (chunk) was discarded.
Technical Summary

The type of `ElementMetadata.regex_metadata` is `Dict[str, List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text": "this matched", "start": 7, "end": 19}`.

Multiple regexes can be specified, each with a name like "mail-stop", "version", etc. Each of those may produce its own set of matches, like:
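(This is the same illustrative example given earlier in the thread; the regex names and matches are invented for illustration.)

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```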
Forensic analysis
The regex-metadata feature was added by Matt Robinson on 06/16/2023 commit: 4ea7168. The regex_metadata data structure is the same as when it was added.
The chunk-by-title feature was added by Matt Robinson on 08/29/2023 commit: f6a745a. The mistaken regex-metadata data structure in the tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure and insufficient type-checking rigor (type-checker strictness level set too low) to warn of the mistake.
Over-chunking Behavior
The over-chunking looked like this: chunking three elements with regex metadata should combine them into a single chunk (`CompositeElement` object), subject to maximum size rules (default 500 chars). The observed behavior was that each element carrying regex-metadata instead ended up isolated in its own single-element chunk.
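A rough sketch of that scenario; the element texts, regex name, and offsets are invented for illustration, and the assert describes the intended (post-fix) behavior rather than the buggy one:

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import CompositeElement, ElementMetadata, Text

# Three small text elements, each carrying regex_metadata (regex name -> list of matches).
elements = [
    Text("Lorem ipsum dolor.", metadata=ElementMetadata(
        regex_metadata={"ipsum": [{"text": "ipsum", "start": 6, "end": 11}]})),
    Text("In rhoncus ipsum.", metadata=ElementMetadata(
        regex_metadata={"ipsum": [{"text": "ipsum", "start": 11, "end": 16}]})),
    Text("In ipsum ipsum consequat.", metadata=ElementMetadata(
        regex_metadata={"ipsum": [{"text": "ipsum", "start": 3, "end": 8}]})),
]

chunks = chunk_by_title(elements)

# Expected: one CompositeElement, well under the 500-char default maximum.
# The buggy behavior produced three single-element chunks, one per element.
assert len(chunks) == 1 and isinstance(chunks[0], CompositeElement)
```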
The fix changed the approach from breaking on any metadata field not in a specified group (`regex_metadata` was missing from this group) to only breaking on specified fields (whitelisting instead of blacklisting). This avoids over-chunking every time we add a new metadata field, and it is also simpler and easier to understand. This change in approach is discussed in more detail in #1790.
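A sketch of the whitelist idea; the function name and the particular boundary fields here are illustrative, not the PR's actual code:

```python
from unstructured.documents.elements import Element

# Only fields that actually indicate a semantic boundary are compared; any other
# metadata field (including regex_metadata) is ignored for boundary detection.
_SEMANTIC_BOUNDARY_FIELDS = ("section", "page_number")


def _boundary_metadata_matches(a: Element, b: Element) -> bool:
    """True when two elements agree on all boundary-relevant metadata fields."""
    return all(
        getattr(a.metadata, field, None) == getattr(b.metadata, field, None)
        for field in _SEMANTIC_BOUNDARY_FIELDS
    )
```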
Dropping regex-metadata Behavior

Chunking a section whose elements each carry regex-metadata should produce, on the single resulting chunk, the regex_metadata consolidated from all of the section's elements, with match start/stop offsets adjusted to the chunk's concatenated text. Instead, it produced only the regex-metadata from the first element.
The fix was to remove the consolidation+adjustment process from inside the "list-attribute-processing" loop (because regex-metadata is not a list) and process regex metadata separately.
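A sketch of the intended consolidation and offset adjustment for `Dict[str, List[RegexMetadata]]`; the function name and the blank-line separator are assumptions of this sketch, not the PR's actual implementation:

```python
from typing import Dict, List

from unstructured.documents.elements import RegexMetadata, Text


def consolidate_regex_metadata(elements: List[Text]) -> Dict[str, List[RegexMetadata]]:
    """Merge per-element regex_metadata into one mapping for the chunk.

    Match offsets are shifted by each element's position in the chunk text,
    assuming element texts are joined with a blank-line separator.
    """
    separator = "\n\n"  # assumed joiner between element texts in the chunk
    consolidated: Dict[str, List[RegexMetadata]] = {}
    offset = 0
    for element in elements:
        for regex_name, matches in (element.metadata.regex_metadata or {}).items():
            consolidated.setdefault(regex_name, []).extend(
                RegexMetadata(
                    text=match["text"],
                    start=match["start"] + offset,
                    end=match["end"] + offset,
                )
                for match in matches
            )
        # Advance past this element's text plus the joining separator.
        offset += len(element.text) + len(separator)
    return consolidated
```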