Better element IDs - deterministic and document-unique hashes #2673
Conversation
Quite a few proposed mechanical fixes, but approach-wise I think this is probably the best we can do without cleaning up much more broadly into very old parts of the code (from back when `Document` and `Pages` were a thing) that I think no one wants to get involved with just now.
I think this direction will solve the immediate problem and not increase the internal complexity too much.
I really like that you've managed to narrow `Element.id` to the `str` type, which neatly hides all the internal complexity from the user. That's a big improvement I'd say :)
assert element.metadata.page_number is not None
assert element.metadata.index_on_page is not None
This test seems a little weak. There's no indication that page_number is an int or that index_on_page isn't a str or something. I think it would be better as something like:
page_index_pairs = [
(e.metadata.page_number, e.metadata.index_on_page)
for e in partition_html("example-docs/fake-html-duplicates.html")
]
assert page_index_pairs == [(1, 0), (1, 1), ...]
Also this gives the user some concrete intuitive sense of how these values proceed along an element-stream.
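To make the suggested assertion style concrete, here is a minimal, self-contained sketch of the pair-collection pattern. The `FakeMetadata`/`FakeElement` classes are hypothetical stand-ins (the real classes live in `unstructured.documents.elements`, and the fixture file isn't available here):

```python
from dataclasses import dataclass


# Hypothetical stand-ins for unstructured's element classes, used only to
# illustrate the "collect pairs, compare the whole list" assertion style.
@dataclass
class FakeMetadata:
    page_number: int
    index_on_page: int


@dataclass
class FakeElement:
    metadata: FakeMetadata


# Pretend these came back from a partitioner.
elements = [
    FakeElement(FakeMetadata(1, 0)),
    FakeElement(FakeMetadata(1, 1)),
    FakeElement(FakeMetadata(2, 0)),
]

# Comparing the whole list at once checks types, values, and ordering in one shot,
# unlike a bare `is not None` assertion.
page_index_pairs = [(e.metadata.page_number, e.metadata.index_on_page) for e in elements]
assert page_index_pairs == [(1, 0), (1, 1), (2, 0)]
```

The advantage over per-field `is not None` checks is that a wrong type, a wrong value, or a wrong ordering all fail with a readable diff of the two lists.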
👍 you're right, my implementation was quite naive and needs to be refactored the way you proposed
outdated, please resolve
A few more remarks and insights after reflection.
@@ -90,6 +90,7 @@ def convert_and_partition_docx(
     metadata_last_modified: Optional[str] = None,
     languages: Optional[List[str]] = ["auto"],
     detect_language_per_element: bool = False,
+    starting_page_number: int = 1,
I missed that parameter in the previous PR
@@ -212,26 +212,34 @@ def test_auto_partition_html_from_file_rb():
     assert len(elements) > 0


-def test_auto_partition_json_from_filename():
+def test_auto_partitioned_json_output_maintains_consistency_with_fixture_elements():
IMO this test asked for a more verbose name; I'm open to suggestions if there's a better one.
"a4fd85d3f4141acae38c8f9c936ed2f3",
"44ebaaf66640719c918246d4ccba1c45",
"f36e8ebcb3b6a051940a168fe73cbc44",
"532b395177652c7d61e1e4d855f1dc1d",
don't we just care about the number of distinct element IDs rather than the IDs themselves? (not a blocker)
EDIT: maybe not for the "deterministic" part. we can leave this for now.
At some point I've been doing this in combination with:
assert len(ids) == len(set(ids))
Steve suggested that it's redundant since checking hashes explicitly already ensures their uniqueness.
Hi Mike, let's chat about this if we can. Ping when you get in, in case I'm still up.
A few changes needed I think and a couple cases that need to be covered by new tests.
# -- generate sequence number for each element on a page --
page_numbers = [e.metadata.page_number for e in elements]
page_seq_pairs = [
    seq_on_page for page, group in groupby(page_numbers) for seq_on_page, _ in enumerate(group)
]
I think this is going to require more testing. The following cases must be verified by tests:
- All `element.metadata.page_number` values are `None`, such as for file-formats that do not support pagination. That would include sometimes-HTML, sometimes-DOCX, probably CSV, maybe TXT, and others.
- Some `element.metadata.page_number` values are `None`, like `PageBreak("")` and maybe `PageNumber()`.
The latter case could cause false groups/restarts.
I think this is the logic we want:
def iter_element_sequence_number_pairs():
current_page_number, sequence_number = 1, 1
for e in elements:
page_number = e.metadata.page_number
# -- An occasional element such as PageBreak() that has no page-number assumes the current
# -- page number. An element stream with no page-numbers gets page_number=1 for each
# -- of its elements.
if page_number and page_number != current_page_number:
current_page_number = page_number
sequence_number = 1
yield e, sequence_number
sequence_number += 1
As you know I usually have a preference for comprehensions, but given the special-case logic required here I'm inclined to implement it as a generator like this (can be a function internal to `assign_and_map_hash_ids()`). That way it can be tweaked and commented as required to deal with exception cases.
"""
self._element_id = hashlib.sha256(self.text.encode()).hexdigest()[:32]
data = f"{self.metadata.filename}{self.text}{self.metadata.page_number}{sequence_number}"
I think we're going to need some similar solution here, aren't we? `self.metadata.page_number` can still be `None`, right? And I think we said it should be 1 in that case, but I think it actually needs to be a little more sophisticated to deal with the "occasional None" case; otherwise page_number could go ... 6, 6, 1, 6, 7, 8, 1, ...
I'm inclined to think this method needs to take `page_number` as well as `sequence_number` (which should probably be named `page_sequence_number` btw). The nested function proposed above could just be elaborated to yield `e, current_page_number, sequence_number` and maybe be named a little differently since it's generating triples now instead of pairs.
Unless we've somehow established that all elements always have `.metadata.page_number` (and I don't think we have).
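A rough sketch of what hashing over the full (filename, text, page-number, page-sequence-number) tuple could look like. The function name and signature here are illustrative assumptions, not the PR's actual API:

```python
import hashlib


# Hypothetical sketch: a deterministic, document-unique ID built from the
# element's filename, text, page number, and position on that page.
# (Names are illustrative; the real method lives on Element.)
def compute_element_id(filename: str, text: str, page_number: int, page_sequence_number: int) -> str:
    data = f"{filename}{text}{page_number}{page_sequence_number}"
    # Truncate to 32 hex chars, matching the existing ID width.
    return hashlib.sha256(data.encode()).hexdigest()[:32]


# Identical text on different pages (the duplicate-pages case) now hashes
# differently, which is what hashing text alone could not guarantee:
id_a = compute_element_id("doc.pdf", "Lorem ipsum", 1, 0)
id_b = compute_element_id("doc.pdf", "Lorem ipsum", 2, 0)
assert id_a != id_b
assert len(id_a) == 32
```

The determinism comes from feeding only stable inputs to the hash; the document-uniqueness holds only as far as the (page_number, page_sequence_number) pair is itself unique per element, which is the point under discussion here.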
are you sure we need to solve this?
we should not be writing `self.metadata.page_number` for this PR, only using it for hash calculations? and if it is `None`, just assume 1?
EDIT: It shouldn't matter for the purpose of the hash.
@crag: A couple things:
- This will be using "None" for the page-number, not "1".
- Therefore the "document-unique" promise is not guaranteed.
I had understood that the "document-unique" guarantee was a hard commit. But if we're satisfied with "usually mostly document-unique" then I suppose this will get it done :)
if there are `None` page_number elements in a PDF, that sounds like a bug.
the biggest concern is PDFs. what case are you worried about?
This code will be executed for all file-types, not just PDF, TIFF, and PPTX that "guarantee" page-numbers.
So there are two problematic cases across file-types:
- No page numbers
- Occasional omitted page numbers.
No-elements-have-page-numbers happens in formats like TXT and I'm sure several others. Occasional-omitted-page-numbers happen when, e.g., `PageBreak` and perhaps `PageNumber` elements are not assigned page-number metadata.
These are the cases that the proposed generator above addresses.
> No-elements-have-page-numbers happens in formats like TXT

but then each element should have a different sequence number, as if it was one giant page

> Occasional-omitted-page-numbers are like PageBreak

if a PageBreak with no page number appears in a PDF or PPTX (or other "real page numbers" doc), imo that is a bug. the page number should be the page it is marking the end of.
> No-elements-have-page-numbers happens in formats like TXT
>
> but then each element should have a different sequence number, as if it was one giant page

Hmm, yes, good point. I suppose "None" is as good as "1" in that case, as long as we're consistent. I agree, that handles that case well enough since we aren't page-splitting those formats.

> Occasional-omitted-page-numbers are like PageBreak
>
> if a PageBreak with no page number appears in a PDF or PPTX (or other "real page numbers" doc), imo that is a bug.

Okay, that's fair enough. We can check those formats and make sure they assign at least page-number metadata to those elements.
remaining threads can be addressed in follow-on PRs
…tting (#400)

This PR enables the Python and JS clients to partition PDF pages independently after splitting them on their side (`split_pdf_page=True`). Splitting is also supported by the API itself - this makes sense when users send their requests without using our dedicated clients.

Related to:

* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:

* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both merged.**

## How to test it locally

Unfortunately the `pytest` test is not fully implemented, it fails - see [this comment](#400 (comment))

1. Clone Python client and checkout to this PR: Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4. `python <script-below>.py`

```python
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False,  # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)
resp = s.general.partition(req)

ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]

# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Part two of: #2842
Main changes compared to part one:
This PR addresses the following issue: #2461