Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better element IDs - deterministic and document-unique hashes #2673

Merged
merged 201 commits into from
Apr 24, 2024

Conversation

micmarty-deepsense
Copy link
Contributor

@micmarty-deepsense micmarty-deepsense commented Mar 20, 2024

Part two of: #2842

Main changes compared to part one:

  • hash computation includes element's sequence number on page, page number, document filename and its text
  • there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents)

This PR addresses the following issue: #2461

@micmarty-deepsense micmarty-deepsense self-assigned this Mar 21, 2024
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a few proposed mechanical fixes, but approach-wise I think this is probably the best we can do without cleaning up much more broadly into very old parts of the code (from back when Document and Pages were a thing) that I think no one wants to get involved with just now.

I think this direction will solve the immediate problem and not increase the internal complexity too much.

I really like that you've managed to narrow Element.id to str type, which neatly hides all the internal complexity from the user. That's a big improvement I'd say :)

unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
test_unstructured/partition/test_html_partition.py Outdated Show resolved Hide resolved
Comment on lines 741 to 742
assert element.metadata.page_number is not None
assert element.metadata.index_on_page is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems a little weak. There's no indication that page_number is an int or index_on_page is not str or something. I think it would be better as something like:

page_index_pairs = [
    (e.metadata.page_number, e.metadata.index_on_page)
    for e in partition_html("example-docs/fake-html-duplicates.html")
]
assert page_index_pairs == [(1, 0), (1, 1), ...]

Also this give the user some concrete intuitive sense of how these values proceed along an element-stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 you're right, my implementation was quite naive and needs to be refactored the way you proposed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated, please resolve

test_unstructured/partition/test_html_partition.py Outdated Show resolved Hide resolved
test_unstructured/partition/test_html_partition.py Outdated Show resolved Hide resolved
test_unstructured/partition/test_html_partition.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more remarks and insights after reflection.

unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
unstructured/documents/elements.py Outdated Show resolved Hide resolved
@micmarty-deepsense micmarty-deepsense force-pushed the CORE-3587/better-element-ids branch from 255e9f9 to 4c393f4 Compare March 28, 2024 11:44
@micmarty-deepsense micmarty-deepsense force-pushed the CORE-3587/better-element-ids branch from f9bdd43 to 43300f0 Compare April 22, 2024 15:35
@@ -90,6 +90,7 @@ def convert_and_partition_docx(
metadata_last_modified: Optional[str] = None,
languages: Optional[List[str]] = ["auto"],
detect_language_per_element: bool = False,
starting_page_number: int = 1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that parameter in the previous PR

@@ -212,26 +212,34 @@ def test_auto_partition_html_from_file_rb():
assert len(elements) > 0


def test_auto_partition_json_from_filename():
def test_auto_partitioned_json_output_maintains_consistency_with_fixture_elements():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this test asked for more verbose name, I'm open to suggestions if there's a better one

"a4fd85d3f4141acae38c8f9c936ed2f3",
"44ebaaf66640719c918246d4ccba1c45",
"f36e8ebcb3b6a051940a168fe73cbc44",
"532b395177652c7d61e1e4d855f1dc1d",
Copy link
Contributor

@cragwolfe cragwolfe Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we just care about the length of distinct element id's rather than the id's themselves? (not a blocker)

EDIT: maybe not for the "deterministic" part. we can leave this for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point I've been doing this in combination with:

assert len(ids) == len(set(ids))

Steve suggested that it's redundant since checking hashes explicitly already ensures their uniqueness.

scanny
scanny previously requested changes Apr 24, 2024
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Mike, let's chat about this if we can. Ping when you get in, in case I'm still up.

A few changes needed I think and a couple cases that need to be covered by new tests.

Comment on lines +513 to +517
# -- generate sequence number for each element on a page --
page_numbers = [e.metadata.page_number for e in elements]
page_seq_pairs = [
seq_on_page for page, group in groupby(page_numbers) for seq_on_page, _ in enumerate(group)
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is going to require more testing. The following cases must be verified by tests:

  • All element.metadata.page_number values are None, such as for file-formats that do not support pagination. That would include sometimes-HTML, sometimes-DOCX, probably CSV, maybe TXT, and others.
  • Some element.metadata.page_number values are None, like PageBreak("") and maybe PageNumber().

The latter case could cause false groups/restarts.

I think this is the logic we want:

    def iter_element_sequence_number_pairs():
        current_page_number, sequence_number = 1, 1
        for e in elements:
            page_number = e.metadata.page_number
            # -- An occasional element such as PageBreak() that has no page-number assumes the current
            # -- page number. An element stream with no page-numbers gets page_number=1 for each
            # -- of its elements.
            if page_number and page_number != current_page_number:
                current_page_number = page_number
                sequence_number = 1
            yield e, sequence_number
            sequence_number += 1

As you know I usually have a preference for comprehensions, but given the special-case logic required here I'm inclined to implement it as a generator like this (can be a function internal to assign_and_map_hash_ids()). That way it can be tweaked and commented as required to deal with exception cases.

"""
self._element_id = hashlib.sha256(self.text.encode()).hexdigest()[:32]
data = f"{self.metadata.filename}{self.text}{self.metadata.page_number}{sequence_number}"
Copy link
Collaborator

@scanny scanny Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're going to need some similar solution here, aren't we? self.metadata.page_number can be None still, right? And I think we said it should be 1 in that case, but I think it actually needs to be a little more sophisticated to deal with the "occasional None" case; otherwise page_number could go ... 6, 6, 1, 6, 7, 8, 1, ...

I'm inclined to think this method needs to take page_number as well as sequence_number (which should probably be named page_sequence_number btw). The nested function proposed above could just be elaborated to yield e, current_page_number, sequence_number and maybe be named a little differently since it's generating triples now instead of pairs.

Unless we've somehow established that all elements always have .metadata.page_number (and I don't think we have).

Copy link
Contributor

@cragwolfe cragwolfe Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure we need to solve this?
we should not be writing self.metadata.page_number for this PR, only using it for hash calculations? and if it is None, just assume 1?
EDIT: It shouldn't matter for the purpose of the hash.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crag: A couple things:

  • This will be using "None" for the page-number, not "1".
  • Therefore the "document-unique" promise is not guaranteed.

I had understood that the "document-unique" guarantee was a hard commit. But if we're satisfied with "usually mostly document-unique" then I suppose this will get it done :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are None page_number elements in a PDF, that sounds like a bug.
the biggest concern is PDF's. what case are you worried about?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will be executed for all file-types, not just PDF, TIFF, and PPTX that "guarantee" page-numbers.

Copy link
Collaborator

@scanny scanny Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are two problematic cases across file-types:

  • No page numbers
  • Occasional omitted page numbers.

No-elements-have-page-numbers happens in formats like TXT and I'm sure several others. Occasional-omitted-page-numbers are like PageBreak and perhaps PageNumber elements are not assigned page-number metadata.

These are the cases that the proposed generator above addresses.

Copy link
Contributor

@cragwolfe cragwolfe Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No-elements-have-page-numbers happens in formats like TXT

but then each element should have a different sequence number, as if it was one giant page

Occasional-omitted-page-numbers are like PageBreak

if a PageBreak with no page number appears in a PDF or PPTX (or other "real page numbers" doc), imo that is a bug. the page number should be the page it is marking the end of.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No-elements-have-page-numbers happens in formats like TXT

but then each element should have a different sequence number, as if it was one giant page

Hmm, yes, good point. I suppose "None" is as good as "1" in that case, as long as we're consistent. I agree, that handles that case well enough since we aren't page-splitting those formats.

Occasional-omitted-page-numbers are like PageBreak

if a PageBreak with no page number appears in a PDF or PPTX (or other "real page numbers" doc), imo that is a bug.

Okay, that's fair enough. We can check those formats and make sure they assign at least page-number metadata to those elements.

@cragwolfe cragwolfe merged commit 2d1923a into main Apr 24, 2024
42 checks passed
@cragwolfe cragwolfe deleted the CORE-3587/better-element-ids branch April 24, 2024 07:05
awalker4 added a commit to Unstructured-IO/unstructured-api that referenced this pull request Apr 30, 2024
…tting (#400)

This PR enables the Python and JS clients to partition PDF pages
independently after splitting them on their side
(`split_pdf_page=True`). Splitting is also supported by API itself -
this makes sense when users send their requests without using our
dedicated clients.

Related to: 
* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:
* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both
merged.**

## How to test it locally
Unfortunately the `pytest` test is not fully implemented, it fails - see
[this
comment](#400 (comment))
1. Clone Python client and checkout to this PR:
Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4.  `python <script-below>.py`

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False, # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)
resp = s.general.partition(req)
ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]
# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants