Better element IDs - deterministic and document-unique hashes #2673
Conversation
Quite a few proposed mechanical fixes, but approach-wise I think this is probably the best we can do without cleaning up much more broadly into very old parts of the code (from back when `Document` and `Pages` were a thing) that I think no one wants to get involved with just now.
I think this direction will solve the immediate problem and not increase the internal complexity too much.
I really like that you've managed to narrow `Element.id` to the `str` type, which neatly hides all the internal complexity from the user. That's a big improvement I'd say :)
assert element.metadata.page_number is not None
assert element.metadata.index_on_page is not None
This test seems a little weak. There's no indication that page_number is an int or that index_on_page isn't a str or something. I think it would be better as something like:
page_index_pairs = [
(e.metadata.page_number, e.metadata.index_on_page)
for e in partition_html("example-docs/fake-html-duplicates.html")
]
assert page_index_pairs == [(1, 0), (1, 1), ...]
Also this gives the user some concrete intuitive sense of how these values proceed along an element-stream.
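To make the suggested assertion style concrete, here is a minimal, self-contained sketch of the pair-collection pattern. The `FakeMetadata`/`FakeElement` classes are hypothetical stand-ins (the real classes live in `unstructured.documents.elements`, and the fixture file isn't available here):

```python
from dataclasses import dataclass


# Hypothetical stand-ins for unstructured's element classes, used only to
# illustrate the "collect pairs, compare the whole list" assertion style.
@dataclass
class FakeMetadata:
    page_number: int
    index_on_page: int


@dataclass
class FakeElement:
    metadata: FakeMetadata


# Pretend these came back from a partitioner.
elements = [
    FakeElement(FakeMetadata(1, 0)),
    FakeElement(FakeMetadata(1, 1)),
    FakeElement(FakeMetadata(2, 0)),
]

# Comparing the whole list at once checks types, values, and ordering in one shot,
# unlike a bare `is not None` assertion.
page_index_pairs = [(e.metadata.page_number, e.metadata.index_on_page) for e in elements]
assert page_index_pairs == [(1, 0), (1, 1), (2, 0)]
```

The advantage over per-field `is not None` checks is that a wrong type, a wrong value, or a wrong ordering all fail with a readable diff of the two lists.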
👍 you're right, my implementation was quite naive and needs to be refactored the way you proposed
outdated, please resolve
A few more remarks and insights after reflection.
@@ -90,6 +90,7 @@ def convert_and_partition_docx(
     metadata_last_modified: Optional[str] = None,
     languages: Optional[List[str]] = ["auto"],
     detect_language_per_element: bool = False,
+    starting_page_number: int = 1,
I missed that parameter in the previous PR
@@ -212,26 +212,34 @@ def test_auto_partition_html_from_file_rb():
     assert len(elements) > 0


-def test_auto_partition_json_from_filename():
+def test_auto_partitioned_json_output_maintains_consistency_with_fixture_elements():
IMO this test asked for a more verbose name; I'm open to suggestions if there's a better one.
"a4fd85d3f4141acae38c8f9c936ed2f3",
"44ebaaf66640719c918246d4ccba1c45",
"f36e8ebcb3b6a051940a168fe73cbc44",
"532b395177652c7d61e1e4d855f1dc1d",
don't we just care about the number of distinct element IDs rather than the IDs themselves? (not a blocker)
EDIT: maybe not for the "deterministic" part. we can leave this for now.
At some point I've been doing this in combination with:
assert len(ids) == len(set(ids))
Steve suggested that it's redundant since checking hashes explicitly already ensures their uniqueness.
Hi Mike, let's chat about this if we can. Ping when you get in, in case I'm still up.
A few changes needed I think and a couple cases that need to be covered by new tests.
# -- generate sequence number for each element on a page --
page_numbers = [e.metadata.page_number for e in elements]
page_seq_pairs = [
    seq_on_page for page, group in groupby(page_numbers) for seq_on_page, _ in enumerate(group)
]
I think this is going to require more testing. The following cases must be verified by tests:
- All `element.metadata.page_number` values are `None`, such as for file-formats that do not support pagination. That would include sometimes-HTML, sometimes-DOCX, probably CSV, maybe TXT, and others.
- Some `element.metadata.page_number` values are `None`, like `PageBreak("")` and maybe `PageNumber()`.
The latter case could cause false groups/restarts.
I think this is the logic we want:
def iter_element_sequence_number_pairs():
current_page_number, sequence_number = 1, 1
for e in elements:
page_number = e.metadata.page_number
# -- An occasional element such as PageBreak() that has no page-number assumes the current
# -- page number. An element stream with no page-numbers gets page_number=1 for each
# -- of its elements.
if page_number and page_number != current_page_number:
current_page_number = page_number
sequence_number = 1
yield e, sequence_number
sequence_number += 1
As you know I usually have a preference for comprehensions, but given the special-case logic required here I'm inclined to implement it as a generator like this (can be a function internal to `assign_and_map_hash_ids()`). That way it can be tweaked and commented as required to deal with exception cases.
"""
self._element_id = hashlib.sha256(self.text.encode()).hexdigest()[:32]
data = f"{self.metadata.filename}{self.text}{self.metadata.page_number}{sequence_number}"
I think we're going to need some similar solution here, aren't we? `self.metadata.page_number` can still be `None`, right? And I think we said it should be 1 in that case, but I think it actually needs to be a little more sophisticated to deal with the "occasional None" case; otherwise page_number could go ... 6, 6, 1, 6, 7, 8, 1, ...
I'm inclined to think this method needs to take `page_number` as well as `sequence_number` (which should probably be named `page_sequence_number` btw). The nested function proposed above could just be elaborated to yield `e, current_page_number, sequence_number` and maybe be named a little differently since it's generating triples now instead of pairs.
Unless we've somehow established that all elements always have `.metadata.page_number` (and I don't think we have).
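A rough sketch of what hashing over the full (filename, text, page-number, page-sequence-number) tuple could look like. The function name and signature here are illustrative assumptions, not the PR's actual API:

```python
import hashlib


# Hypothetical sketch: a deterministic, document-unique ID built from the
# element's filename, text, page number, and position on that page.
# (Names are illustrative; the real method lives on Element.)
def compute_element_id(filename: str, text: str, page_number: int, page_sequence_number: int) -> str:
    data = f"{filename}{text}{page_number}{page_sequence_number}"
    # Truncate to 32 hex chars, matching the existing ID width.
    return hashlib.sha256(data.encode()).hexdigest()[:32]


# Identical text on different pages (the duplicate-pages case) now hashes
# differently, which is what hashing text alone could not guarantee:
id_a = compute_element_id("doc.pdf", "Lorem ipsum", 1, 0)
id_b = compute_element_id("doc.pdf", "Lorem ipsum", 2, 0)
assert id_a != id_b
assert len(id_a) == 32
```

The determinism comes from feeding only stable inputs to the hash; the document-uniqueness holds only as far as the (page_number, page_sequence_number) pair is itself unique per element, which is the point under discussion here.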
are you sure we need to solve this?
we should not be writing `self.metadata.page_number` for this PR, only using it for hash calculations? and if it is `None`, just assume 1?
EDIT: It shouldn't matter for the purpose of the hash.
@crag: A couple things:
- This will be using "None" for the page-number, not "1".
- Therefore the "document-unique" promise is not guaranteed.
I had understood that the "document-unique" guarantee was a hard commit. But if we're satisfied with "usually mostly document-unique" then I suppose this will get it done :)
if there are `None` page_number elements in a PDF, that sounds like a bug.
the biggest concern is PDFs. what case are you worried about?
This code will be executed for all file-types, not just PDF, TIFF, and PPTX that "guarantee" page-numbers.
So there are two problematic cases across file-types:
- No page numbers
- Occasional omitted page numbers.
No-elements-have-page-numbers happens in formats like TXT and I'm sure several others. Occasional-omitted-page-numbers happen when, e.g., `PageBreak` and perhaps `PageNumber` elements are not assigned page-number metadata.
These are the cases that the proposed generator above addresses.
> No-elements-have-page-numbers happens in formats like TXT

but then each element should have a different sequence number, as if it was one giant page

> Occasional-omitted-page-numbers are like PageBreak

if a PageBreak with no page number appears in a PDF or PPTX (or other "real page numbers" doc), imo that is a bug. the page number should be the page it is marking the end of.
> No-elements-have-page-numbers happens in formats like TXT
>
> but then each element should have a different sequence number, as if it was one giant page

Hmm, yes, good point. I suppose "None" is as good as "1" in that case, as long as we're consistent. I agree, that handles that case well enough since we aren't page-splitting those formats.

> Occasional-omitted-page-numbers are like PageBreak
>
> if a PageBreak with no page number appears in a PDF or PPTX (or other "real page numbers" doc), imo that is a bug.

Okay, that's fair enough. We can check those formats and make sure they assign at least page-number metadata to those elements.
remaining threads can be addressed in follow-on PRs
…tting (#400)

This PR enables the Python and JS clients to partition PDF pages independently after splitting them on their side (`split_pdf_page=True`). Splitting is also supported by the API itself - this makes sense when users send their requests without using our dedicated clients.

Related to:

* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:

* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both merged.**

## How to test it locally

Unfortunately the `pytest` test is not fully implemented, it fails - see [this comment](#400 (comment))

1. Clone Python client and checkout to this PR: Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4. `python <script-below>.py`

```python
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False,  # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)
resp = s.general.partition(req)

ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]

# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Part two of: #2842
Main changes compared to part one:
This PR addresses the following issue: #2461