Better element IDs - deterministic and document-unique hashes #2673

Merged
merged 201 commits into from
Apr 24, 2024
Changes from 198 commits
Commits
e68a7f5
prototype solution for PDF files
micmarty-deepsense Mar 20, 2024
f3f3321
add basic tests for element IDs
micmarty-deepsense Mar 21, 2024
3398be3
recalculate ID based on metadata (if present)
micmarty-deepsense Mar 21, 2024
76cbaef
add more unit tests
micmarty-deepsense Mar 21, 2024
272f4a6
add HashValue class to identify when ID recalculation is required
micmarty-deepsense Mar 21, 2024
3375f63
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Mar 25, 2024
b71369c
add test and a set of fixtures for unique and deterministic pdf eleme…
micmarty-deepsense Mar 26, 2024
e5d90ab
update hash computation so it allows for appending other data
micmarty-deepsense Mar 26, 2024
608bdbe
add given when then comments
micmarty-deepsense Mar 26, 2024
4c139de
add docstring
micmarty-deepsense Mar 26, 2024
02ca092
add html tests
micmarty-deepsense Mar 26, 2024
0d33608
revert unused change
micmarty-deepsense Mar 26, 2024
e813b89
remove Text element tests for page_number and index_on_page
micmarty-deepsense Mar 27, 2024
3f87ad2
recalculate_ids outside of the Text class
micmarty-deepsense Mar 27, 2024
e2eea35
get rid of index_on_page
micmarty-deepsense Mar 27, 2024
d14a47e
revert _id to id
micmarty-deepsense Mar 27, 2024
bc29126
simplify hash calculation function
micmarty-deepsense Mar 27, 2024
cd9b9b3
remove uuid.UUID from type hints for self.id
micmarty-deepsense Mar 27, 2024
f49c68d
quickfix calculate_hash function call
micmarty-deepsense Mar 27, 2024
eef4264
update PPTX test
micmarty-deepsense Mar 27, 2024
b6e850f
add docx test
micmarty-deepsense Mar 27, 2024
7efe3a4
refactor ids recalculation by moving it to process_metadata decorator
micmarty-deepsense Mar 28, 2024
4c393f4
remove unused code
micmarty-deepsense Mar 28, 2024
8f9c445
revert isinstance statement
micmarty-deepsense Mar 28, 2024
d33c86c
revert inline return statement
micmarty-deepsense Mar 28, 2024
3becd44
add tests for calculating hash and recalculatind ids
micmarty-deepsense Mar 28, 2024
183c38b
do dont mutate, but copy elements
micmarty-deepsense Mar 28, 2024
01f16c8
update docs hashes
micmarty-deepsense Mar 28, 2024
fbdaefe
add doc tests
micmarty-deepsense Mar 28, 2024
03fc126
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Mar 28, 2024
289b1b3
refactor recalculate_ids so it updates parent_id's correctly
micmarty-deepsense Mar 29, 2024
56783b5
rename calculate_hash into id_to_hash and make it a method
micmarty-deepsense Mar 29, 2024
5509bb5
revert existing logic of assigning id's at construction-time
micmarty-deepsense Mar 29, 2024
82efb28
remove unused code
micmarty-deepsense Mar 29, 2024
1812435
apply code review suggestions in tests
micmarty-deepsense Mar 29, 2024
bc35458
rename "recalculate_ids" with "assign_hash_ids"
micmarty-deepsense Mar 29, 2024
cdac860
remove test which is no longer relevant
micmarty-deepsense Mar 29, 2024
dfe446d
update html test file and test itself
micmarty-deepsense Mar 29, 2024
1d01dfb
add test_id_to_hash
micmarty-deepsense Mar 29, 2024
9350866
handle edge case for xlsx files
micmarty-deepsense Mar 29, 2024
dfba0fb
update file name
micmarty-deepsense Mar 29, 2024
75f3e88
revert original id in test
micmarty-deepsense Mar 29, 2024
e179a55
use deepcopy in test to compare if ids have changed
micmarty-deepsense Mar 29, 2024
6d93e0b
revert to construction-time UUIDs
micmarty-deepsense Mar 29, 2024
baa0540
explicit warning in assign_hash_id
micmarty-deepsense Apr 2, 2024
d51b8c1
add dummy copy of id_to_hash to class "Name(EmailElement)"
micmarty-deepsense Apr 2, 2024
1e0a4a5
update hashes in tests
micmarty-deepsense Apr 2, 2024
1f64d46
adjust hash values for pptx hierarchy test
micmarty-deepsense Apr 2, 2024
4b5b84c
remove unused file
micmarty-deepsense Apr 2, 2024
49d899d
adjust pdf hashes in a test
micmarty-deepsense Apr 2, 2024
24b7b5b
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 2, 2024
8911fa3
update overview.rst
micmarty-deepsense Apr 2, 2024
5d0ed03
remove deprecated test
micmarty-deepsense Apr 2, 2024
6fe739a
raise if element_id is not a string or NoId
micmarty-deepsense Apr 2, 2024
135f8af
update CHANGELOG
micmarty-deepsense Apr 2, 2024
c578aa7
quickfix ruff warnings
micmarty-deepsense Apr 2, 2024
dd0b949
quickfix changelog
micmarty-deepsense Apr 2, 2024
ca53a97
update __version__
micmarty-deepsense Apr 2, 2024
a6cae7b
Better element IDs <- Ingest test fixtures update (#2832)
ryannikolaidis Apr 2, 2024
6b2cffa
use hash for label studio annotations
micmarty-deepsense Apr 2, 2024
fa4cb39
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 2, 2024
0126788
adjust email test
micmarty-deepsense Apr 2, 2024
237636a
improve email element design
micmarty-deepsense Apr 2, 2024
007b733
fix chunking
micmarty-deepsense Apr 2, 2024
23dbbb1
update the docstring for assign_hash_ids
micmarty-deepsense Apr 3, 2024
3a6d04a
remove try except
micmarty-deepsense Apr 3, 2024
515cb52
don't call id_to_uuid, elements already have UUIDs
micmarty-deepsense Apr 3, 2024
652d6c2
move id_to_hash from Text to Element
micmarty-deepsense Apr 3, 2024
6ed2c7e
reorder methods to alphabetical order
micmarty-deepsense Apr 3, 2024
452e3cd
remove unused id_to_uuid
micmarty-deepsense Apr 3, 2024
86023f8
update hashes in tests
micmarty-deepsense Apr 3, 2024
3bd0745
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 3, 2024
810dce1
Better element IDs <- Ingest test fixtures update (#2839)
ryannikolaidis Apr 3, 2024
5a58acd
remove unused imports
micmarty-deepsense Apr 3, 2024
a4e654f
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 3, 2024
298fad1
update hashes
micmarty-deepsense Apr 3, 2024
339a440
refactor one test in test_email_elements.py
micmarty-deepsense Apr 3, 2024
0d65b02
fix KeyErrors for stanley-cups
micmarty-deepsense Apr 3, 2024
9dbf6ae
merge 2 tests into 1
micmarty-deepsense Apr 3, 2024
a2e4302
update pdf hashes
micmarty-deepsense Apr 3, 2024
cd3cdc4
fix label studio tests
micmarty-deepsense Apr 3, 2024
2d27057
fix baseplate tests
micmarty-deepsense Apr 3, 2024
d1ecb40
add element ID design principles section in the documentation
micmarty-deepsense Apr 3, 2024
fd3b55a
Better element IDs <- Ingest test fixtures update (#2840)
ryannikolaidis Apr 3, 2024
ebb1209
update Element docstrings
micmarty-deepsense Apr 3, 2024
2c398e6
change num of expected files in local ingest from 12 to 13
micmarty-deepsense Apr 3, 2024
6b5dddf
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 3, 2024
639423a
modify default behavior of Element, Text, and Name class
micmarty-deepsense Apr 3, 2024
ad9f58f
quickfix id initialization in Element
micmarty-deepsense Apr 3, 2024
ee90096
move id initialization to Element
micmarty-deepsense Apr 3, 2024
2d9a127
refactor id assertions in test_elements.py
micmarty-deepsense Apr 3, 2024
f5c650a
quickfix bug, forgot to remove invalid assignment
micmarty-deepsense Apr 3, 2024
11c4041
add changelog entry
micmarty-deepsense Apr 3, 2024
20d7c2f
adjust email tests
micmarty-deepsense Apr 3, 2024
4aadf22
fix chunking
micmarty-deepsense Apr 3, 2024
45973ee
remove unnecessary enumeration and remove argument to id_to_hash
micmarty-deepsense Apr 3, 2024
3f51745
remove unused import
micmarty-deepsense Apr 3, 2024
bc93f54
quickfix support for | operand in 3.9
micmarty-deepsense Apr 3, 2024
e9a5dcf
add design principles in overview.rst
micmarty-deepsense Apr 3, 2024
f36f76c
fix staging test by using deterministic hashes
micmarty-deepsense Apr 3, 2024
71e70d7
fix tests that were failing due to invalid text_as_html consolidation
micmarty-deepsense Apr 3, 2024
17d6585
add empty lines
micmarty-deepsense Apr 3, 2024
511cc05
quickfix typo
micmarty-deepsense Apr 3, 2024
96e5b67
parametrize test_text_uuid
micmarty-deepsense Apr 3, 2024
d555736
Preparing the ground for better element IDs <- Ingest test fixtures u…
ryannikolaidis Apr 3, 2024
1a38d70
adjust ingestion chunking config
micmarty-deepsense Apr 3, 2024
920837a
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
micmarty-deepsense Apr 3, 2024
933cef0
adjust ingestion chunking config
micmarty-deepsense Apr 3, 2024
53599d3
Preparing the ground for better element IDs <- Ingest test fixtures u…
ryannikolaidis Apr 3, 2024
9507a96
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 3, 2024
39d7958
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense Apr 4, 2024
939f54d
use hashes in partitioner
micmarty-deepsense Apr 4, 2024
c4c91a5
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 4, 2024
793ef37
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 4, 2024
36aeefd
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense Apr 4, 2024
f0e0149
remove unused import
micmarty-deepsense Apr 4, 2024
1c35139
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
micmarty-deepsense Apr 4, 2024
bec0b90
move id_to_hash to interfaces.py
micmarty-deepsense Apr 4, 2024
b5e53bb
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 4, 2024
38cd4fa
ignore mongodb.sh in test-ingest-src.sh
micmarty-deepsense Apr 5, 2024
ac94588
remove redundant loop with id_to_hash
micmarty-deepsense Apr 5, 2024
52e830b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 5, 2024
7052896
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 5, 2024
c5b16f3
update changelog and sync version
micmarty-deepsense Apr 5, 2024
af879ab
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 5, 2024
82546ed
revert ignoring mongodb.sh
micmarty-deepsense Apr 5, 2024
79afe97
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 5, 2024
ea6b881
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense Apr 8, 2024
4d9dbc2
rename assign_hash_ids to assign_and_map_hash_ids
micmarty-deepsense Apr 8, 2024
adb4592
change expected argument type for element_id in CheckBox
micmarty-deepsense Apr 8, 2024
836e514
add a test utility for assigning hash ids
micmarty-deepsense Apr 8, 2024
c0add80
more detailed element test
micmarty-deepsense Apr 8, 2024
e20666d
rename test
micmarty-deepsense Apr 8, 2024
cf62230
remove redundant line
micmarty-deepsense Apr 8, 2024
178bf57
bump version
micmarty-deepsense Apr 8, 2024
73d8edd
Merge branch 'mike/preparing-ground-for-better-element-ids' into CORE…
micmarty-deepsense Apr 8, 2024
170141f
update test name
micmarty-deepsense Apr 8, 2024
3937ac4
quickfix amgiguity in hash assigning function calls
micmarty-deepsense Apr 8, 2024
3ee35ce
update CHANGELOG
micmarty-deepsense Apr 8, 2024
5017e6c
remove unused import
micmarty-deepsense Apr 8, 2024
f0def7b
adjust hashes in test
micmarty-deepsense Apr 8, 2024
560d2cd
fix missing argument to id_to_hash
micmarty-deepsense Apr 8, 2024
bde2907
update hash in test
micmarty-deepsense Apr 8, 2024
ba28243
update email tests
micmarty-deepsense Apr 8, 2024
ca5861c
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 8, 2024
2a9f0b8
fix a bug: sharing one memory address
micmarty-deepsense Apr 8, 2024
21914f2
refactor assign_and_map_hash_ids according to review sugestions
micmarty-deepsense Apr 8, 2024
6e80fb4
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 8, 2024
3916d0d
make pytest.mark.parametrize body compact
micmarty-deepsense Apr 8, 2024
3250abe
add 2 example docs and adjust related tests
micmarty-deepsense Apr 8, 2024
fe7fa00
move assign_hash_ids from test_utils to unit_utils
micmarty-deepsense Apr 8, 2024
66c3f23
apply other minor review suggestions
micmarty-deepsense Apr 8, 2024
8fa2666
remove unused import
micmarty-deepsense Apr 8, 2024
aab6bad
add pdf with duplicate page and refactor related test
micmarty-deepsense Apr 8, 2024
4fd7d62
quickfix importing assign_hash_ids
micmarty-deepsense Apr 8, 2024
90a1880
remove unused imports
micmarty-deepsense Apr 8, 2024
ae2cd30
get rid of List type
micmarty-deepsense Apr 8, 2024
c0d1bb1
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 8, 2024
bdc0c3b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 8, 2024
461f9b9
remove unused imports
micmarty-deepsense Apr 8, 2024
624ba1a
remove unused imports
micmarty-deepsense Apr 8, 2024
8e100f7
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 8, 2024
4ef4821
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 9, 2024
a3e5d60
remove unused argument
micmarty-deepsense Apr 9, 2024
562df25
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 9, 2024
b17c80f
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 18, 2024
89c7f27
clean up after resolving conflicts
micmarty-deepsense Apr 18, 2024
e2f4c3c
update hash ids for test
micmarty-deepsense Apr 18, 2024
3c07881
use seq_on_page in hash calculation
micmarty-deepsense Apr 18, 2024
e0b02ec
support for starting_page_number in ODT files
micmarty-deepsense Apr 18, 2024
cf45f7a
update hashes for doc and docx tests, remove redundant assertion
micmarty-deepsense Apr 18, 2024
ff2fd2f
include filename in hash calculation
micmarty-deepsense Apr 18, 2024
eeb1ea6
fix bug of sharing one metadata object by multiple elements for msg f…
micmarty-deepsense Apr 18, 2024
33ae279
update hashes in tests and refactor them slightly
micmarty-deepsense Apr 18, 2024
f6ec6a0
adjust pptx test cases
micmarty-deepsense Apr 18, 2024
9f4aded
update hashes for staging tests
micmarty-deepsense Apr 18, 2024
3dad5ae
update hashes for PDF tests
micmarty-deepsense Apr 18, 2024
4463830
fix line too long
micmarty-deepsense Apr 18, 2024
a93a156
reformat elements.py and add more comments
micmarty-deepsense Apr 18, 2024
c8b7c66
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 18, 2024
d8e9a2f
update changelog and version
micmarty-deepsense Apr 18, 2024
ee0392a
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 18, 2024
e578aba
update overview.rst
micmarty-deepsense Apr 18, 2024
b9baf8a
make tests more compact
micmarty-deepsense Apr 18, 2024
a2227fc
update html hashes
micmarty-deepsense Apr 18, 2024
b22784c
remove redundant uniqueness assertion
micmarty-deepsense Apr 18, 2024
68f93bd
update almost all hashes in spring-weather (there are still problemat…
micmarty-deepsense Apr 19, 2024
f50c13c
update hashes for spring-weather
micmarty-deepsense Apr 19, 2024
9ac682f
revert spring water example doc to original
micmarty-deepsense Apr 19, 2024
b10b3c1
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 19, 2024
1178ca0
assign hash ids when doing ingestion
micmarty-deepsense Apr 19, 2024
232c405
revert all changes to test_unstructured_ingest
micmarty-deepsense Apr 19, 2024
973dc29
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 19, 2024
3280625
increase num of expected files in local.sh
micmarty-deepsense Apr 19, 2024
22991c8
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 19, 2024
cc8be15
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 22, 2024
779db46
update version
micmarty-deepsense Apr 22, 2024
af28f77
refactor 1 test in test_auto.py
micmarty-deepsense Apr 22, 2024
ca13902
remove changelong entry duplicate
micmarty-deepsense Apr 23, 2024
f4fd49a
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 23, 2024
4a0b27d
Merge branch 'main' into CORE-3587/better-element-ids
cragwolfe Apr 24, 2024
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,17 @@

## 0.13.4-dev0

### Enhancements
* **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning function are now deterministic and unique at the document level by default. Before, hashes were based only on text; however, they now also take into account the element's sequence number on a page, the page's number in the document, and the document's file name.

### Features

### Fixes

## 0.13.3

### Enhancements
* **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning function are now deterministic and unique at the document level by default. Before, hashes were based only on text; however, they now also take into account the element's sequence number on a page, the page's number in the document, and the document's file name.

* **Remove duplicate image elements**. Remove image elements identified by PDFMiner that have similar bounding boxes and the same text.
* **Add support for `start_index` in `html` links extraction**
8 changes: 3 additions & 5 deletions docs/source/introduction/overview.rst
@@ -141,10 +141,8 @@ a list of elements from JSON, as seen in the snippet below
Unique Element IDs
******************

-By default, the element ID is a SHA-256 hash of the element text. This is to ensure that
-the ID is deterministic. One downside is that the ID is not guaranteed to be unique.
-Different elements with the same text will have the same ID, and there could also
-be hash collisions. To use UUIDs in the output instead, you can pass
+By default, the element ID is a SHA-256 hash of the element's text, its position on the page, page number it's on, and the name of the document file - this is to ensure that the ID is deterministic and unique at the document level.
+To obtain globally unique IDs in the output (UUIDs), you can pass
``unique_element_ids=True`` into any of the partition functions. This can be helpful
if you'd like to use the IDs as a primary key in a database, for example.
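
As a rough illustration of the idea in this paragraph (not the library's actual implementation), a document-level hash ID can be derived by hashing the file name, text, page number, and sequence number together. The helper name `sketch_hash_id`, the field order, the separator, and the 32-character truncation are all assumptions made for this sketch.

```python
import hashlib
from typing import Optional


def sketch_hash_id(
    text: str,
    sequence_number: int,
    filename: Optional[str] = None,
    page_number: Optional[int] = None,
) -> str:
    """Hypothetical helper: hash the element text together with its context."""
    # Assumption: join the fields with a unit separator so that adjacent
    # fields cannot run together ambiguously.
    payload = f"{filename}\x1f{text}\x1f{page_number}\x1f{sequence_number}"
    # Assumption: keep only the first 32 hex characters of the SHA-256 digest.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Two elements with identical text on the same page differ only by sequence number.
first = sketch_hash_id("Example heading.", 0, "doc.html", 1)
second = sketch_hash_id("Example heading.", 1, "doc.html", 1)
```

Because nothing random enters the payload, calling the helper again with the same arguments yields the same ID, while the sequence number keeps duplicate text apart.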

@@ -161,7 +159,7 @@ Element ID Design Principles
#. A partitioning function can assign only one of two available ID types to a returned element: a hash or a UUID.
#. All elements that are returned come with an ID, which is never None.
#. No matter which type of ID is used, it will always be in string format.
-#. Partitioning a document returns elements with hashes as their default IDs.
+#. Partitioning a document returns elements with hashes as their default IDs, ensuring they are both deterministic and unique within a document.

For creating elements independently of partitioning functions, refer to the `Element` class documentation in the source code (`unstructured/documents/elements.py`).
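
The principles above imply that a consumer only ever sees one of two string shapes. A minimal sketch of telling them apart follows; `id_kind` is a hypothetical helper, and the shapes it checks (a canonical 36-character UUID string, or 32 lowercase hex characters for a hash) are assumptions taken from the assertions and fixture values shown in this PR's tests.

```python
import re
import uuid

# Assumption: hash IDs look like the 32-character hex fixtures in the tests.
_HASH_RE = re.compile(r"[0-9a-f]{32}")


def id_kind(element_id: str) -> str:
    """Return "uuid", "hash", or "other" for an ID string (hypothetical helper)."""
    if not isinstance(element_id, str):
        raise TypeError("element IDs are expected to be strings")
    try:
        # uuid.UUID also parses un-hyphenated hex, so require the canonical
        # hyphenated form by comparing against its string rendering.
        if str(uuid.UUID(element_id)) == element_id.lower():
            return "uuid"
    except ValueError:
        pass
    if _HASH_RE.fullmatch(element_id):
        return "hash"
    return "other"
```

For example, `id_kind(str(uuid.uuid4()))` returns `"uuid"`, while a 32-character hex string such as the fixtures in the tests is classified as `"hash"`.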

Binary file added example-docs/duplicate-paragraphs.doc
Binary file not shown.
Binary file added example-docs/duplicate-paragraphs.docx
Binary file not shown.
23 changes: 23 additions & 0 deletions example-docs/fake-html-with-duplicate-elements.html
@@ -0,0 +1,23 @@
<!DOCTYPE html>
<html>

<head>
<title>Simple Nested HTML</title>
</head>

<body>
<h1>Example heading.</h1>
<div>
<span>This is a span.</span>
<span>This is another span.</span>
</div>
<br>
<h1>Example heading.</h1>
<div>
<span>This is a span.</span>
<span>This is another span.</span>
</div>

</body>

</html>
Binary file added example-docs/fake-memo-with-duplicate-page.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion example-docs/spring-weather.html.json
@@ -223,4 +223,4 @@
"page_number": 1
}
}
]
]
114 changes: 96 additions & 18 deletions test_unstructured/documents/test_elements.py
@@ -4,6 +4,7 @@

from __future__ import annotations

import copy
import json
import pathlib
from functools import partial
@@ -28,9 +29,29 @@
RegexMetadata,
Text,
Title,
assign_and_map_hash_ids,
)


@pytest.mark.parametrize("element", [Element(), Text(text=""), CheckBox()])
def test_Element_autoassigns_a_UUID_then_becomes_an_idempotent_and_deterministic_hash(
element: Element,
):
# -- element self-assigns itself a UUID --
assert isinstance(element.id, str)
assert len(element.id) == 36
assert element.id.count("-") == 4

expected_hash = "5336294a19f32ff03ef80066fbc3e0f7"
# -- calling `.id_to_hash()` changes the element's id-type to hash --
assert element.id_to_hash(0) == expected_hash
assert element.id == expected_hash

# -- `.id_to_hash()` is idempotent --
assert element.id_to_hash(0) == expected_hash
assert element.id == expected_hash


def test_Text_is_JSON_serializable():
# -- This should run without an error --
json.dumps(Text(text="hello there!", element_id=None).to_dict())
@@ -45,25 +66,11 @@ def test_Text_is_JSON_serializable():
CheckBox(),
],
)
-def test_Element_autoassigns_a_UUID_then_becomes_an_idempotent_and_deterministic_hash(
-element: Element,
-):
-assert element._element_id is None, "Element should not have an ID yet"
-
-# -- element self-assigns itself a UUID only when the ID is requested --
+def test_Element_self_assigns_itself_a_UUID_id(element: Element):
assert isinstance(element.id, str)
assert len(element.id) == 36
assert element.id.count("-") == 4

-expected_hash = "e3b0c44298fc1c149afbf4c8996fb924"
-# -- calling `.id_to_hash()` changes the element's id-type to hash --
-assert element.id_to_hash() == expected_hash
-assert element.id == expected_hash
-
-# -- `.id_to_hash()` is idempotent --
-assert element.id_to_hash() == expected_hash
-assert element.id == expected_hash


def test_text_element_apply_cleaners():
text_element = Text(text="[1] A Textbook on Crocodile Habitats")
@@ -408,9 +415,10 @@ def and_it_serializes_an_orig_elements_sub_object_to_base64_when_it_is_present(s
assert meta.to_dict() == {
"category_depth": 1,
"orig_elements": (
"eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWt"
"JVm5WDoqNUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQp"
"ZRxYZcCA/1spDtP98dU6DTEw3sa5fWOqs10vH0cLQn0="
"eJyFzcsKwjAQheFXKVm7MGkzbXwDocu6EpFcTqTQG3UEtfTdbZa"
"6cTnDd/jPi0CHHgNf2yAOmXCljjqXoErKoIw3hqJRXlPuyphrEr"
"tM9GAbLNvNL+t2M56ctvU4o0+AXxPSo2m5g9jIb6VwBE0VBSujp"
"1LJ6EiRLpwiSBf3fyvZcbo/vlqnwVvGbZzbN0KT7Hr5AG/eQyM="
),
"page_number": 2,
}
@@ -666,3 +674,73 @@ def it_can_find_the_consolidation_strategy_for_each_of_its_known_fields(self):
f"ElementMetadata field `.{field_name}` does not have a consolidation strategy."
f" Add one in `ConsolidationStrategy.field_consolidation_strategies()."
)


def test_hash_ids_are_unique_for_duplicate_elements():
# GIVEN
parent = Text(text="Parent", metadata=ElementMetadata(page_number=1))
elements = [
parent,
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
]

# WHEN
updated_elements = assign_and_map_hash_ids(copy.deepcopy(elements))
ids = [element.id for element in updated_elements]

# THEN
assert len(ids) == len(set(ids)), "Recalculated IDs must be unique."
assert elements[1].metadata.parent_id == elements[2].metadata.parent_id

for idx, updated_element in enumerate(updated_elements):
assert updated_element.id != elements[idx].id, "IDs haven't changed after recalculation"
if updated_element.metadata.parent_id is not None:
assert updated_element.metadata.parent_id in ids, "Parent ID not in the list of IDs"
assert (
updated_element.metadata.parent_id != elements[idx].metadata.parent_id
), "Parent ID hasn't changed after recalculation"


def test_hash_ids_are_deterministic():
parent = Text(text="Parent", metadata=ElementMetadata(page_number=1))
elements = [
parent,
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
]

updated_elements = assign_and_map_hash_ids(elements)
ids = [element.id for element in updated_elements]
parent_ids = [element.metadata.parent_id for element in updated_elements]

assert ids == [
"ea9eb7e80383c190f8cafce1ad666624",
"4112a8d24886276e18e759d06956021b",
"eba84bbe7f03e8b91a1527323040ee3d",
]
assert parent_ids == [
None,
"ea9eb7e80383c190f8cafce1ad666624",
"ea9eb7e80383c190f8cafce1ad666624",
]


@pytest.mark.parametrize(
("text", "sequence_number", "filename", "page_number", "expected_hash"),
[
# -- pdf files support page numbers --
("foo", 1, "foo.pdf", 1, "4bb264eb23ceb44cd8fcc5af44f8dc71"),
("foo", 2, "foo.pdf", 1, "75fc1de48cf724ec00aa8d1c5a0d3758"),
# -- txt files don't have a page number --
("some text", 0, "some.txt", None, "1a2627b5760c06b1440102f11a1edb0f"),
("some text", 1, "some.txt", None, "e3fd10d867c4a1c0264dde40e3d7e45a"),
],
)
def test_id_to_hash_calculates(text, sequence_number, filename, page_number, expected_hash):
element = Text(
text=text,
metadata=ElementMetadata(filename=filename, page_number=page_number),
)
assert element.id_to_hash(sequence_number) == expected_hash, "Returned ID does not match"
assert element.id == expected_hash, "ID should be set"
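
The tests above check that replacing IDs with hashes also keeps `parent_id` references consistent. A minimal sketch of that two-pass idea is shown below, under stated assumptions: `SketchElement` and `sketch_assign_and_map_hash_ids` are hypothetical stand-ins, not the library's `Element` or `assign_and_map_hash_ids`, and the sequence number here is simply the list index rather than a per-page counter.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SketchElement:
    """Hypothetical stand-in for an element; not the library's class."""

    text: str
    id: str
    parent_id: Optional[str] = None


def sketch_assign_and_map_hash_ids(elements: List[SketchElement]) -> List[SketchElement]:
    # First pass: compute a new hash ID per element and remember old -> new.
    old_to_new: Dict[str, str] = {}
    for seq, el in enumerate(elements):
        new_id = hashlib.sha256(f"{el.text}|{seq}".encode("utf-8")).hexdigest()[:32]
        old_to_new[el.id] = new_id
        el.id = new_id
    # Second pass: rewrite parent references so they point at the new IDs.
    for el in elements:
        if el.parent_id is not None:
            el.parent_id = old_to_new.get(el.parent_id, el.parent_id)
    return elements


parent = SketchElement("Parent", str(uuid.uuid4()))
children = [SketchElement("Element", str(uuid.uuid4()), parent.id) for _ in range(2)]
result = sketch_assign_and_map_hash_ids([parent, *children])
```

Doing the remapping in a separate pass means a child can be updated correctly even if it appears before its parent in the list.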
30 changes: 25 additions & 5 deletions test_unstructured/documents/test_email_elements.py
@@ -4,21 +4,41 @@

from unstructured.cleaners.core import clean_prefix
from unstructured.cleaners.translate import translate_text
-from unstructured.documents.email_elements import EmailElement, Name
+from unstructured.documents.email_elements import EmailElement, Name, Subject


@pytest.mark.parametrize(
"element", [EmailElement(text=""), Name(text="", name=""), Subject(text="")]
)
def test_EmailElement_autoassigns_a_UUID_then_becomes_an_idempotent_and_deterministic_hash(
element: EmailElement,
):
# -- element self-assigns itself a UUID --
assert isinstance(element.id, str)
assert len(element.id) == 36
assert element.id.count("-") == 4

expected_hash = "5336294a19f32ff03ef80066fbc3e0f7"
# -- calling `.id_to_hash()` changes the element's id-type to hash --
assert element.id_to_hash(0) == expected_hash
assert element.id == expected_hash

# -- `.id_to_hash()` is idempotent --
assert element.id_to_hash(0) == expected_hash


def test_Name_should_assign_a_deterministic_and_an_idempotent_hash():
element = Name(name="Example", text="hello there!")
-expected_hash = "c69509590d81db2f37f9d75480c8efed"
+expected_hash = "7d191bcecf80c122578c497de5f0dae7"

assert element._element_id is None, "Element should not have an ID yet"

# -- calculating hash for the first time --
-assert element.id_to_hash() == expected_hash
+assert element.id_to_hash(0) == expected_hash
assert element.id == expected_hash

# -- `.id_to_hash()` is idempotent --
-assert element.id_to_hash() == expected_hash
+assert element.id_to_hash(0) == expected_hash
assert element.id == expected_hash


@@ -30,7 +50,7 @@ def test_Name_should_assign_a_deterministic_and_an_idempotent_hash():
Name(name="Example", text="hello there!", element_id=None),
],
)
-def test_EmailElement_should_assign_a_UUID_only_once_and_only_at_the_first_id_request(
+def test_EmailElement_assigns_a_UUID_only_once_and_only_at_the_first_id_request(
element: EmailElement,
):
assert element._element_id is None, "Element should not have an ID yet"
14 changes: 14 additions & 0 deletions test_unstructured/partition/docx/test_doc.py
@@ -19,6 +19,20 @@
from unstructured.partition.docx import partition_docx


def test_partition_doc_for_deterministic_and_unique_ids():
ids = [element.id for element in partition_doc("example-docs/duplicate-paragraphs.doc")]

assert ids == [
"ade273c622c48d67a7be7b3816d5b4d8",
"7d0b32fdf169f9578723486cb4bc1235",
"1feb6e8e9c1662cfaef75907aeeb0900",
"aa2a8ac10143b12f0fe2087837ea11d2",
"da31ba7ed3919067d2c6572dc1617271",
"1914359c179a160df921b769acf8c353",
"f9d0d379fc791bae487b7a45f65caa50",
]


@pytest.fixture()
def mock_document():
document = docx.Document()
15 changes: 15 additions & 0 deletions test_unstructured/partition/docx/test_docx.py
@@ -866,3 +866,18 @@ def mock_document_file_path(mock_document: Document, tmp_path: pathlib.Path) ->
filename = str(tmp_path / "mock_document.docx")
mock_document.save(filename)
return filename


def test_ids_are_unique_and_deterministic():
elements = partition_docx("example-docs/duplicate-paragraphs.docx")

ids = [e.id for e in elements]
assert ids == [
"2f22d82eea1faf5f40dac60cef52700e",
"ca9e1f448e531a5152d960e14eefc360",
"9ddeacb172ac17fb45e6f3f15f3c703d",
"a4fd85d3f4141acae38c8f9c936ed2f3",
"44ebaaf66640719c918246d4ccba1c45",
"f36e8ebcb3b6a051940a168fe73cbc44",
"532b395177652c7d61e1e4d855f1dc1d",
], "IDs are not deterministic"

Review comment by @cragwolfe (Contributor), Apr 23, 2024:

Don't we just care about the number of distinct element IDs rather than the IDs themselves? (not a blocker)

EDIT: maybe not for the "deterministic" part. We can leave this for now.

Reply (Contributor Author):

At some point I was doing this in combination with:

assert len(ids) == len(set(ids))

Steve suggested that it's redundant, since checking the hashes explicitly already ensures their uniqueness.
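
The point made in the review thread on this test, that comparing against an explicit list of hashes also covers uniqueness, can be sketched as follows. The helper `ids_match_and_are_unique` is hypothetical; the expected values are copied from the docx test in this diff.

```python
from typing import List

# Expected IDs copied from test_ids_are_unique_and_deterministic in this diff.
EXPECTED_IDS = [
    "2f22d82eea1faf5f40dac60cef52700e",
    "ca9e1f448e531a5152d960e14eefc360",
    "9ddeacb172ac17fb45e6f3f15f3c703d",
    "a4fd85d3f4141acae38c8f9c936ed2f3",
    "44ebaaf66640719c918246d4ccba1c45",
    "f36e8ebcb3b6a051940a168fe73cbc44",
    "532b395177652c7d61e1e4d855f1dc1d",
]


def ids_match_and_are_unique(ids: List[str], expected: List[str]) -> bool:
    """Hypothetical helper: one equality check covers both properties."""
    # Equality with a fixed list pins both values and order (determinism).
    # Uniqueness follows only because the expected list itself has no repeats.
    return ids == expected and len(set(expected)) == len(expected)
```

The separate `len(ids) == len(set(ids))` assertion would only add information if the expected list could contain repeated values.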
68 changes: 68 additions & 0 deletions test_unstructured/partition/pdf_image/test_pdf.py
@@ -1240,3 +1240,71 @@ def test_partition_pdf_always_keep_all_image_elements(
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
assert len(image_elements) == 3


@pytest.fixture()
def expected_element_ids_for_fast_strategy():
return [
"27a6cb3e5a4ad399b2f865729bbd3840",
"a90a54baba0093296a013d26b7acbc17",
"9be424e2d151dac4b5f36a85e9bbfe65",
"4631da875fb4996c63b2d80cea6b588e",
"6264f4eda97a049f4710f9bea0c01cbd",
"abded7b2ff3a5542c88b4a831755ec24",
"b781ea5123cb31e0571391b7b42cac75",
"033f27d2618ba4cda9068b267b5a731e",
"8982a12fcced30dd12ccbf61d14f30bf",
"41af2fd5df0cf47aa7e8ecca200d3ac6",
]


@pytest.fixture()
def expected_element_ids_for_hi_res_strategy():
return [
"27a6cb3e5a4ad399b2f865729bbd3840",
"a90a54baba0093296a013d26b7acbc17",
"9be424e2d151dac4b5f36a85e9bbfe65",
"4631da875fb4996c63b2d80cea6b588e",
"6264f4eda97a049f4710f9bea0c01cbd",
"abded7b2ff3a5542c88b4a831755ec24",
"b781ea5123cb31e0571391b7b42cac75",
"033f27d2618ba4cda9068b267b5a731e",
"8982a12fcced30dd12ccbf61d14f30bf",
"41af2fd5df0cf47aa7e8ecca200d3ac6",
]


@pytest.fixture()
def expected_element_ids_for_ocr_strategy():
return [
"272ab65cbe81795161128aea59599d83",
"b38affd7bbbb3dddf5c85ba8b14d380d",
"65903214d456b8b3cba6faa6714bd9ba",
"5b41ceae05dcfaeeac32ff8e82dc2ff1",
"6582fc6c6c595225feeddcc3263f0ae3",
"64b610c8f4274f1ce2175bf30814409d",
"8edde8bf2d3a68370dc4bd142c408ca4",
"a052bc17696043efce2e4f4f28393a83",
]


@pytest.fixture()
def expected_ids(request):
return request.getfixturevalue(request.param)


@pytest.mark.parametrize(
("strategy", "expected_ids"),
[
(PartitionStrategy.FAST, "expected_element_ids_for_fast_strategy"),
(PartitionStrategy.HI_RES, "expected_element_ids_for_hi_res_strategy"),
(PartitionStrategy.OCR_ONLY, "expected_element_ids_for_ocr_strategy"),
],
indirect=["expected_ids"],
)
def test_unique_and_deterministic_element_ids(strategy, expected_ids):
elements = pdf.partition_pdf(
"example-docs/fake-memo-with-duplicate-page.pdf", strategy=strategy, starting_page_number=2
)
ids = [element.id for element in elements]
assert ids == expected_ids, "Element IDs do not match expected IDs"