Better element IDs - deterministic and document-unique hashes #2673
Merged
Changes from 198 commits
201 commits
e68a7f5
prototype solution for PDF files
micmarty-deepsense f3f3321
add basic tests for element IDs
micmarty-deepsense 3398be3
recalculate ID based on metadata (if present)
micmarty-deepsense 76cbaef
add more unit tests
micmarty-deepsense 272f4a6
add HashValue class to identify when ID recalculation is required
micmarty-deepsense 3375f63
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense b71369c
add test and a set of fixtures for unique and deterministic pdf eleme…
micmarty-deepsense e5d90ab
update hash computation so it allows for appending other data
micmarty-deepsense 608bdbe
add given when then comments
micmarty-deepsense 4c139de
add docstring
micmarty-deepsense 02ca092
add html tests
micmarty-deepsense 0d33608
revert unused change
micmarty-deepsense e813b89
remove Text element tests for page_number and index_on_page
micmarty-deepsense 3f87ad2
recalculate_ids outside of the Text class
micmarty-deepsense e2eea35
get rid of index_on_page
micmarty-deepsense d14a47e
revert _id to id
micmarty-deepsense bc29126
simplify hash calculation function
micmarty-deepsense cd9b9b3
remove uuid.UUID from type hints for self.id
micmarty-deepsense f49c68d
quickfix calculate_hash function call
micmarty-deepsense eef4264
update PPTX test
micmarty-deepsense b6e850f
add docx test
micmarty-deepsense 7efe3a4
refactor ids recalculation by moving it to process_metadata decorator
micmarty-deepsense 4c393f4
remove unused code
micmarty-deepsense 8f9c445
revert isinstance statement
micmarty-deepsense d33c86c
revert inline return statement
micmarty-deepsense 3becd44
add tests for calculating hash and recalculatind ids
micmarty-deepsense 183c38b
do dont mutate, but copy elements
micmarty-deepsense 01f16c8
update docs hashes
micmarty-deepsense fbdaefe
add doc tests
micmarty-deepsense 03fc126
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 289b1b3
refactor recalculate_ids so it updates parent_id's correctly
micmarty-deepsense 56783b5
rename calculate_hash into id_to_hash and make it a method
micmarty-deepsense 5509bb5
revert existing logic of assigning id's at construction-time
micmarty-deepsense 82efb28
remove unused code
micmarty-deepsense 1812435
apply code review suggestions in tests
micmarty-deepsense bc35458
rename "recalculate_ids" with "assign_hash_ids"
micmarty-deepsense cdac860
remove test which is no longer relevant
micmarty-deepsense dfe446d
update html test file and test itself
micmarty-deepsense 1d01dfb
add test_id_to_hash
micmarty-deepsense 9350866
handle edge case for xlsx files
micmarty-deepsense dfba0fb
update file name
micmarty-deepsense 75f3e88
revert original id in test
micmarty-deepsense e179a55
use deepcopy in test to compare if ids have changed
micmarty-deepsense 6d93e0b
revert to construction-time UUIDs
micmarty-deepsense baa0540
explicit warning in assign_hash_id
micmarty-deepsense d51b8c1
add dummy copy of id_to_hash to class "Name(EmailElement)"
micmarty-deepsense 1e0a4a5
update hashes in tests
micmarty-deepsense 1f64d46
adjust hash values for pptx hierarchy test
micmarty-deepsense 4b5b84c
remove unused file
micmarty-deepsense 49d899d
adjust pdf hashes in a test
micmarty-deepsense 24b7b5b
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 8911fa3
update overview.rst
micmarty-deepsense 5d0ed03
remove deprecated test
micmarty-deepsense 6fe739a
raise if element_id is not a string or NoId
micmarty-deepsense 135f8af
update CHANGELOG
micmarty-deepsense c578aa7
quickfix ruff warnings
micmarty-deepsense dd0b949
quickfix changelog
micmarty-deepsense ca53a97
update __version__
micmarty-deepsense a6cae7b
Better element IDs <- Ingest test fixtures update (#2832)
ryannikolaidis 6b2cffa
use hash for label studio annotations
micmarty-deepsense fa4cb39
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 0126788
adjust email test
micmarty-deepsense 237636a
improve email element design
micmarty-deepsense 007b733
fix chunking
micmarty-deepsense 23dbbb1
update the docstring for assign_hash_ids
micmarty-deepsense 3a6d04a
remove try except
micmarty-deepsense 515cb52
don't call id_to_uuid, elements already have UUIDs
micmarty-deepsense 652d6c2
move id_to_hash from Text to Element
micmarty-deepsense 6ed2c7e
reorder methods to alphabetical order
micmarty-deepsense 452e3cd
remove unused id_to_uuid
micmarty-deepsense 86023f8
update hashes in tests
micmarty-deepsense 3bd0745
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 810dce1
Better element IDs <- Ingest test fixtures update (#2839)
ryannikolaidis 5a58acd
remove unused imports
micmarty-deepsense a4e654f
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 298fad1
update hashes
micmarty-deepsense 339a440
refactor one test in test_email_elements.py
micmarty-deepsense 0d65b02
fix KeyErrors for stanley-cups
micmarty-deepsense 9dbf6ae
merge 2 tests into 1
micmarty-deepsense a2e4302
update pdf hashes
micmarty-deepsense cd3cdc4
fix label studio tests
micmarty-deepsense 2d27057
fix baseplate tests
micmarty-deepsense d1ecb40
add element ID design principles section in the documentation
micmarty-deepsense fd3b55a
Better element IDs <- Ingest test fixtures update (#2840)
ryannikolaidis ebb1209
update Element docstrings
micmarty-deepsense 2c398e6
change num of expected files in local ingest from 12 to 13
micmarty-deepsense 6b5dddf
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 639423a
modify default behavior of Element, Text, and Name class
micmarty-deepsense ad9f58f
quickfix id initialization in Element
micmarty-deepsense ee90096
move id initialization to Element
micmarty-deepsense 2d9a127
refactor id assertions in test_elements.py
micmarty-deepsense f5c650a
quickfix bug, forgot to remove invalid assignment
micmarty-deepsense 11c4041
add changelog entry
micmarty-deepsense 20d7c2f
adjust email tests
micmarty-deepsense 4aadf22
fix chunking
micmarty-deepsense 45973ee
remove unnecessary enumeration and remove argument to id_to_hash
micmarty-deepsense 3f51745
remove unused import
micmarty-deepsense bc93f54
quickfix support for | operand in 3.9
micmarty-deepsense e9a5dcf
add design principles in overview.rst
micmarty-deepsense f36f76c
fix staging test by using deterministic hashes
micmarty-deepsense 71e70d7
fix tests that were failing due to invalid text_as_html consolidation
micmarty-deepsense 17d6585
add empty lines
micmarty-deepsense 511cc05
quickfix typo
micmarty-deepsense 96e5b67
parametrize test_text_uuid
micmarty-deepsense d555736
Preparing the ground for better element IDs <- Ingest test fixtures u…
ryannikolaidis 1a38d70
adjust ingestion chunking config
micmarty-deepsense 920837a
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
micmarty-deepsense 933cef0
adjust ingestion chunking config
micmarty-deepsense 53599d3
Preparing the ground for better element IDs <- Ingest test fixtures u…
ryannikolaidis 9507a96
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis 39d7958
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense 939f54d
use hashes in partitioner
micmarty-deepsense c4c91a5
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 793ef37
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 36aeefd
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense f0e0149
remove unused import
micmarty-deepsense 1c35139
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
micmarty-deepsense bec0b90
move id_to_hash to interfaces.py
micmarty-deepsense b5e53bb
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis 38cd4fa
ignore mongodb.sh in test-ingest-src.sh
micmarty-deepsense ac94588
remove redundant loop with id_to_hash
micmarty-deepsense 52e830b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 7052896
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense c5b16f3
update changelog and sync version
micmarty-deepsense af879ab
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis 82546ed
revert ignoring mongodb.sh
micmarty-deepsense 79afe97
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense ea6b881
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense 4d9dbc2
rename assign_hash_ids to assign_and_map_hash_ids
micmarty-deepsense adb4592
change expected argument type for element_id in CheckBox
micmarty-deepsense 836e514
add a test utility for assigning hash ids
micmarty-deepsense c0add80
more detailed element test
micmarty-deepsense e20666d
rename test
micmarty-deepsense cf62230
remove redundant line
micmarty-deepsense 178bf57
bump version
micmarty-deepsense 73d8edd
Merge branch 'mike/preparing-ground-for-better-element-ids' into CORE…
micmarty-deepsense 170141f
update test name
micmarty-deepsense 3937ac4
quickfix amgiguity in hash assigning function calls
micmarty-deepsense 3ee35ce
update CHANGELOG
micmarty-deepsense 5017e6c
remove unused import
micmarty-deepsense f0def7b
adjust hashes in test
micmarty-deepsense 560d2cd
fix missing argument to id_to_hash
micmarty-deepsense bde2907
update hash in test
micmarty-deepsense ba28243
update email tests
micmarty-deepsense ca5861c
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis 2a9f0b8
fix a bug: sharing one memory address
micmarty-deepsense 21914f2
refactor assign_and_map_hash_ids according to review sugestions
micmarty-deepsense 6e80fb4
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 3916d0d
make pytest.mark.parametrize body compact
micmarty-deepsense 3250abe
add 2 example docs and adjust related tests
micmarty-deepsense fe7fa00
move assign_hash_ids from test_utils to unit_utils
micmarty-deepsense 66c3f23
apply other minor review suggestions
micmarty-deepsense 8fa2666
remove unused import
micmarty-deepsense aab6bad
add pdf with duplicate page and refactor related test
micmarty-deepsense 4fd7d62
quickfix importing assign_hash_ids
micmarty-deepsense 90a1880
remove unused imports
micmarty-deepsense ae2cd30
get rid of List type
micmarty-deepsense c0d1bb1
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense bdc0c3b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 461f9b9
remove unused imports
micmarty-deepsense 624ba1a
remove unused imports
micmarty-deepsense 8e100f7
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense 4ef4821
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis a3e5d60
remove unused argument
micmarty-deepsense 562df25
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense b17c80f
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 89c7f27
clean up after resolving conflicts
micmarty-deepsense e2f4c3c
update hash ids for test
micmarty-deepsense 3c07881
use seq_on_page in hash calculation
micmarty-deepsense e0b02ec
support for starting_page_number in ODT files
micmarty-deepsense cf45f7a
update hashes for doc and docx tests, remove redundant assertion
micmarty-deepsense ff2fd2f
include filename in hash calculation
micmarty-deepsense eeb1ea6
fix bug of sharing one metadata object by multiple elements for msg f…
micmarty-deepsense 33ae279
update hashes in tests and refactor them slightly
micmarty-deepsense f6ec6a0
adjust pptx test cases
micmarty-deepsense 9f4aded
update hashes for staging tests
micmarty-deepsense 3dad5ae
update hashes for PDF tests
micmarty-deepsense 4463830
fix line too long
micmarty-deepsense a93a156
reformat elements.py and add more comments
micmarty-deepsense c8b7c66
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis d8e9a2f
update changelog and version
micmarty-deepsense ee0392a
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense e578aba
update overview.rst
micmarty-deepsense b9baf8a
make tests more compact
micmarty-deepsense a2227fc
update html hashes
micmarty-deepsense b22784c
remove redundant uniqueness assertion
micmarty-deepsense 68f93bd
update almost all hashes in spring-weather (there are still problemat…
micmarty-deepsense f50c13c
update hashes for spring-weather
micmarty-deepsense 9ac682f
revert spring water example doc to original
micmarty-deepsense b10b3c1
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 1178ca0
assign hash ids when doing ingestion
micmarty-deepsense 232c405
revert all changes to test_unstructured_ingest
micmarty-deepsense 973dc29
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis 3280625
increase num of expected files in local.sh
micmarty-deepsense 22991c8
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense cc8be15
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 779db46
update version
micmarty-deepsense af28f77
refactor 1 test in test_auto.py
micmarty-deepsense ca13902
remove changelong entry duplicate
micmarty-deepsense f4fd49a
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense 4a0b27d
Merge branch 'main' into CORE-3587/better-element-ids
cragwolfe
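Several of the commits above mention replacing construction-time UUIDs with hash IDs, updating `parent_id` references accordingly, and copying elements rather than mutating them. A minimal sketch of that idea follows. The `Node` class, the `assign_and_map` function, and the hash input are hypothetical stand-ins for illustration, not the library's actual code.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a document element."""

    id: str
    text: str
    parent_id: Optional[str] = None


def assign_and_map(nodes: list) -> list:
    """Return copies of `nodes` with hash IDs and remapped parent IDs."""
    # First pass: build a mapping from each old ID to its new hash ID.
    # The hash input (position plus text) is an assumption for this sketch.
    id_map = {
        node.id: hashlib.sha256(f"{index}:{node.text}".encode("utf-8")).hexdigest()[:32]
        for index, node in enumerate(nodes)
    }
    # Second pass: copy every node, swapping in the new ID and translating
    # parent_id through the same mapping (unknown or missing parents are kept).
    return [
        replace(
            node,
            id=id_map[node.id],
            parent_id=id_map.get(node.parent_id, node.parent_id),
        )
        for node in nodes
    ]


original = [Node("uuid-1", "Title"), Node("uuid-2", "Body", parent_id="uuid-1")]
remapped = assign_and_map(original)
```

Building the full mapping before touching any `parent_id` means a child can be remapped even when its parent appears later in the list.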
    @@ -0,0 +1,23 @@
    +<!DOCTYPE html>
    +<html>
    +
    +<head>
    +  <title>Simple Nested HTML</title>
    +</head>
    +
    +<body>
    +  <h1>Example heading.</h1>
    +  <div>
    +    <span>This is a span.</span>
    +    <span>This is another span.</span>
    +  </div>
    +  <br>
    +  <h1>Example heading.</h1>
    +  <div>
    +    <span>This is a span.</span>
    +    <span>This is another span.</span>
    +  </div>
    +
    +</body>
    +
    +</html>
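The fixture above repeats the same heading and spans twice, which is the case where a content-only hash would collide. The sketch below shows one way position can be mixed into the hash, loosely following the commit messages that mention `seq_on_page` and the filename. The `element_hash` helper, the JSON-based encoding, the 32-character truncation, and the filename `duplicates.html` are all assumptions made for this illustration.

```python
import hashlib
import json


def element_hash(filename: str, text: str, page_number: int, seq_on_page: int) -> str:
    """Hypothetical helper: deterministic ID from content plus position."""
    # JSON-encode the fields so that adjacent values cannot run together
    # (e.g. page 1 / sequence 12 versus page 11 / sequence 2).
    data = json.dumps([filename, text, page_number, seq_on_page])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


# Texts as they might be extracted from the fixture (an assumption).
texts = [
    "Example heading.",
    "This is a span.",
    "This is another span.",
    "Example heading.",
    "This is a span.",
    "This is another span.",
]

# Including the sequence number gives each repeated text its own ID.
ids = [element_hash("duplicates.html", text, 1, seq) for seq, text in enumerate(texts)]

# Ignoring position (constant sequence number) makes the duplicates collide.
content_only_ids = {element_hash("duplicates.html", text, 1, 0) for text in texts}
```

Because the inputs are fixed properties of the document, re-running the computation yields the same IDs, while the sequence number keeps repeated text distinct within the document.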
    @@ -223,4 +223,4 @@
           "page_number": 1
         }
       }
    -]
    +]
Don't we just care about the number of distinct element IDs rather than the IDs themselves? (Not a blocker.)
EDIT: Maybe not for the "deterministic" part. We can leave this for now.

At some point I was doing this in combination with:
Steve suggested that it's redundant, since checking hashes explicitly already ensures their uniqueness.
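To make the trade-off in this exchange concrete, here is a small sketch of the two kinds of check being compared. The helper names and the sample ID values are hypothetical and are not taken from the PR's test suite.

```python
from typing import Sequence


def ids_are_unique(ids: Sequence[str]) -> bool:
    """Count-based check: passes for any set of distinct IDs."""
    return len(set(ids)) == len(ids)


def ids_match_expected(ids: Sequence[str], expected: Sequence[str]) -> bool:
    """Explicit check: passes only for exactly these IDs, in this order."""
    return list(ids) == list(expected)


# Placeholder values standing in for real hash IDs.
expected_ids = ["a1f0", "b2e1", "c3d2"]

stable_run = ["a1f0", "b2e1", "c3d2"]    # same IDs as expected
drifted_run = ["ffff", "eeee", "dddd"]   # distinct, but different from expected

unique_only_misses_drift = ids_are_unique(drifted_run)
explicit_catches_drift = not ids_match_expected(drifted_run, expected_ids)
```

The count-based check accepts the drifted run, so it says nothing about determinism. The explicit comparison rejects it, and whenever the expected values are themselves distinct, a passing explicit comparison also implies uniqueness.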