Better element IDs - deterministic and document-unique hashes #2673

Merged
merged 201 commits into from
Apr 24, 2024
Changes from 12 commits
e68a7f5
prototype solution for PDF files
micmarty-deepsense Mar 20, 2024
f3f3321
add basic tests for element IDs
micmarty-deepsense Mar 21, 2024
3398be3
recalculate ID based on metadata (if present)
micmarty-deepsense Mar 21, 2024
76cbaef
add more unit tests
micmarty-deepsense Mar 21, 2024
272f4a6
add HashValue class to identify when ID recalculation is required
micmarty-deepsense Mar 21, 2024
3375f63
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Mar 25, 2024
b71369c
add test and a set of fixtures for unique and deterministic pdf eleme…
micmarty-deepsense Mar 26, 2024
e5d90ab
update hash computation so it allows for appending other data
micmarty-deepsense Mar 26, 2024
608bdbe
add given when then comments
micmarty-deepsense Mar 26, 2024
4c139de
add docstring
micmarty-deepsense Mar 26, 2024
02ca092
add html tests
micmarty-deepsense Mar 26, 2024
0d33608
revert unused change
micmarty-deepsense Mar 26, 2024
e813b89
remove Text element tests for page_number and index_on_page
micmarty-deepsense Mar 27, 2024
3f87ad2
recalculate_ids outside of the Text class
micmarty-deepsense Mar 27, 2024
e2eea35
get rid of index_on_page
micmarty-deepsense Mar 27, 2024
d14a47e
revert _id to id
micmarty-deepsense Mar 27, 2024
bc29126
simplify hash calculation function
micmarty-deepsense Mar 27, 2024
cd9b9b3
remove uuid.UUID from type hints for self.id
micmarty-deepsense Mar 27, 2024
f49c68d
quickfix calculate_hash function call
micmarty-deepsense Mar 27, 2024
eef4264
update PPTX test
micmarty-deepsense Mar 27, 2024
b6e850f
add docx test
micmarty-deepsense Mar 27, 2024
7efe3a4
refactor ids recalculation by moving it to process_metadata decorator
micmarty-deepsense Mar 28, 2024
4c393f4
remove unused code
micmarty-deepsense Mar 28, 2024
8f9c445
revert isinstance statement
micmarty-deepsense Mar 28, 2024
d33c86c
revert inline return statement
micmarty-deepsense Mar 28, 2024
3becd44
add tests for calculating hash and recalculating ids
micmarty-deepsense Mar 28, 2024
183c38b
don't mutate, but copy elements
micmarty-deepsense Mar 28, 2024
01f16c8
update docs hashes
micmarty-deepsense Mar 28, 2024
fbdaefe
add doc tests
micmarty-deepsense Mar 28, 2024
03fc126
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Mar 28, 2024
289b1b3
refactor recalculate_ids so it updates parent_id's correctly
micmarty-deepsense Mar 29, 2024
56783b5
rename calculate_hash into id_to_hash and make it a method
micmarty-deepsense Mar 29, 2024
5509bb5
revert existing logic of assigning id's at construction-time
micmarty-deepsense Mar 29, 2024
82efb28
remove unused code
micmarty-deepsense Mar 29, 2024
1812435
apply code review suggestions in tests
micmarty-deepsense Mar 29, 2024
bc35458
rename "recalculate_ids" with "assign_hash_ids"
micmarty-deepsense Mar 29, 2024
cdac860
remove test which is no longer relevant
micmarty-deepsense Mar 29, 2024
dfe446d
update html test file and test itself
micmarty-deepsense Mar 29, 2024
1d01dfb
add test_id_to_hash
micmarty-deepsense Mar 29, 2024
9350866
handle edge case for xlsx files
micmarty-deepsense Mar 29, 2024
dfba0fb
update file name
micmarty-deepsense Mar 29, 2024
75f3e88
revert original id in test
micmarty-deepsense Mar 29, 2024
e179a55
use deepcopy in test to compare if ids have changed
micmarty-deepsense Mar 29, 2024
6d93e0b
revert to construction-time UUIDs
micmarty-deepsense Mar 29, 2024
baa0540
explicit warning in assign_hash_id
micmarty-deepsense Apr 2, 2024
d51b8c1
add dummy copy of id_to_hash to class "Name(EmailElement)"
micmarty-deepsense Apr 2, 2024
1e0a4a5
update hashes in tests
micmarty-deepsense Apr 2, 2024
1f64d46
adjust hash values for pptx hierarchy test
micmarty-deepsense Apr 2, 2024
4b5b84c
remove unused file
micmarty-deepsense Apr 2, 2024
49d899d
adjust pdf hashes in a test
micmarty-deepsense Apr 2, 2024
24b7b5b
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 2, 2024
8911fa3
update overview.rst
micmarty-deepsense Apr 2, 2024
5d0ed03
remove deprecated test
micmarty-deepsense Apr 2, 2024
6fe739a
raise if element_id is not a string or NoId
micmarty-deepsense Apr 2, 2024
135f8af
update CHANGELOG
micmarty-deepsense Apr 2, 2024
c578aa7
quickfix ruff warnings
micmarty-deepsense Apr 2, 2024
dd0b949
quickfix changelog
micmarty-deepsense Apr 2, 2024
ca53a97
update __version__
micmarty-deepsense Apr 2, 2024
a6cae7b
Better element IDs <- Ingest test fixtures update (#2832)
ryannikolaidis Apr 2, 2024
6b2cffa
use hash for label studio annotations
micmarty-deepsense Apr 2, 2024
fa4cb39
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 2, 2024
0126788
adjust email test
micmarty-deepsense Apr 2, 2024
237636a
improve email element design
micmarty-deepsense Apr 2, 2024
007b733
fix chunking
micmarty-deepsense Apr 2, 2024
23dbbb1
update the docstring for assign_hash_ids
micmarty-deepsense Apr 3, 2024
3a6d04a
remove try except
micmarty-deepsense Apr 3, 2024
515cb52
don't call id_to_uuid, elements already have UUIDs
micmarty-deepsense Apr 3, 2024
652d6c2
move id_to_hash from Text to Element
micmarty-deepsense Apr 3, 2024
6ed2c7e
reorder methods to alphabetical order
micmarty-deepsense Apr 3, 2024
452e3cd
remove unused id_to_uuid
micmarty-deepsense Apr 3, 2024
86023f8
update hashes in tests
micmarty-deepsense Apr 3, 2024
3bd0745
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 3, 2024
810dce1
Better element IDs <- Ingest test fixtures update (#2839)
ryannikolaidis Apr 3, 2024
5a58acd
remove unused imports
micmarty-deepsense Apr 3, 2024
a4e654f
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 3, 2024
298fad1
update hashes
micmarty-deepsense Apr 3, 2024
339a440
refactor one test in test_email_elements.py
micmarty-deepsense Apr 3, 2024
0d65b02
fix KeyErrors for stanley-cups
micmarty-deepsense Apr 3, 2024
9dbf6ae
merge 2 tests into 1
micmarty-deepsense Apr 3, 2024
a2e4302
update pdf hashes
micmarty-deepsense Apr 3, 2024
cd3cdc4
fix label studio tests
micmarty-deepsense Apr 3, 2024
2d27057
fix baseplate tests
micmarty-deepsense Apr 3, 2024
d1ecb40
add element ID design principles section in the documentation
micmarty-deepsense Apr 3, 2024
fd3b55a
Better element IDs <- Ingest test fixtures update (#2840)
ryannikolaidis Apr 3, 2024
ebb1209
update Element docstrings
micmarty-deepsense Apr 3, 2024
2c398e6
change num of expected files in local ingest from 12 to 13
micmarty-deepsense Apr 3, 2024
6b5dddf
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 3, 2024
639423a
modify default behavior of Element, Text, and Name class
micmarty-deepsense Apr 3, 2024
ad9f58f
quickfix id initialization in Element
micmarty-deepsense Apr 3, 2024
ee90096
move id initialization to Element
micmarty-deepsense Apr 3, 2024
2d9a127
refactor id assertions in test_elements.py
micmarty-deepsense Apr 3, 2024
f5c650a
quickfix bug, forgot to remove invalid assignment
micmarty-deepsense Apr 3, 2024
11c4041
add changelog entry
micmarty-deepsense Apr 3, 2024
20d7c2f
adjust email tests
micmarty-deepsense Apr 3, 2024
4aadf22
fix chunking
micmarty-deepsense Apr 3, 2024
45973ee
remove unnecessary enumeration and remove argument to id_to_hash
micmarty-deepsense Apr 3, 2024
3f51745
remove unused import
micmarty-deepsense Apr 3, 2024
bc93f54
quickfix support for | operand in 3.9
micmarty-deepsense Apr 3, 2024
e9a5dcf
add design principles in overview.rst
micmarty-deepsense Apr 3, 2024
f36f76c
fix staging test by using deterministic hashes
micmarty-deepsense Apr 3, 2024
71e70d7
fix tests that were failing due to invalid text_as_html consolidation
micmarty-deepsense Apr 3, 2024
17d6585
add empty lines
micmarty-deepsense Apr 3, 2024
511cc05
quickfix typo
micmarty-deepsense Apr 3, 2024
96e5b67
parametrize test_text_uuid
micmarty-deepsense Apr 3, 2024
d555736
Preparing the ground for better element IDs <- Ingest test fixtures u…
ryannikolaidis Apr 3, 2024
1a38d70
adjust ingestion chunking config
micmarty-deepsense Apr 3, 2024
920837a
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
micmarty-deepsense Apr 3, 2024
933cef0
adjust ingestion chunking config
micmarty-deepsense Apr 3, 2024
53599d3
Preparing the ground for better element IDs <- Ingest test fixtures u…
ryannikolaidis Apr 3, 2024
9507a96
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 3, 2024
39d7958
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense Apr 4, 2024
939f54d
use hashes in partitioner
micmarty-deepsense Apr 4, 2024
c4c91a5
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 4, 2024
793ef37
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 4, 2024
36aeefd
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense Apr 4, 2024
f0e0149
remove unused import
micmarty-deepsense Apr 4, 2024
1c35139
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
micmarty-deepsense Apr 4, 2024
bec0b90
move id_to_hash to interfaces.py
micmarty-deepsense Apr 4, 2024
b5e53bb
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 4, 2024
38cd4fa
ignore mongodb.sh in test-ingest-src.sh
micmarty-deepsense Apr 5, 2024
ac94588
remove redundant loop with id_to_hash
micmarty-deepsense Apr 5, 2024
52e830b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 5, 2024
7052896
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 5, 2024
c5b16f3
update changelog and sync version
micmarty-deepsense Apr 5, 2024
af879ab
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 5, 2024
82546ed
revert ignoring mongodb.sh
micmarty-deepsense Apr 5, 2024
79afe97
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 5, 2024
ea6b881
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
micmarty-deepsense Apr 8, 2024
4d9dbc2
rename assign_hash_ids to assign_and_map_hash_ids
micmarty-deepsense Apr 8, 2024
adb4592
change expected argument type for element_id in CheckBox
micmarty-deepsense Apr 8, 2024
836e514
add a test utility for assigning hash ids
micmarty-deepsense Apr 8, 2024
c0add80
more detailed element test
micmarty-deepsense Apr 8, 2024
e20666d
rename test
micmarty-deepsense Apr 8, 2024
cf62230
remove redundant line
micmarty-deepsense Apr 8, 2024
178bf57
bump version
micmarty-deepsense Apr 8, 2024
73d8edd
Merge branch 'mike/preparing-ground-for-better-element-ids' into CORE…
micmarty-deepsense Apr 8, 2024
170141f
update test name
micmarty-deepsense Apr 8, 2024
3937ac4
quickfix ambiguity in hash assigning function calls
micmarty-deepsense Apr 8, 2024
3ee35ce
update CHANGELOG
micmarty-deepsense Apr 8, 2024
5017e6c
remove unused import
micmarty-deepsense Apr 8, 2024
f0def7b
adjust hashes in test
micmarty-deepsense Apr 8, 2024
560d2cd
fix missing argument to id_to_hash
micmarty-deepsense Apr 8, 2024
bde2907
update hash in test
micmarty-deepsense Apr 8, 2024
ba28243
update email tests
micmarty-deepsense Apr 8, 2024
ca5861c
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 8, 2024
2a9f0b8
fix a bug: sharing one memory address
micmarty-deepsense Apr 8, 2024
21914f2
refactor assign_and_map_hash_ids according to review suggestions
micmarty-deepsense Apr 8, 2024
6e80fb4
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 8, 2024
3916d0d
make pytest.mark.parametrize body compact
micmarty-deepsense Apr 8, 2024
3250abe
add 2 example docs and adjust related tests
micmarty-deepsense Apr 8, 2024
fe7fa00
move assign_hash_ids from test_utils to unit_utils
micmarty-deepsense Apr 8, 2024
66c3f23
apply other minor review suggestions
micmarty-deepsense Apr 8, 2024
8fa2666
remove unused import
micmarty-deepsense Apr 8, 2024
aab6bad
add pdf with duplicate page and refactor related test
micmarty-deepsense Apr 8, 2024
4fd7d62
quickfix importing assign_hash_ids
micmarty-deepsense Apr 8, 2024
90a1880
remove unused imports
micmarty-deepsense Apr 8, 2024
ae2cd30
get rid of List type
micmarty-deepsense Apr 8, 2024
c0d1bb1
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 8, 2024
bdc0c3b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 8, 2024
461f9b9
remove unused imports
micmarty-deepsense Apr 8, 2024
624ba1a
remove unused imports
micmarty-deepsense Apr 8, 2024
8e100f7
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 8, 2024
4ef4821
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 9, 2024
a3e5d60
remove unused argument
micmarty-deepsense Apr 9, 2024
562df25
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 9, 2024
b17c80f
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 18, 2024
89c7f27
clean up after resolving conflicts
micmarty-deepsense Apr 18, 2024
e2f4c3c
update hash ids for test
micmarty-deepsense Apr 18, 2024
3c07881
use seq_on_page in hash calculation
micmarty-deepsense Apr 18, 2024
e0b02ec
support for starting_page_number in ODT files
micmarty-deepsense Apr 18, 2024
cf45f7a
update hashes for doc and docx tests, remove redundant assertion
micmarty-deepsense Apr 18, 2024
ff2fd2f
include filename in hash calculation
micmarty-deepsense Apr 18, 2024
eeb1ea6
fix bug of sharing one metadata object by multiple elements for msg f…
micmarty-deepsense Apr 18, 2024
33ae279
update hashes in tests and refactor them slightly
micmarty-deepsense Apr 18, 2024
f6ec6a0
adjust pptx test cases
micmarty-deepsense Apr 18, 2024
9f4aded
update hashes for staging tests
micmarty-deepsense Apr 18, 2024
3dad5ae
update hashes for PDF tests
micmarty-deepsense Apr 18, 2024
4463830
fix line too long
micmarty-deepsense Apr 18, 2024
a93a156
reformat elements.py and add more comments
micmarty-deepsense Apr 18, 2024
c8b7c66
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 18, 2024
d8e9a2f
update changelog and version
micmarty-deepsense Apr 18, 2024
ee0392a
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 18, 2024
e578aba
update overview.rst
micmarty-deepsense Apr 18, 2024
b9baf8a
make tests more compact
micmarty-deepsense Apr 18, 2024
a2227fc
update html hashes
micmarty-deepsense Apr 18, 2024
b22784c
remove redundant uniqueness assertion
micmarty-deepsense Apr 18, 2024
68f93bd
update almost all hashes in spring-weather (there are still problemat…
micmarty-deepsense Apr 19, 2024
f50c13c
update hashes for spring-weather
micmarty-deepsense Apr 19, 2024
9ac682f
revert spring water example doc to original
micmarty-deepsense Apr 19, 2024
b10b3c1
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 19, 2024
1178ca0
assign hash ids when doing ingestion
micmarty-deepsense Apr 19, 2024
232c405
revert all changes to test_unstructured_ingest
micmarty-deepsense Apr 19, 2024
973dc29
Better element IDs - deterministic and document-unique hashes <- Inge…
ryannikolaidis Apr 19, 2024
3280625
increase num of expected files in local.sh
micmarty-deepsense Apr 19, 2024
22991c8
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
micmarty-deepsense Apr 19, 2024
cc8be15
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 22, 2024
779db46
update version
micmarty-deepsense Apr 22, 2024
af28f77
refactor 1 test in test_auto.py
micmarty-deepsense Apr 22, 2024
ca13902
remove changelog entry duplicate
micmarty-deepsense Apr 23, 2024
f4fd49a
Merge branch 'main' into CORE-3587/better-element-ids
micmarty-deepsense Apr 23, 2024
4a0b27d
Merge branch 'main' into CORE-3587/better-element-ids
cragwolfe Apr 24, 2024
34 changes: 34 additions & 0 deletions example-docs/fake-html-duplicates.html
@@ -0,0 +1,34 @@
<!DOCTYPE html>
<html>

<body>
<h1>First Page</h1>
<p>Some text</p>
<p>Some text</p>
<p>Some text</p>

<hr>

<h3>Second Page</h3>
<p>Some text</p>
<p>Some text</p>
<p>Some text</p>
<p>Some text</p>
<table>
<tr>
<th>Column 1</th>
<th>Column 2</th>
</tr>
<tr>
<td>Row 1, Cell 1</td>
<td>Row 1, Cell 2</td>
</tr>
<tr>
<td>Row 2, Cell 1</td>
<td>Row 2, Cell 2</td>
</tr>
</table>

</body>

</html>
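
Note on the fixture above: it repeats the same paragraph seven times, which is the case a text-only ID cannot handle. A minimal sketch of the collision follows. `text_only_id` is a hypothetical helper, and the SHA-256 digest truncated to 32 hex characters is an assumption made for illustration, not a claim about the library's exact scheme.

```python
import hashlib


def text_only_id(text: str) -> str:
    """Hypothetical helper: derive an ID from the element text alone."""
    # Assumption: SHA-256 truncated to 32 hex characters.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


# The fixture contains "Some text" three times on the first page
# and four times on the second page.
paragraphs = ["Some text"] * 7
ids = [text_only_id(text) for text in paragraphs]

# Identical text always yields the identical ID, so all seven collide.
unique_count = len(set(ids))
```

Here `unique_count` is 1 even though the document holds seven separate paragraphs, which is why the ID needs some positional input in addition to the text.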
47 changes: 46 additions & 1 deletion test_unstructured/documents/test_elements.py
@@ -33,7 +33,52 @@

 def test_text_id():
     text_element = Text(text="hello there!")
-    assert text_element.id == "c69509590d81db2f37f9d75480c8efed"
+    assert text_element.id == "038d47b4730901555da92f924541b5ce"
+
+
+def test_text_id_recalculates_id_deterministically_on_metadata_update():
+    original_page = 1
+    text_element = Text(text="hello there!", metadata=ElementMetadata(page_number=original_page))
+    assert text_element.id == "6203bf4b51bd9139cbb1e9ab45636d5c"
+    id_before_update = text_element.id
+
+    text_element.metadata.page_number = 2
+    assert text_element.id == "6759346774e1cc088bc663d3be7b738f"
+
+    text_element.metadata.page_number = original_page
+    assert text_element.id == id_before_update
+
+
+def test_text_id_same_page():
+    text_element_1 = Text(text="hello there!", metadata=ElementMetadata(page_number=1))
+    text_element_2 = Text(text="hello there!", metadata=ElementMetadata(page_number=1))
+    assert text_element_1.id == text_element_2.id
+
+
+def test_text_id_same_page_different_same_index_on_page():
+    text_element_1 = Text(
+        text="hello there!", metadata=ElementMetadata(page_number=1, index_on_page=0)
+    )
+    text_element_2 = Text(
+        text="hello there!", metadata=ElementMetadata(page_number=1, index_on_page=0)
+    )
+    assert text_element_1.id == text_element_2.id
+
+
+def test_text_id_same_page_different_index_on_page():
+    text_element_1 = Text(
+        text="hello there!", metadata=ElementMetadata(page_number=1, index_on_page=0)
+    )
+    text_element_2 = Text(
+        text="hello there!", metadata=ElementMetadata(page_number=1, index_on_page=1)
+    )
+    assert text_element_1.id != text_element_2.id
+
+
+def test_text_id_different_pages():
+    text_element_1 = Text(text="hello there!", metadata=ElementMetadata(page_number=1))
+    text_element_2 = Text(text="hello there!", metadata=ElementMetadata(page_number=2))
+    assert text_element_1.id != text_element_2.id


 def test_text_uuid():
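
Note on the tests above: they expect the ID to depend on the text plus `page_number` and `index_on_page`, and to be reproducible for equal inputs. One way such an ID could be computed is sketched below. `position_aware_id`, the `|` separator, and the truncated SHA-256 digest are all assumptions made for illustration. The hash values asserted in the tests come from the PR's own implementation and will not match this sketch.

```python
import hashlib
from typing import Optional


def position_aware_id(
    text: str,
    page_number: Optional[int] = None,
    index_on_page: Optional[int] = None,
) -> str:
    """Hypothetical helper: hash the text together with its position."""
    # The separator keeps the fields apart, so page 1 / index 12
    # cannot be confused with page 11 / index 2.
    payload = "|".join([text, str(page_number), str(index_on_page)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


same_a = position_aware_id("hello there!", page_number=1, index_on_page=0)
same_b = position_aware_id("hello there!", page_number=1, index_on_page=0)
other_index = position_aware_id("hello there!", page_number=1, index_on_page=1)
other_page = position_aware_id("hello there!", page_number=2, index_on_page=0)
```

Equal inputs give equal IDs (`same_a == same_b`), while a different index or page gives a different ID. Because the function has no hidden state, setting the page back to its original value reproduces the original ID, which is the behavior the metadata-update test checks.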
166 changes: 166 additions & 0 deletions test_unstructured/partition/pdf_image/test_pdf.py
@@ -3,12 +3,14 @@
import math
import os
import tempfile
from pathlib import Path
from tempfile import SpooledTemporaryFile
from unittest import mock

import pytest
from pdf2image.exceptions import PDFPageCountError
from PIL import Image
from pypdf import PdfReader, PdfWriter
from unstructured_inference.inference import layout

from test_unstructured.unit_utils import assert_round_trips_through_JSON, example_doc_path
@@ -32,6 +34,9 @@
    UNSTRUCTURED_INCLUDE_DEBUG_METADATA,
    PartitionStrategy,
)
from unstructured.staging.base import (
    convert_to_dataframe,
)


class MockResponse:
@@ -1230,3 +1235,164 @@ def test_partition_pdf_always_keep_all_image_elements(
    )
    image_elements = [el for el in elements if el.category == ElementType.IMAGE]
    assert len(image_elements) == 3


@pytest.fixture
def mock_pdfwriter_with_duplicate_pages():
    test_file = Path("example-docs") / "DA-1p.pdf"
    original_pdf = PdfReader(test_file)

    writer = PdfWriter()
    first_page = original_pdf.pages[0]

    # Duplicate the first page
    writer.add_page(first_page)
    writer.add_page(first_page)
    return writer


@pytest.fixture
def expected_element_ids_for_fast_strategy():
    return [
"bcb52980d2fa3de4b5c2ae7cc059e587",
"54f37f5855a683c59c7e5e462e8bf835",
"80e72e579e82c5e8374f4bae04afbee0",
"f64e1c2f5188d3484d07b8ee56c810b5",
"219a7245a961b2f31014c2e003325679",
"443b33e809f7a87878a531ac67d9fab4",
"beae05a2dea26cb4566cbb1b0ccc5283",
"7f3e7aeb75daa69802da98b2c87d4ef8",
"4465cf24225d06d0def7def21dc8f223",
"f1cd29d4f771feb9b3927d86f7604b4c",
"b1d948cd5f08332e48f8175e58d15341",
"47f7c90be21c7e072ade41aa0463d48e",
"5f3b10d86b63dd2d0c9219a1f86cc3fb",
"dabf78977a2f19d7511f9febf87634bb",
"8c4e7756efdf3ee18b7b2a8ed9554f72",
"f373da11f92112fd0980ceb225cba85a",
"16baf3ece0b01d18adcaf6cb09ef9054",
"3546b0401eacd629177b9dd53a30a8a7",
"0b60e2e272de96c80524bb24d19272c5",
"9300ec3b150913be07c6d2773aa317ea",
"e0f0640fdc9ea6e0f16a8c5932dc53ab",
"e50df5fd6aec3d1c3d8ad43dc780f0ba",
"72b4cfbee0cfe3ca1e2026ab0ef895f7",
"ef4b5abfcc9837004e238c2095318191",
"7b785f7841ba51dbabc14a0f15e75e9f",
"d9d7e2cfe23c90d2bd4c0041ce227199",
"774c5160e8a6fcf41c7e3777c32f26de",
"725b3a36665e5dcad3e83668d120e711",
]


@pytest.fixture
def expected_element_ids_for_hi_res_strategy():
    return [
"d2168dfb5101ed87783a1c498499240b",
"a97d8cfd2077e256e0afb3ddec350d0c",
"9bdbcdf919bb3e24f09b600138fd6a87",
"41d61209e7d8fc4f8b0b8e079b92dc02",
"c1c36d770736b12ebc9736051fee9406",
"34d45558e7858e85a2022a7249e0a644",
"9de30506d148c4c197a59424c0867651",
"d37bfdea7b6b343add8d9c05f79da8e5",
"460efe5a8373cc94ba55b5307b8f77ed",
"1b96e2e0574708389acb7845185e9bc7",
"cd6c228f41b85b46f108606d409bf6c1",
"4fe254276aa49edbb2a7f55308079a94",
"206aff2421a3ed7182caa1a8bfdaadba",
"c9f814999431b8e3334aec004ee87c5f",
"e5925dab23eca6183271c1332bd0ee9b",
"cf09e049df542adfcbd00715bb80dc29",
"c0acb783decbb19e970c4b883baa7f78",
"099388b6f77882a27599684b9ed8d3e5",
"beecf51bf88541e82a300818ae7044d2",
"f42e0b40d0dce09e5f9869baeaa80eb1",
"770479f626902e3ab28285b23b1a2d9e",
"4012272115da7453568606c2c0e2a0f5",
"cd6da2d1f5bb8a92865b7c6b9ee91fc9",
"31694515c44340f70f5ad182cc30fcf8",
"368b23047d1a5118f7ee3aba09e898af",
"667cf233836b96608bba0963525fa93e",
"009c68131212cebb6f320222c703c36e",
"f55c0b2fbe7947b502e83774d9605e77",
]


@pytest.fixture
def expected_element_ids_for_ocr_strategy():
    return [
"bcb52980d2fa3de4b5c2ae7cc059e587",
"54f37f5855a683c59c7e5e462e8bf835",
"80e72e579e82c5e8374f4bae04afbee0",
"f64e1c2f5188d3484d07b8ee56c810b5",
"219a7245a961b2f31014c2e003325679",
"443b33e809f7a87878a531ac67d9fab4",
"beae05a2dea26cb4566cbb1b0ccc5283",
"7f3e7aeb75daa69802da98b2c87d4ef8",
"4465cf24225d06d0def7def21dc8f223",
"f1cd29d4f771feb9b3927d86f7604b4c",
"b1d948cd5f08332e48f8175e58d15341",
"47f7c90be21c7e072ade41aa0463d48e",
"5f3b10d86b63dd2d0c9219a1f86cc3fb",
"8c4e7756efdf3ee18b7b2a8ed9554f72",
"f373da11f92112fd0980ceb225cba85a",
"16baf3ece0b01d18adcaf6cb09ef9054",
"3546b0401eacd629177b9dd53a30a8a7",
"0b60e2e272de96c80524bb24d19272c5",
"9300ec3b150913be07c6d2773aa317ea",
"e0f0640fdc9ea6e0f16a8c5932dc53ab",
"e50df5fd6aec3d1c3d8ad43dc780f0ba",
"72b4cfbee0cfe3ca1e2026ab0ef895f7",
"ef4b5abfcc9837004e238c2095318191",
"7b785f7841ba51dbabc14a0f15e75e9f",
"d9d7e2cfe23c90d2bd4c0041ce227199",
"774c5160e8a6fcf41c7e3777c32f26de",
]


@pytest.fixture
def expected_ids(request):
    return request.getfixturevalue(request.param)


@pytest.mark.parametrize(
    ("strategy", "expected_ids"),
    [
        (PartitionStrategy.FAST, "expected_element_ids_for_fast_strategy"),
        (PartitionStrategy.HI_RES, "expected_element_ids_for_hi_res_strategy"),
        (PartitionStrategy.OCR_ONLY, "expected_element_ids_for_ocr_strategy"),
    ],
    indirect=["expected_ids"],
)
def test_unique_and_deterministic_element_ids(
    strategy, expected_ids, mock_pdfwriter_with_duplicate_pages, tmpdir
):
    # GIVEN
    pdf_path = Path(tmpdir) / "mock.pdf"
    mock_pdfwriter_with_duplicate_pages.write(pdf_path)

    # WHEN
    elements = pdf.partition_pdf(pdf_path, strategy=strategy)
    elements_df = convert_to_dataframe(elements)

    # THEN
    duplicated_text_example = "MAIN GAME"
    element_repetitions = elements_df["text"].str.count(duplicated_text_example).sum()

    # Ensure fixture is working as expected
    assert (
        element_repetitions == 2
    ), f"Element {duplicated_text_example} is supposed to be duplicated"
    assert {element.metadata.page_number for element in elements} == {
        1,
        2,
    }, "Page numbers are incorrect"

    # Expect uniqueness
    assert elements_df["element_id"].is_unique, "Element IDs are not unique"

    # Expect determinism
    assert all(
        element.id == expected_id for element, expected_id in zip(elements, expected_ids)
    ), "Element IDs do not match expected IDs"
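
Note on the determinism assertion above: `zip()` stops at the shorter of its inputs, so a check built on `all(... zip(...))` passes even when the two lists differ in length. The fixtures above list 28 IDs for the fast and hi_res strategies and 26 for ocr_only, so a length mismatch would go unnoticed if a strategy ever returned more elements than its fixture lists. The sketch below uses made-up IDs to show the difference. It illustrates the pitfall and is not a change that was made in this PR.

```python
# Made-up IDs standing in for partitioner output and a fixture.
actual_ids = ["aaa", "bbb", "ccc"]
expected_ids = ["aaa", "bbb"]  # one entry short

# zip() stops at the shorter input, so the extra "ccc" is never compared.
zip_based_check = all(a == e for a, e in zip(actual_ids, expected_ids))

# Comparing the lists directly also compares their lengths.
list_based_check = actual_ids == expected_ids

# Uniqueness without pandas: a set drops duplicates.
ids_are_unique = len(set(actual_ids)) == len(actual_ids)
```

`zip_based_check` is `True` while `list_based_check` is `False`. The HTML determinism test in this same diff compares the full lists, which avoids the problem.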
43 changes: 43 additions & 0 deletions test_unstructured/partition/test_html_partition.py
@@ -13,6 +13,9 @@
from unstructured.documents.elements import EmailAddress, ListItem, NarrativeText, Table, Title
from unstructured.documents.html import HTMLTitle
from unstructured.partition.html import partition_html
from unstructured.staging.base import (
    convert_to_dataframe,
)

DIRECTORY = pathlib.Path(__file__).parent.resolve()

@@ -723,3 +726,43 @@ def test_partition_html_with_table_without_tbody(tag: str, expected: str):
    )
    partitions = partition_html(text=table_html)
    assert partitions[0].metadata.text_as_html == expected


@pytest.fixture
def partitioned_html_with_duplicate_elements():
    filename = "example-docs/fake-html-duplicates.html"
    with open(filename) as f:
        elements = partition_html(file=f)
    return elements


def test_each_element_has_page_number_and_index_metadata(partitioned_html_with_duplicate_elements):
    for element in partitioned_html_with_duplicate_elements:
        assert element.metadata.page_number is not None
        assert element.metadata.index_on_page is not None
Collaborator

This test seems a little weak. Nothing verifies that page_number is an int, or that index_on_page isn't a str, for example. I think it would be better as something like:

page_index_pairs = [
    (e.metadata.page_number, e.metadata.index_on_page)
    for e in partition_html("example-docs/fake-html-duplicates.html")
]
assert page_index_pairs == [(1, 0), (1, 1), ...]

Also, this gives the reader a concrete, intuitive sense of how these values progress along an element stream.

Contributor Author


👍 you're right, my implementation was quite naive and needs to be refactored the way you proposed

Contributor Author


outdated, please resolve
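
Note on the review suggestion above: the sketch below shows what the suggested `(page_number, index_on_page)` pairs could look like for the HTML fixture. `assign_index_on_page` is a hypothetical helper, not part of the library. The page layout (four elements before the `<hr>`, six after it, with `<hr>` starting a new page) is an assumption read off the fixture and its headings.

```python
from collections import defaultdict


def assign_index_on_page(page_numbers):
    """Hypothetical helper: pair each page number with a per-page running index."""
    counters = defaultdict(int)
    pairs = []
    for page in page_numbers:
        pairs.append((page, counters[page]))
        counters[page] += 1
    return pairs


# Assumed layout of the fixture: heading + 3 paragraphs on page 1,
# heading + 4 paragraphs + 1 table on page 2.
page_numbers = [1] * 4 + [2] * 6
pairs = assign_index_on_page(page_numbers)
```

Under these assumptions the pairs run from `(1, 0)` to `(1, 3)` and then from `(2, 0)` to `(2, 5)`. Every pair is distinct, which is what lets otherwise identical "Some text" paragraphs receive different IDs.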



def test_all_element_ids_are_unique(partitioned_html_with_duplicate_elements):
    assert convert_to_dataframe(partitioned_html_with_duplicate_elements).element_id.is_unique


@pytest.mark.parametrize(
    "expected_ids",
    [
        [
            "e5fd0a829b734744f52ae195859f7741",
            "a68c75c49dde8b76d6d7208ad6ec9b6a",
            "98a69145a0223bc7f82a75a11293ba3d",
            "f36ded43d030fb5c6146d5961f767a85",
            "d6ca7824f33202a7a86c02c979b42c91",
            "a677501bee476d34f96af566d648531e",
            "7583e5d08909a23e8214f3d1ee87e50b",
            "490190ffe8519e8cd8b6d750d334397e",
            "dbf7bcdc973922a41c8c51505b873340",
            "346cc2595ed5c0b34c7d818b9cc0c891",
        ],
    ],
)
def test_element_ids_are_deterministic(partitioned_html_with_duplicate_elements, expected_ids):
    """Test that element IDs are deterministic and match the expected IDs."""
    assert [element.id for element in partitioned_html_with_duplicate_elements] == expected_ids