Preparing the foundation for better element IDs #2842

micmarty-deepsense · 2024-04-03T14:22:00Z

Part one of the issue described here: #2461

It does not change how the hashing algorithm works; it only reworks how IDs are assigned:

Element ID Design Principles

A partitioning function can assign only one of two available ID types to a returned element: a hash or a UUID.

All elements that are returned come with an ID, which is never None.

No matter which type of ID is used, it will always be in string format.

Partitioning a document returns elements with hashes as their default IDs.
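The principles above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: `SketchElement`, `sketch_partition`, and the `use_uuid` flag are hypothetical names, starting every element with a UUID is an assumption, and a truncated SHA-256 of the text is only a stand-in for the real hashing algorithm (which this PR leaves unchanged).

```python
import hashlib
import uuid


class SketchElement:
    """Hypothetical stand-in for an element; not the library's real class."""

    def __init__(self, text: str) -> None:
        self.text = text
        # Assumption: start with a UUID so the ID is a string and never None.
        self.id: str = str(uuid.uuid4())


def sketch_partition(texts: list[str], use_uuid: bool = False) -> list[SketchElement]:
    """Return elements whose IDs are hashes by default, or UUIDs on request."""
    elements = [SketchElement(text) for text in texts]
    if not use_uuid:
        for element in elements:
            # Stand-in hash: SHA-256 of the text, truncated to 32 hex characters.
            digest = hashlib.sha256(element.text.encode("utf-8")).hexdigest()
            element.id = digest[:32]
    return elements
```

With the default settings, partitioning the same texts twice yields the same IDs; with `use_uuid=True`, each call yields fresh IDs. In both cases every ID is a non-empty string.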

Big thanks to @scanny for explaining the current design and suggesting ways to do it right, especially with chunking.

Here's the next PR in line: #2673


scanny

Hi Mike, I'm a little confused. I thought this PR was to prepare for better element-ids, but not to actually implement them yet, and that the bit that changed the hash algorithm to include page_number and page_sequence_number was going to be separate (because that's what triggers all the ingest-test changes).

~~Did we change our mind on that?~~

scanny

Hmm, okay, never mind. I mistook something else for a change to the hash algorithm.

LGTM :)

Part two of: #2842

Main changes compared to part one:
* hash computation includes the element's sequence number on the page, the page number, the document filename, and the element's text
* there are more tests for deterministic behavior of IDs returned by partitioning functions, plus their uniqueness (guaranteed at the document level, and with high probability across multiple documents)

This PR addresses the following issue: #2461
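A hash built from those four inputs might look like the sketch below. It is only an illustration under stated assumptions: the function name `sketch_element_hash`, the JSON serialization of the inputs, the use of SHA-256, and the 32-character truncation are all hypothetical choices, not the algorithm the follow-up PR actually ships.

```python
import hashlib
import json
from typing import Optional


def sketch_element_hash(
    text: str,
    filename: Optional[str],
    page_number: Optional[int],
    sequence_number: int,
) -> str:
    """Hypothetical ID: hash of filename, page number, position on page, and text."""
    # Serializing as a JSON list keeps field boundaries unambiguous.
    payload = json.dumps([filename, page_number, sequence_number, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Two pages with identical content, as in a document with duplicated pages.
pages = [["Title", "Body"], ["Title", "Body"]]
ids = [
    sketch_element_hash(text, "doc.pdf", page_number, sequence_number)
    for page_number, page in enumerate(pages, start=1)
    for sequence_number, text in enumerate(page)
]
```

Because the page number and the position on the page feed into the hash, identical text on different pages gets different IDs, while repeating the computation with the same inputs returns the same IDs.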

…tting (#400)

This PR enables the Python and JS clients to partition PDF pages independently after splitting them on their side (`split_pdf_page=True`). Splitting is also supported by the API itself, which makes sense when users send their requests without using our dedicated clients.

Related to:
* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:
* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both merged.**

## How to test it locally

Unfortunately the `pytest` test is not fully implemented and currently fails (see the linked comment on #400).

1. Clone Python client and check out this PR: Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4. `python <script-below>.py`

```python
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# The API key is read from the environment; the server is the locally running web app.
s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"

with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False,  # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)

resp = s.general.partition(req)
ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]

# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1] * 13 + [2] * 13 + [3] * 13
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

---------

Co-authored-by: cragwolfe
Co-authored-by: Austin Walker

micmarty-deepsense added 3 commits April 3, 2024 15:37

modify default behavior of Element, Text, and Name class

639423a

move id initialization to Element

ee90096

refactor id assertions in test_elements.py

2d9a127

micmarty-deepsense self-assigned this Apr 3, 2024

add changelog entry

11c4041

micmarty-deepsense marked this pull request as ready for review April 3, 2024 15:00

micmarty-deepsense added 4 commits April 3, 2024 17:17

adjust email tests

20d7c2f

fix chunking

4aadf22

remove unnecessary enumeration and remove argument to id_to_hash

45973ee

remove unused import

3f51745

micmarty-deepsense force-pushed the mike/preparing-ground-for-better-element-ids branch from 10a3775 to 3f51745 Compare April 3, 2024 15:31

micmarty-deepsense had a problem deploying to ci April 3, 2024 15:33 — with GitHub Actions Failure

micmarty-deepsense had a problem deploying to ci April 3, 2024 15:33 — with GitHub Actions Error

quickfix support for | operand in 3.9

bc93f54

micmarty-deepsense temporarily deployed to ci April 3, 2024 19:21 — with GitHub Actions Inactive

micmarty-deepsense added 3 commits April 3, 2024 23:09

add design principles in overview.rst

e9a5dcf

fix staging test by using deterministic hashes

f36f76c

fix tests that were failing due to invalid text_as_html consolidation

71e70d7

micmarty-deepsense temporarily deployed to ci April 3, 2024 21:13 — with GitHub Actions Inactive

add empty lines

17d6585

micmarty-deepsense temporarily deployed to ci April 3, 2024 21:15 — with GitHub Actions Inactive

quickfix typo

511cc05

micmarty-deepsense temporarily deployed to ci April 3, 2024 21:21 — with GitHub Actions Inactive

micmarty-deepsense added 2 commits April 16, 2024 13:27

update tests and fix import errors

7eb3855

update elements tests

f5d46df

micmarty-deepsense force-pushed the mike/preparing-ground-for-better-element-ids branch from 9c7807f to f5d46df Compare April 16, 2024 11:27

update changelog and bump version

3808e85

micmarty-deepsense temporarily deployed to ci April 16, 2024 11:30 — with GitHub Actions Inactive

update Element docstring

97926a4

micmarty-deepsense temporarily deployed to ci April 16, 2024 11:33 — with GitHub Actions Inactive

Preparing the foundation for better element IDs <- Ingest test fixtures update (#2895)

fb509ab

This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: micmarty-deepsense

micmarty-deepsense temporarily deployed to ci April 16, 2024 14:46 — with GitHub Actions Inactive

micmarty-deepsense added 4 commits April 16, 2024 21:05

update docstrings

253927f

more verbose email elements tests

f219ae7

reformat to make a statement more compact

d5b0ac5

Merge branch 'mike/preparing-ground-for-better-element-ids' of https://github.com/Unstructured-IO/unstructured into mike/preparing-ground-for-better-element-ids

25d24e7

micmarty-deepsense temporarily deployed to ci April 16, 2024 19:08 — with GitHub Actions Inactive

reformat comments and assertion messages for 2 tests

50f2183

micmarty-deepsense temporarily deployed to ci April 16, 2024 19:14 — with GitHub Actions Inactive

scanny self-requested a review April 16, 2024 20:55

scanny requested changes Apr 16, 2024


scanny approved these changes Apr 16, 2024


scanny added this pull request to the merge queue Apr 16, 2024

Merged via the queue into main with commit 001fa17 Apr 16, 2024
42 checks passed

scanny deleted the mike/preparing-ground-for-better-element-ids branch April 16, 2024 21:46
