Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preparing the foundation for better element IDs #2842

Merged
merged 59 commits into from
Apr 16, 2024

Conversation

micmarty-deepsense
Copy link
Contributor

@micmarty-deepsense micmarty-deepsense commented Apr 3, 2024

Part one of the issue described here: #2461

It does not change how hashing algorithm works, just reworks how ids are assigned:

Element ID Design Principles

  1. A partitioning function can assign only one of two available ID types to a returned element: a hash or UUID.
  2. All elements that are returned come with an ID, which is never None.
  3. No matter which type of ID is used, it will always be in string format.
  4. Partitioning a document returns elements with hashes as their default IDs.

Big thanks to @scanny for explaining the current design and suggesting ways to do it right, especially with chunking.

Here's the next PR in line: #2673

@micmarty-deepsense micmarty-deepsense self-assigned this Apr 3, 2024
@micmarty-deepsense micmarty-deepsense marked this pull request as ready for review April 3, 2024 15:00
@micmarty-deepsense micmarty-deepsense force-pushed the mike/preparing-ground-for-better-element-ids branch from 9c7807f to f5d46df Compare April 16, 2024 11:27
…es update (#2895)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: micmarty-deepsense <[email protected]>
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Mike, I'm a little confused. I thought this PR was to prepare for better element-ids, but not to actually implement them yet. That the bit that changed the hash algorithm to include page_number and page_sequence_number was going to be separate (because that's what triggers all the ingest-test changes).

Did we change our mind on that?

Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, okay, nevermind. I mistook something else for a change to the hash algorithm.

LGTM :)

@scanny scanny added this pull request to the merge queue Apr 16, 2024
Merged via the queue into main with commit 001fa17 Apr 16, 2024
42 checks passed
@scanny scanny deleted the mike/preparing-ground-for-better-element-ids branch April 16, 2024 21:46
cragwolfe pushed a commit that referenced this pull request Apr 24, 2024
Part two of: #2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
#2461
awalker4 added a commit to Unstructured-IO/unstructured-api that referenced this pull request Apr 30, 2024
…tting (#400)

This PR enables the Python and JS clients to partition PDF pages
independently after splitting them on their side
(`split_pdf_page=True`). Splitting is also supported by API itself -
this makes sense when users send their requests without using our
dedicated clients.

Related to: 
* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:
* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both
merged.**

## How to test it locally
Unfortunately the `pytest` test is not fully implemented, it fails - see
[this
comment](#400 (comment))
1. Clone Python client and checkout to this PR:
Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4.  `python <script-below>.py`

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False, # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)
resp = s.general.partition(req)
ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]
# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants