Starting page number handling for split pdf page #55

Merged: 9 commits, May 1, 2024

Conversation

mpolomdeepsense
Contributor

@mpolomdeepsense mpolomdeepsense commented Apr 26, 2024

Only the SplitPdfHook.ts, SplitPdfHook.test.ts, and overlay_client.yaml files were modified by hand. The rest were auto-generated.

To run the integration tests, first start unstructured-api on port 8000.
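
Before launching the integration tests, it can help to confirm that something is actually listening on port 8000. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical and not part of this repository, and it only checks that a TCP connection succeeds, not that the service behind it is unstructured-api.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it says nothing about which service is listening.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on localhost:8000")
    else:
        print("Nothing is listening on localhost:8000; start unstructured-api first")
```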

@mpolomdeepsense mpolomdeepsense marked this pull request as ready for review April 26, 2024 15:00
awalker4 added a commit to Unstructured-IO/unstructured-api that referenced this pull request Apr 30, 2024
…tting (#400)

This PR enables the Python and JS clients to partition PDF pages
independently after splitting them client-side
(`split_pdf_page=True`). Splitting is also supported by the API itself,
which is useful when users send requests without our
dedicated clients.
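
To illustrate the idea of partitioning pages independently, here is a minimal sketch of the bookkeeping a client might do. It assumes each single-page request carries a starting page number so that returned elements are numbered relative to the original document. The names `PageRequest`, `plan_page_requests`, and `merge_in_page_order` are hypothetical and are not taken from either client.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageRequest:
    """Hypothetical description of one single-page request."""
    page_index: int            # zero-based index of the page in the original PDF
    starting_page_number: int  # one-based page number the elements should carry


def plan_page_requests(total_pages: int, first_page_number: int = 1) -> List[PageRequest]:
    """Plan one request per page, each knowing its position in the original PDF."""
    return [
        PageRequest(page_index=i, starting_page_number=first_page_number + i)
        for i in range(total_pages)
    ]


def merge_in_page_order(responses: Dict[int, List[dict]]) -> List[dict]:
    """Concatenate per-page element lists, keyed by starting page number.

    Responses may arrive in any order when requests run concurrently,
    so the keys are sorted before merging.
    """
    merged: List[dict] = []
    for start in sorted(responses):
        merged.extend(responses[start])
    return merged
```

Keying the responses by starting page number keeps the merged output in document order regardless of which request finishes first.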

Related to: 
* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:
* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both
merged.**

## How to test it locally
Unfortunately, the `pytest` test is not fully implemented and currently fails; see
[this
comment](#400 (comment))
1. Clone the Python client and check out this PR:
Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4. `python <script-below>.py`

```python
import os  # needed for os.environ below

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False, # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)
resp = s.general.partition(req)
ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]
# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```
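
The two assertions in the script can also be written without spelling out the full page-number list. The sketch below is one way to do that, assuming elements are dictionaries shaped like those in the script (`element_id` and `metadata.page_number`). The function name `check_partition_result` is hypothetical.

```python
from collections import Counter
from typing import List


def check_partition_result(elements: List[dict], expected_pages: int, elements_per_page: int) -> None:
    """Raise AssertionError if page numbers or element IDs look wrong."""
    page_numbers = [e["metadata"]["page_number"] for e in elements]
    ids = [e["element_id"] for e in elements]

    # Elements should come back in document order.
    assert page_numbers == sorted(page_numbers), "Elements are not in page order"

    # Every page from 1..expected_pages should contribute the same number of elements.
    expected_counts = {page: elements_per_page for page in range(1, expected_pages + 1)}
    assert dict(Counter(page_numbers)) == expected_counts, "Unexpected per-page element counts"

    # Duplicate pages must not produce duplicate element IDs.
    assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

For the sample PDF described in the script, the call would be `check_partition_result(resp.elements, expected_pages=3, elements_per_page=13)`.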

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
@awalker4 awalker4 merged commit f540594 into main May 1, 2024
@awalker4 awalker4 deleted the starting_page_number branch May 1, 2024 20:57