Starting page number handling for split pdf page #55

mpolomdeepsense · 2024-04-26T07:52:56Z

Only the SplitPdfHook.ts, SplitPdfHook.test.ts, and overlay_client.yaml files were modified by hand. The rest were auto-generated.

To run the integration tests, first start unstructured-api on port 8000.
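Because the integration tests depend on a local server, a quick reachability check can save a confusing test failure. The following is a minimal sketch, not part of this PR: the helper name `api_is_listening` is hypothetical, and it only confirms that something accepts TCP connections on the port, not that it is actually unstructured-api.

```python
import socket


def api_is_listening(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper: it does not verify that the listener is unstructured-api.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if not api_is_listening():
        raise SystemExit("Nothing is listening on localhost:8000; start unstructured-api first.")
    print("Port 8000 is reachable; the integration tests can be run.")
```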


…tting (#400)

This PR enables the Python and JS clients to partition PDF pages independently after splitting them on their side (`split_pdf_page=True`). Splitting is also supported by the API itself, which makes sense when users send their requests without using our dedicated clients.

Related to:
* Unstructured-IO/unstructured#2842
* Unstructured-IO/unstructured#2673

It should be merged before these:
* Unstructured-IO/unstructured-js-client#55
* Unstructured-IO/unstructured-python-client#72

**The tests for this PR won't pass until the related PRs are both merged.**

## How to test it locally

Unfortunately the `pytest` test is not fully implemented and fails; see [this comment](#400 (comment)).

1. Clone Python client and checkout to this PR: Unstructured-IO/unstructured-js-client#55
2. `cd unstructured-client; pip install --editable .`
3. `make run-web-app`
4. `python <script-below>.py`

```python
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"

with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False,  # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: Unstructured-IO/unstructured-js-client#55
)

resp = s.general.partition(req)
ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]

# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
```

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
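The page-number assertion in the script above only holds if each independently partitioned chunk is told where its pages sit in the original document. The sketch below illustrates that bookkeeping under stated assumptions: `PageChunk` and `plan_chunks` are hypothetical names, not the `SplitPdfHook` implementation, and it assumes a caller-supplied `starting_page_number` is simply offset by each chunk's position.

```python
from typing import List, NamedTuple


class PageChunk(NamedTuple):
    """Hypothetical description of one split request."""
    first_page_index: int      # 0-based index of the chunk's first page in the source PDF
    page_count: int            # number of pages carried by this chunk
    starting_page_number: int  # value sent with the chunk so page numbers stay document-wide


def plan_chunks(total_pages: int, pages_per_chunk: int, starting_page_number: int = 1) -> List[PageChunk]:
    """Split `total_pages` into consecutive chunks and compute each chunk's page offset."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk must be >= 1")
    chunks = []
    for first in range(0, total_pages, pages_per_chunk):
        count = min(pages_per_chunk, total_pages - first)  # the last chunk may be shorter
        # Assumption: the caller's starting page number is shifted by the chunk's position.
        chunks.append(PageChunk(first, count, starting_page_number + first))
    return chunks


if __name__ == "__main__":
    # A 3-page PDF split one page per request, as in the sample document above.
    for chunk in plan_chunks(total_pages=3, pages_per_chunk=1):
        print(chunk)
```

With one page per chunk and the default start of 1, the three chunks carry starting page numbers 1, 2, and 3, which matches the per-page numbering the test script expects.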

mpolomdeepsense added 8 commits April 26, 2024 09:50

Added starting_page_number parameter handling for split pdf hook; Fix node maxEventListeners warning;

cdae1ea

SplitPdfHook integration tests

6029fba

Readme parallel limit update

f3e983f

Documentation

8eca866

Test fix

e5e375d

temporary starting_page_number overlay addition

f23ca53

Auto-generated code for starting_page_number parameter

60e63bb

Fix starting_page_number getter

46d1b63

mpolomdeepsense marked this pull request as ready for review April 26, 2024 15:00

Remove page_number exclude from split_pdf_page integration test

284cd72

mpolomdeepsense requested a review from awalker4 April 26, 2024 15:13

micmarty-deepsense mentioned this pull request Apr 26, 2024

Support for starting_page_number parameter when doing PDF page splitting Unstructured-IO/unstructured-api#400

Merged

awalker4 approved these changes May 1, 2024


awalker4 merged commit f540594 into main May 1, 2024

awalker4 deleted the starting_page_number branch May 1, 2024 20:57
