bug/coroutine-await #149

Open
tluthra opened this issue Aug 8, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@tluthra

tluthra commented Aug 8, 2024

Describe the bug
I'm currently using the open source version of unstructured and getting an exception saying a coroutine is being awaited already. This causes the parse to fail.

To Reproduce
It seems to happen intermittently, and I'm not sure how to reproduce it. It's not tied to specific files: if I retry them, they succeed. I am putting a lot of files and data through the system, but nothing it shouldn't be able to handle.

Expected behavior
Not get this error.

Screenshots
N/A

Environment Info
I'm pulling downloads.unstructured.io/unstructured-io/unstructured-api:latest as my container and just running it.
Also running unstructured-client = "^0.23.8" on my client side.

Additional context
There are also a few warnings related to coroutines showing up:

[2024-08-08 17:10:50,239: WARNING/MainProcess] /home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/unstructured_client/_hooks/custom/split_pdf_hook.py:202: RuntimeWarning: coroutine 'SplitPdfHook.before_request.<locals>.call_api_partial' was never awaited
  self.coroutines_to_execute[operation_id] = []
[2024-08-08 17:47:48,365: WARNING/MainProcess] Traceback (most recent call last):
[2024-08-08 17:47:48,365: WARNING/MainProcess]   File "<string>", line 1, in <lambda>
[2024-08-08 17:47:48,365: WARNING/MainProcess] KeyError
[2024-08-08 17:47:48,365: WARNING/MainProcess] :
[2024-08-08 17:47:48,365: WARNING/MainProcess] '__import__'
[2024-08-08 17:47:48,366: WARNING/MainProcess] Exception ignored in:
[2024-08-08 17:47:48,366: WARNING/MainProcess] <coroutine object SplitPdfHook.before_request.<locals>.call_api_partial at 0x7f4753372480> 
@tluthra tluthra added the bug Something isn't working label Aug 8, 2024
@awalker4 awalker4 transferred this issue from Unstructured-IO/unstructured Aug 8, 2024
@tluthra
Author

tluthra commented Aug 8, 2024

[2024-08-08 22:45:35,624: ERROR/MainProcess] Task exception was never retrieved
future: <Task finished name='Task-189' coro=<_wrapped_create_task_py37.<locals>.traced_coro() done, defined at /home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/ddtrace/contrib/asyncio/patch.py:47> exception=RuntimeError('coroutine is being awaited already')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/ddtrace/contrib/asyncio/patch.py", line 50, in traced_coro
    return await coro
           ^^^^^^^^^^
  File "/home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/unstructured_client/_hooks/custom/split_pdf_hook.py", line 49, in _order_keeper
    response = await coro
               ^^^^^^^^^^
RuntimeError: coroutine is being awaited already
[2024-08-08 22:45:35,639: ERROR/MainProcess] Task exception was never retrieved
future: <Task finished name='Task-188' coro=<_wrapped_create_task_py37.<locals>.traced_coro() done, defined at /home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/ddtrace/contrib/asyncio/patch.py:47> exception=RuntimeError('coroutine is being awaited already')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/ddtrace/contrib/asyncio/patch.py", line 50, in traced_coro
    return await coro
           ^^^^^^^^^^
  File "/home/appuser/.cache/pypoetry/virtualenvs/app-9TtSrW0h-py3.11/lib/python3.11/site-packages/unstructured_client/_hooks/custom/split_pdf_hook.py", line 49, in _order_keeper
    response = await coro
               ^^^^^^^^^^
RuntimeError: coroutine is being awaited already

This is a fuller exception stack that I'm getting.
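
For readers unfamiliar with this message: Python raises `RuntimeError: coroutine is being awaited already` when a coroutine object that is already suspended inside one awaiter gets awaited a second time. The sketch below reproduces the message in isolation using only the standard library. It is an illustration of the general mechanism, not the SDK's code, and it does not claim to identify the root cause in `split_pdf_hook.py`; the names `call_api` and `order_keeper` are hypothetical stand-ins.

```python
import asyncio


async def call_api():
    # Stand-in for a network call; the sleep forces a real suspension point.
    await asyncio.sleep(0.01)
    return 1


async def order_keeper(coro):
    # Hypothetical wrapper: it simply awaits whatever coroutine object it is given.
    return await coro


async def main():
    shared = call_api()  # ONE coroutine object...
    # ...handed to TWO tasks. The first task starts it and suspends inside it;
    # the second then tries to await the same, already-suspended object.
    return await asyncio.gather(
        order_keeper(shared), order_keeper(shared), return_exceptions=True
    )


results = asyncio.run(main())
for r in results:
    print(repr(r))
```

Here the first task finishes normally and the second one fails with the same `RuntimeError` text seen in the traceback, which is consistent with a coroutine object being scheduled or awaited more than once somewhere in the pipeline.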

@awalker4
Collaborator

Hi @tluthra - just want to let you know that this is on our radar! The next release of the SDK will provide much better asyncio support, and I'm chasing down these sorts of errors now.

@tluthra
Author

tluthra commented Aug 21, 2024

@awalker4 appreciate the update. Until then, is there anything you can suggest as a workaround?

@awalker4
Collaborator

For now, you can sidestep the async code by running with split_pdf_page=False. If you are working with very large PDFs and you still need to split up the file before sending, the next thing to try is split_pdf_allow_failed=True. We've identified some bugs when this option is False (the setting that stops processing the doc if any of the underlying splits hits an error). It seems there's some asyncio cleanup that isn't happening in this path.
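
A minimal sketch of the two suggestions above, written as plain keyword dictionaries so it runs without the SDK installed. The option names `split_pdf_page` and `split_pdf_allow_failed` come from this comment; the helper `build_partition_kwargs` and the two constant names are hypothetical, and setting `split_pdf_page` to `True` explicitly in the second variant is an assumption made for clarity.

```python
# Workaround 1: skip client-side PDF splitting, and with it the async code path.
SIDESTEP_ASYNC = {"split_pdf_page": False}

# Workaround 2: keep splitting, but let the request continue if a split fails.
TOLERATE_FAILED_SPLITS = {"split_pdf_page": True, "split_pdf_allow_failed": True}


def build_partition_kwargs(base: dict, workaround: dict) -> dict:
    """Return a new kwargs dict with the workaround options layered on top."""
    return {**base, **workaround}


# Existing request options (a subset of those in the reporter's snippet).
base = {"strategy": "auto", "languages": ["eng"]}
kwargs = build_partition_kwargs(base, SIDESTEP_ASYNC)
print(kwargs)

# With the SDK installed, these would presumably be passed along the lines of
# shared.PartitionParameters(files=files, **kwargs); check the parameter names
# against the SDK version in use.
```

Trying `SIDESTEP_ASYNC` first matches the order suggested in the comment; `TOLERATE_FAILED_SPLITS` is the fallback for files large enough that splitting is still needed.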

@awalker4
Collaborator

Also, can I see how you are calling the client? Just to make sure I can reproduce all of the asyncio warnings here.

@tluthra
Author

tluthra commented Aug 30, 2024

Thanks, will try that out. Yeah, I'm using it like this:

Initializing

            cls._client = UnstructuredClientSDK(
                server_url=os.environ["UNSTRUCTURED_SERVER_URL"],
                api_key_auth=None,
            )

More or less how I'm calling it

        with open(local_filename, "rb") as f:
            files = shared.Files(
                content=f.read(),
                file_name=local_filename,
            )

            req = shared.PartitionParameters(
                files=files,
                strategy="auto",
                languages=["eng"],
                chunking_strategy=ChunkingStrategy.BY_TITLE,
                combine_under_n_chars=2500,
                max_characters=3000,
                overlap=250,
            )
            resp = cls._client.general.partition(req)
