Internal server error - Extracting tables from a PDF file #182

Closed
skarampatakis opened this issue Aug 9, 2023 · 4 comments

@skarampatakis

Hi, I'm using the following request to the API to extract some tables from a PDF file:

curl --location --request POST 'http://localhost:8000/general/v0/general' \
--form 'strategy="hi_res"' \
--form 'pdf_infer_table_structure="true"' \
--form 'files=@"TelecomArgentina_Report 2020.pdf"' \
--form 'ocr_languages="eng"' \
--form 'skip_infer_table_types=""'

The request fails after 14 minutes, and I see the following in the logs:

2023-08-09 06:52:34,237 172.17.0.1:46490 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-08-09 06:52:34,238 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 192, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/notebook-user/prepline_general/api/general.py", line 591, in pipeline_1
    list(response_generator(is_multipart=False))[0]
  File "/home/notebook-user/prepline_general/api/general.py", line 535, in response_generator
    response = pipeline_api(
  File "/home/notebook-user/prepline_general/api/general.py", line 339, in pipeline_api
    elements = partition(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/auto.py", line 221, in partition
    elements = partition_pdf(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/documents/elements.py", line 222, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/file_utils/filetype.py", line 628, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 95, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 184, in partition_pdf_or_image
    layout_elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/utils.py", line 43, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 248, in _partition_pdf_or_image_local
    layout = process_data_with_model(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 344, in process_data_with_model
    layout = process_file_with_model(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 389, in process_file_with_model
    else DocumentLayout.from_file(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 109, in from_file
    page = PageLayout.from_image(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 315, in from_image
    page.get_elements_with_detection_model()
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 235, in get_elements_with_detection_model
    elements = self.get_elements_from_layout(inferred_layout)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 246, in get_elements_from_layout
    elements = [
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 247, in <listcomp>
    get_element_from_block(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 414, in get_element_from_block
    element.text = element.extract_text(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layoutelement.py", line 32, in extract_text
    text = super().extract_text(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 216, in extract_text
    text = aggregate_by_block(self, image, objects, ocr_strategy)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 316, in aggregate_by_block
    text = ocr(text_region, image, languages=ocr_languages)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 272, in ocr
    return agent.detect(cropped_image)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 122, in detect
    res = self._detect(image)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 89, in _detect
    res["text"] = pytesseract.image_to_string(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 423, in image_to_string
    return {
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 426, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 288, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 264, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 1425')
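
A note on reading the last line of the traceback: the frame above shows pytesseract raising `TesseractError(proc.returncode, ...)`, and Python's `subprocess` module reports a negative return code when the child process was terminated by a signal. The sketch below decodes such a value. It assumes the return code is passed through unchanged, and `describe_returncode` is a hypothetical helper name, not part of pytesseract.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code in plain words.

    Negative values mean the child was terminated by a signal
    whose number is the absolute value of the return code.
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


# First argument of the TesseractError in the traceback.
print(describe_returncode(-8))
```

If that reading is right, the -8 suggests the tesseract process was killed by a signal rather than exiting with an error of its own, and 'Estimating resolution as 1425' may simply be the last message it wrote to stderr before that happened.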

If I run the same request with the fast strategy, everything works, but the results are not acceptable.
I took a look at the Tesseract repo but could not find anything relevant. I would be glad for any help.
I am running the API locally through Docker, with all default settings.
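
For anyone reproducing this from a script rather than curl, here is a minimal sketch of the same request in Python. It assumes the third-party `requests` package is installed and that the API is listening on the URL from the curl command. The function names and the 30-minute timeout are illustrative choices (the request above took about 14 minutes before failing), not anything defined by the API.

```python
from pathlib import Path

# URL taken from the curl command in the report.
API_URL = "http://localhost:8000/general/v0/general"


def build_form_fields(strategy: str = "hi_res") -> dict:
    """Return the same form fields as the curl command, with a selectable strategy."""
    return {
        "strategy": strategy,
        "pdf_infer_table_structure": "true",
        "ocr_languages": "eng",
        "skip_infer_table_types": "",
    }


def post_pdf(path: str, strategy: str = "hi_res", timeout_s: float = 1800.0):
    """POST a PDF to the locally running API and return the decoded JSON body."""
    # Imported here so build_form_fields() can be used without requests installed.
    import requests

    pdf = Path(path)
    with pdf.open("rb") as fh:
        response = requests.post(
            API_URL,
            data=build_form_fields(strategy),
            files={"files": (pdf.name, fh, "application/pdf")},
            timeout=timeout_s,  # assumed value; hi_res on a long report can be slow
        )
    # A 500 response like the one in the logs raises requests.HTTPError here.
    response.raise_for_status()
    return response.json()
```

Calling `post_pdf("TelecomArgentina_Report 2020.pdf")` would send the hi_res request, and `post_pdf(..., strategy="fast")` the variant that completes without the error.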

@shreyanid
Contributor

Thanks @skarampatakis! Do you mind sharing the file that caused this error?

@skarampatakis
Author

It is this one: Integrated Report 2020.pdf. Thanks a lot for taking a look.

cragwolfe added a commit to Unstructured-IO/unstructured that referenced this issue Aug 10, 2023
* build(release): bump unstructured-inference

Related to downstream issue:
Unstructured-IO/unstructured-api#182

And upstream PR:
Unstructured-IO/unstructured-inference#165

---------

Co-authored-by: Shreya Nidadavolu <[email protected]>

shreyanid added a commit that referenced this issue Aug 14, 2023
Related to downstream issue: #182
And upstream PR: Unstructured-IO/unstructured-inference#165

* remove test_parallel_mode_correct_result
* dropped the file_directory field from elements metadata

@awalker4
Collaborator

This is fixed as of 0.0.35! You can now get the latest image from quay, or pull the repo and rebuild.

@skarampatakis
Author

Thanks a lot for solving that issue so quickly. I get no Tesseract errors now. The remaining problem is that the tables are not extracted properly, but I think this is already covered in #191; it looks like I have the same issue.
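
A quick way to check what the API returned for tables is to filter the response for table elements. The sketch below assumes the response is a JSON list of elements in which tables carry `"type": "Table"` and, when table structure inference is enabled, an HTML rendering under `metadata.text_as_html`. Those field names are my understanding of the response shape and worth confirming against an actual response; the `sample` data is invented purely for illustration.

```python
from typing import Any, Dict, List


def table_html(elements: List[Dict[str, Any]]) -> List[str]:
    """Collect one string per Table element: its HTML if present, else its plain text."""
    tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        html = element.get("metadata", {}).get("text_as_html")
        tables.append(html if html is not None else element.get("text", ""))
    return tables


# Invented sample standing in for a real API response.
sample = [
    {"type": "Title", "text": "Integrated Report", "metadata": {}},
    {
        "type": "Table",
        "text": "Year Revenue",
        "metadata": {"text_as_html": "<table><tr><td>Year</td><td>Revenue</td></tr></table>"},
    },
    {"type": "Table", "text": "Flat text only", "metadata": {}},
]

for table in table_html(sample):
    print(table)
```

An empty result, or entries that fall back to flat text, would indicate that no table structure came back for the document.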
