Internal server error - Extracting tables from a PDF file #182

skarampatakis · 2023-08-09T07:15:12Z

Hi, I'm using the following request to the API to extract some tables from a PDF file:

curl --location --request POST 'http://localhost:8000/general/v0/general' \
--form 'strategy="hi_res"' \
--form 'pdf_infer_table_structure="true"' \
--form 'files=@"TelecomArgentina_Report 2020.pdf"' \
--form 'ocr_languages="eng"' \
--form 'skip_infer_table_types=""'

The request fails after 14 minutes, and I see the following in the logs:

2023-08-09 06:52:34,237 172.17.0.1:46490 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-08-09 06:52:34,238 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/notebook-user/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 192, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/notebook-user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/notebook-user/prepline_general/api/general.py", line 591, in pipeline_1
    list(response_generator(is_multipart=False))[0]
  File "/home/notebook-user/prepline_general/api/general.py", line 535, in response_generator
    response = pipeline_api(
  File "/home/notebook-user/prepline_general/api/general.py", line 339, in pipeline_api
    elements = partition(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/auto.py", line 221, in partition
    elements = partition_pdf(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/documents/elements.py", line 222, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/file_utils/filetype.py", line 628, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 95, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 184, in partition_pdf_or_image
    layout_elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/utils.py", line 43, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured/partition/pdf.py", line 248, in _partition_pdf_or_image_local
    layout = process_data_with_model(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 344, in process_data_with_model
    layout = process_file_with_model(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 389, in process_file_with_model
    else DocumentLayout.from_file(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 109, in from_file
    page = PageLayout.from_image(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 315, in from_image
    page.get_elements_with_detection_model()
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 235, in get_elements_with_detection_model
    elements = self.get_elements_from_layout(inferred_layout)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 246, in get_elements_from_layout
    elements = [
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 247, in <listcomp>
    get_element_from_block(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layout.py", line 414, in get_element_from_block
    element.text = element.extract_text(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/layoutelement.py", line 32, in extract_text
    text = super().extract_text(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 216, in extract_text
    text = aggregate_by_block(self, image, objects, ocr_strategy)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 316, in aggregate_by_block
    text = ocr(text_region, image, languages=ocr_languages)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/unstructured_inference/inference/elements.py", line 272, in ocr
    return agent.detect(cropped_image)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 122, in detect
    res = self._detect(image)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/layoutparser/ocr/tesseract_agent.py", line 89, in _detect
    res["text"] = pytesseract.image_to_string(
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 423, in image_to_string
    return {
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 426, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 288, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 264, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 1425')
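
A note on reading the final line of the traceback: the first element of the `TesseractError` tuple is the return code of the `tesseract` process, and Python's `subprocess` module reports a negative return code `-N` when a child process was terminated by signal `N` on POSIX. The sketch below decodes such a code. `describe_returncode` is a hypothetical helper, and it assumes pytesseract passes the code through unchanged, as the `proc.returncode` argument in the last frame suggests.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a child-process return code as reported by Python's subprocess module."""
    if returncode < 0:
        # On POSIX, subprocess reports "terminated by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


# The code from the TesseractError in the traceback above.
print(describe_returncode(-8))
```

On Linux, signal 8 is `SIGFPE`, so `(-8, ...)` would indicate that the Tesseract process crashed rather than returned an ordinary error. Under that reading, "Estimating resolution as 1425" is only the last message Tesseract wrote to stderr, not necessarily the cause.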

If I run the same query with the fast strategy, everything works, but the results are not acceptable.
I looked through the Tesseract repo but could not find anything relevant. I would be glad for any help.
I am running the API locally through Docker, all on default settings.
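
To compare the two strategies from a script, the curl request can be sketched in Python as follows. This is a minimal sketch, not the project's client: it assumes the third-party `requests` package is installed and that the endpoint returns JSON by default. The helper names `build_form_fields` and `partition_pdf_via_api` and the `timeout_s` parameter are hypothetical; the URL and form fields are copied from the curl command.

```python
from pathlib import Path

# Endpoint taken from the curl command in this report.
API_URL = "http://localhost:8000/general/v0/general"


def build_form_fields(strategy: str = "hi_res") -> dict:
    """Form fields mirroring the curl command; only `strategy` varies."""
    return {
        "strategy": strategy,
        "pdf_infer_table_structure": "true",
        "ocr_languages": "eng",
        "skip_infer_table_types": "",
    }


def partition_pdf_via_api(pdf_path: str, strategy: str = "hi_res",
                          timeout_s: float = 1800.0):
    """POST a PDF to the local API and return the decoded JSON response.

    `timeout_s` is an illustrative client-side limit, not an API setting.
    """
    import requests  # third-party; imported lazily so the helpers above stay stdlib-only

    path = Path(pdf_path)
    with path.open("rb") as fh:
        response = requests.post(
            API_URL,
            data=build_form_fields(strategy),
            files={"files": (path.name, fh, "application/pdf")},
            timeout=timeout_s,
        )
    # A 500 like the one in the logs surfaces here as requests.HTTPError.
    response.raise_for_status()
    return response.json()
```

Calling `partition_pdf_via_api("TelecomArgentina_Report 2020.pdf", strategy="fast")` and then again with `strategy="hi_res"` would reproduce the comparison described above.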

shreyanid · 2023-08-09T16:32:49Z

Thanks @skarampatakis! Do you mind sharing the file that caused this error?

skarampatakis · 2023-08-10T06:32:23Z

It is this one: Integrated Report 2020.pdf. Thanks a lot for taking a look.

* build(release): bump unstructured-inference
  Related to downstream issue: Unstructured-IO/unstructured-api#182
  And upstream PR: Unstructured-IO/unstructured-inference#165
  Co-authored-by: Shreya Nidadavolu

* Related to downstream issue: #182
  And upstream PR: Unstructured-IO/unstructured-inference#165
  * remove test_parallel_mode_correct_result
  * dropped the file_directory field from elements metadata

awalker4 · 2023-08-15T18:09:22Z

This is fixed as of 0.0.35! You can now get the latest image from quay, or pull the repo and rebuild.

skarampatakis · 2023-08-21T08:30:07Z

Thanks a lot for solving that issue so quickly. I get no Tesseract errors now. The remaining problem is that the tables are not extracted properly, but I think this is already covered in #191; it seems I have the same issue.

cragwolfe mentioned this issue Aug 10, 2023

build(release): bump unstructured-inference Unstructured-IO/unstructured#1074

Merged

shreyanid mentioned this issue Aug 11, 2023

build(release): bump unstructured #183

Merged

awalker4 closed this as completed Aug 15, 2023

awalker4 mentioned this issue Aug 18, 2023

bug/another instance of TesseractError - Estimating resolution as x Unstructured-IO/unstructured-inference#179

Closed
