
Table parsing doesn't work #191

Closed
netapy opened this issue Aug 20, 2023 · 9 comments
netapy commented Aug 20, 2023

Hi, the table parsing doesn't seem to work at all in my case.
I tried with multiple files (.pdf, .jpeg, .docx...)

It returns most cells as UncategorizedText and a few as Title.

I call the API using the following parameters:

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only"
               if file.filename.lower().endswith((".jpeg", ".jpg", ".png"))
               else "auto")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")

and

async with session.post(
    "http://unstructured-api:8000/general/v0/general",
    headers={'accept': 'application/json'},
    data=data
) as response:

Thanks!

yuming-long commented Aug 22, 2023

Hi @netapy Thank you for reaching out! Currently, the table support parameters only work with the hi_res strategy, so that's why you didn't see the extracted tables.

If you wish to enable table support for PDFs only, you can set data.add_field('pdf_infer_table_structure', "true") together with data.add_field('strategy', "hi_res"), per https://github.com/Unstructured-IO/unstructured-api#pdf-table-extraction.

But more generally, if you want to enable table support for different file types, I would suggest using the skip_infer_table_types parameter and specifying which file types not to skip table support for. For example, you can set data.add_field('skip_infer_table_types', "[]") together with data.add_field('strategy', "hi_res"), which means table support is not skipped for any of our default file types: pdf, jpg, and png. (Doc here: https://github.com/Unstructured-IO/unstructured-api#skip-table-extraction)

If table support works as expected, you will see a text_as_html field inside the metadata field of the returned JSON output, which contains the extracted tables :)
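A minimal sketch of pulling those tables back out of the response, assuming the JSON shape shown elsewhere in this thread (a list of element dicts with type, metadata, and text keys):

```python
# Sketch: collect the extracted tables from the API's JSON output.
# Assumes the response has been parsed into a list of element dicts;
# Table elements carry their HTML in metadata["text_as_html"] when
# table inference succeeded.

def extract_tables(elements):
    """Return the text_as_html payload of every Table element."""
    return [
        el["metadata"]["text_as_html"]
        for el in elements
        if el.get("type") == "Table" and "text_as_html" in el.get("metadata", {})
    ]

sample = [
    {"type": "Title", "metadata": {"filename": "table.pdf"}, "text": "Results"},
    {"type": "Table",
     "metadata": {"filename": "table.pdf",
                  "text_as_html": "<table><tr><td>1</td></tr></table>"},
     "text": "1"},
]
print(extract_tables(sample))  # ['<table><tr><td>1</td></tr></table>']
```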

netapy commented Aug 23, 2023

Thank you for that complete answer.
Unfortunately it doesn't work....

I use the following pdf document as a test : https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only"
               if file.filename.lower().endswith((".jpeg", ".jpg", ".png"))
               else "hi_res")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")  # also tried without
data.add_field('skip_infer_table_types', "[]")

async with session.post(
    "http://unstructured-api:8000/general/v0.0.37/general",  # I tried both the v0.0.37 and v0 endpoints
    headers={'accept': 'application/json'},
    data=data
) as response:
    ...

And here is the output:

[
   {
      "type":"UncategorizedText",
      "element_id":"9f22f5965a6040914e3b03dde86bc6f5",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Example table This is an example of a data table."
   },
   {
      "type":"UncategorizedText",
      "element_id":"1798828015cc74d8682b337c3076b303",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Results"
   },
   {
      "type":"UncategorizedText",
      "element_id":"51e2b20ad3d3718686c5af2219390f3b",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Ballots Incomplete/ Terminated"
   },
   {
      "type":"UncategorizedText",
      "element_id":"f29fb1454ce3644524f2d263ca1c6f87",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Disability Category"
   },
   {
      "type":"UncategorizedText",
      "element_id":"83a1c031bd39d9d60ca64966b04c1042",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Ballots Completed"
   },
   {
      "type":"UncategorizedText",
      "element_id":"13439dbfb9a5ba2c710f5976a03b4209",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Participants"
   },
   
[...]

   {
      "type":"UncategorizedText",
      "element_id":"2d1f51b216f38179e2158a27df1cdeff",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"3"
   },
   {
      "type":"UncategorizedText",
      "element_id":"2d1f51b216f38179e2158a27df1cdeff",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"3"
   },
   {
      "type":"UncategorizedText",
      "element_id":"41076331dd794a2a155e9a375f6d9227",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"0"
   },
   {
      "type":"UncategorizedText",
      "element_id":"59361434f3a6365b2c1a7cce2be78350",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"95.4%, n=3"
   },
   {
      "type":"UncategorizedText",
      "element_id":"8685a405333b8d0a041bc3df6bfa3008",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"1416 sec, n=3"
   }
]
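A quick sanity check on output like the above (a sketch, assuming the response has been parsed into a Python list of dicts): tally the element types and look for at least one Table entry.

```python
# Sketch: tally element types in the API's JSON output. A successful
# table extraction should produce at least one "Table" element; the
# output above contains only UncategorizedText.
from collections import Counter

def element_type_counts(elements):
    return Counter(el["type"] for el in elements)

sample = [
    {"type": "UncategorizedText", "text": "Disability Category"},
    {"type": "UncategorizedText", "text": "Participants"},
    {"type": "Title", "text": "Results"},
]
counts = element_type_counts(sample)
print(counts["Table"])  # 0 -> no tables were extracted
```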

@yuming-long (Contributor)

@netapy Thanks for pointing it out! I reproduced this on that PDF and got the same result. I also tried running with the yolox model, but that didn't seem to help. Here is the code snippet:

import asyncio

import aiohttp
from aiohttp import FormData

url = 'http://127.0.0.1:8000/general/v0/general'
file_path = 'table.pdf'

async def send_request():
    async with aiohttp.ClientSession() as session:
        with open(file_path, 'rb') as file:
            file_content = file.read()
            data = FormData()
            # Add the file field
            data.add_field('files', file_content, filename=file_path, content_type='application/octet-stream')
            # Add other fields
            data.add_field('strategy', "hi_res")
            data.add_field('skip_infer_table_types', "[]")
            data.add_field('hi_res_model_name', "yolox")
            async with session.post(url, data=data) as response:
                response_text = await response.text()
                print(response_text)

asyncio.run(send_request())

Sorry that didn't work. This is a long-standing problem with our detectron2 model, and we are building a quantized version of yolox that can be run in the API and will hopefully improve this.

In the meantime, I would still encourage you to run yolox locally (start the API with make run-web-app and run the code snippet) to see if it makes a difference.

@happysalada

Could you update the readme with the env var that needs to be set to use the yolox model?

yuming-long added a commit that referenced this issue Aug 30, 2023
updating readme as to comment in this GH issue:
#191

### Summary

* add documentation to `hi_res_model_name` parameter in readme
netapy commented Sep 4, 2023

Hi! Thanks for the update – it does parse tables now. However, it's really not that great at it.

Here is my code using YOLOX :

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only"
               if file.filename.lower().endswith((".jpeg", ".jpg", ".png"))
               else "hi_res")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")
data.add_field('hi_res_model_name', "yolox")
data.add_field('skip_infer_table_types', "[]")

async with session.post(
    "http://unstructured-api:8000/general/v0.0.42/general",
    headers={'accept': 'application/json'},
    data=data
) as response:

And here is the html table I get from the sample we talked about (https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf) :

[Screenshot 2023-09-04 at 12:52:20 — rendering of the extracted HTML table]

Why is an ML model needed? Are there no other ways of parsing text-based tables inside a PDF?

Thanks again ! :)

@yuming-long (Contributor)

Hi there!
An ML model is needed because we run inference on the PDF document when calling the partition bricks. The inference pipeline supports different models for finding text elements on a document page and then extracting the contents of those elements. For table parsing specifically, we use an ML model to identify tables and extract text from them.

Right now I can't think of any other way to improve the parsing, but I will raise this issue with our engineering team and see how we can help :)

@yuming-long (Contributor)

Hi @netapy,

May I ask if you are running the table parsing on an M1/M2 chip?
If not, running the yolox detection model and using PaddleOCR for OCR on x86 architecture might help (i.e. pip install unstructured_paddleocr in the env where you run the yolox model). Here is the readme doc for PaddleOCR in the inference repo: https://github.com/Unstructured-IO/unstructured-inference#paddleocr

netapy commented Sep 6, 2023


Hi Yuming –
No, I'm running the API beta via Docker on a dedicated server.
CPU: Intel Core i7-4790K - 4c/8t - 4 GHz/4.4 GHz

Are there easy steps to automate this using the Docker image, or shall I dig into the container's shell?

yuming-long commented Sep 6, 2023

Thanks for following up!

I actually tried it myself and got the "text_as_html" field, but as you said, the output is not great... But since I ran it on x86, paddleocr might not work as expected, so maybe if you run it on an Intel system the result could be different.

{
    "type": "Table",
    "element_id": "a33876037db1821fbfe882a0b5851af3",
    "metadata": {
      "filename": "table.pdf",
      "filetype": "application/pdf",
      "page_number": 1,
      "text_as_html": "<table><thead><th rowspan=\"2\">Disability Category</th><th rowspan=\"2\">Participants</th><th rowspan=\"2\">Ballots Completed</th><th rowspan=\"2\">Ballots Incomplete/ Terminated</th><th colspan=\"2\"></th></thead><thead><th>Accuracy</th><th>Time to complete</th></thead><tr><td></td><td>Kl</td><td></td><td></td><td></td><td></td></tr><tr><td>Dexterity</td><td></td><td></td><td></td><td>98.3%, n=4</td><td>1672.1 sec, n=4</td></tr><tr><td>Mobility</td><td></td><td></td><td></td><td>3 95.4%, n=</td><td>1416 sec, n=3</td></tr></table>"
    },
    "text": "Blind 5 1 34.5%, n=1 1199 sec, n=1 Low Vision 5 2 98.3% n=2 1716 sec, n=3 (97.7%, n=3) | (1934 sec, n=2) Dexterity 5 4 98.3%, n=4 1672.1 sec, n=4 Mobility 3 3 95.4%, n=3 1416 sec, n=3"
  },
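To eyeball what that text_as_html actually contains, here is a stdlib-only sketch that flattens it into rows of cell text (the flat <thead><th>… / <tr><td>… shape is assumed from the payload above):

```python
# Sketch: flatten a text_as_html payload into rows of cell text using
# only the standard library, to sanity-check the extracted table.
from html.parser import HTMLParser

class TableReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows of cell text
        self._row = None    # row currently being filled
        self._cell = None   # characters of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr" or (tag == "thead" and self._row is None):
            self._row = []  # both <tr> and the flat <thead> act as rows here
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("tr", "thead") and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

reader = TableReader()
reader.feed("<table><thead><th>Accuracy</th><th>Time</th></thead>"
            "<tr><td>95.4%, n=3</td><td>1416 sec, n=3</td></tr></table>")
print(reader.rows)  # [['Accuracy', 'Time'], ['95.4%, n=3', '1416 sec, n=3']]
```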

Here are the steps I used to reproduce:

  • In the Dockerfile: add the line && su -l ${NB_USER} -c 'pip3.10 install unstructured_paddleocr' \ after line 29
  • Run make docker-build (if you are building for the first time, it might take some time)
  • Run make docker-start-api
  • Then post like you did before; I used curl to test:

curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'skip_infer_table_types=[]' \
  -F 'files=@table.pdf' \
  -F 'strategy=hi_res' \
  -F 'hi_res_model_name=yolox' \
  | jq -C . | less -R
