
Table parsing doesn't work #191

Closed
netapy opened this issue Aug 20, 2023 · 9 comments
netapy commented Aug 20, 2023

Hi, the table parsing doesn't seem to work at all in my case.
I tried with multiple files (.pdf, .jpeg, .docx...)

It returns most cells as UncategorizedText and a few as Title.

I call the API using the following parameters:

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only"
               if file.filename.lower().endswith((".jpeg", ".jpg", ".png"))
               else "auto")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")

and

async with session.post(
    "http://unstructured-api:8000/general/v0/general",
    headers={'accept': 'application/json'},
    data=data
) as response:

Thanks!

yuming-long commented Aug 22, 2023

Hi @netapy Thank you for reaching out! Currently, the table support parameters only work with the hi_res strategy, so that's why you didn't see the extracted tables.

If you wish to enable table support for PDFs only, you can set data.add_field('pdf_infer_table_structure', "true") together with data.add_field('strategy', "hi_res"), per https://github.com/Unstructured-IO/unstructured-api#pdf-table-extraction.

But more generally, if you want to enable table support for different file types, I would suggest using the skip_infer_table_types parameter and specifying which file types not to skip table support for. For example, you can set data.add_field('skip_infer_table_types', "[]") together with data.add_field('strategy', "hi_res"), which means table support is not skipped for any of our default file types: pdf, jpg, and png. (Doc here: https://github.com/Unstructured-IO/unstructured-api#skip-table-extraction)

If table support works as expected, you will see a text_as_html field inside the metadata field of the returned JSON output, which contains the extracted tables :)
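A minimal sketch of pulling those tables back out of the response, assuming the JSON shape shown elsewhere in this thread (a list of element dicts with type, metadata, and text keys):

```python
# Sketch: collect the extracted tables from the API's JSON output.
# Assumes the response has been parsed into a list of element dicts;
# Table elements carry their HTML in metadata["text_as_html"] when
# table inference succeeded.

def extract_tables(elements):
    """Return the text_as_html payload of every Table element."""
    return [
        el["metadata"]["text_as_html"]
        for el in elements
        if el.get("type") == "Table" and "text_as_html" in el.get("metadata", {})
    ]

sample = [
    {"type": "Title", "metadata": {"filename": "table.pdf"}, "text": "Results"},
    {"type": "Table",
     "metadata": {"filename": "table.pdf",
                  "text_as_html": "<table><tr><td>1</td></tr></table>"},
     "text": "1"},
]
print(extract_tables(sample))  # ['<table><tr><td>1</td></tr></table>']
```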

netapy commented Aug 23, 2023

Thank you for that complete answer.
Unfortunately it doesn't work....

I use the following pdf document as a test : https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only"
               if file.filename.lower().endswith((".jpeg", ".jpg", ".png"))
               else "hi_res")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")  # also tried without
data.add_field('skip_infer_table_types', "[]")

async with session.post(
    "http://unstructured-api:8000/general/v0.0.37/general",  # I tried both the v0.0.37 and v0 endpoints
    headers={'accept': 'application/json'},
    data=data
) as response:
    ...

And here is the output:

[
   {
      "type":"UncategorizedText",
      "element_id":"9f22f5965a6040914e3b03dde86bc6f5",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Example table This is an example of a data table."
   },
   {
      "type":"UncategorizedText",
      "element_id":"1798828015cc74d8682b337c3076b303",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Results"
   },
   {
      "type":"UncategorizedText",
      "element_id":"51e2b20ad3d3718686c5af2219390f3b",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Ballots Incomplete/ Terminated"
   },
   {
      "type":"UncategorizedText",
      "element_id":"f29fb1454ce3644524f2d263ca1c6f87",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Disability Category"
   },
   {
      "type":"UncategorizedText",
      "element_id":"83a1c031bd39d9d60ca64966b04c1042",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Ballots Completed"
   },
   {
      "type":"UncategorizedText",
      "element_id":"13439dbfb9a5ba2c710f5976a03b4209",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"Participants"
   },
   
[...]

   {
      "type":"UncategorizedText",
      "element_id":"2d1f51b216f38179e2158a27df1cdeff",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"3"
   },
   {
      "type":"UncategorizedText",
      "element_id":"2d1f51b216f38179e2158a27df1cdeff",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"3"
   },
   {
      "type":"UncategorizedText",
      "element_id":"41076331dd794a2a155e9a375f6d9227",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"0"
   },
   {
      "type":"UncategorizedText",
      "element_id":"59361434f3a6365b2c1a7cce2be78350",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"95.4%, n=3"
   },
   {
      "type":"UncategorizedText",
      "element_id":"8685a405333b8d0a041bc3df6bfa3008",
      "metadata":{
         "filename":"table.pdf",
         "filetype":"application/pdf",
         "page_number":1
      },
      "text":"1416 sec, n=3"
   }
]
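A quick sanity check on output like the above (a sketch, assuming the response has been parsed into a Python list of dicts): tally the element types and look for at least one Table entry.

```python
# Sketch: tally element types in the API's JSON output. A successful
# table extraction should produce at least one "Table" element; the
# output above contains only UncategorizedText.
from collections import Counter

def element_type_counts(elements):
    return Counter(el["type"] for el in elements)

sample = [
    {"type": "UncategorizedText", "text": "Disability Category"},
    {"type": "UncategorizedText", "text": "Participants"},
    {"type": "Title", "text": "Results"},
]
counts = element_type_counts(sample)
print(counts["Table"])  # 0 -> no tables were extracted
```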

@yuming-long (Contributor)

@netapy Thanks for pointing it out! I reproduced this on that PDF and got the same result. I also tried running with the yolox model, but that didn't seem to help. Here is the code snippet:

import asyncio

import aiohttp
from aiohttp import FormData

url = 'http://127.0.0.1:8000/general/v0/general'
file_path = 'table.pdf'

async def send_request():
    async with aiohttp.ClientSession() as session:
        with open(file_path, 'rb') as file:
            file_content = file.read()
            data = FormData()
            # Add the file field
            data.add_field('files', file_content, filename=file_path, content_type='application/octet-stream')
            # Add other fields
            data.add_field('strategy', "hi_res")
            data.add_field('skip_infer_table_types', "[]")
            data.add_field('hi_res_model_name', "yolox")
            async with session.post(url, data=data) as response:
                response_text = await response.text()
                print(response_text)

asyncio.run(send_request())

Sorry that didn't work. This is a long-standing problem with our detectron2 model, and we are building a quantized version of yolox that can be run in the API and will hopefully improve this.

In the meantime, I would still encourage you to run yolox locally (start the API with make run-web-app and run the code snippet) to see if it makes a difference.

@happysalada

Could you update the readme with the env var that needs to be set to use the yolox model?

yuming-long added a commit that referenced this issue Aug 30, 2023
updating readme as to comment in this GH issue:
#191

### Summary

* add documentation to `hi_res_model_name` parameter in readme
netapy commented Sep 4, 2023

Hi! Thanks for the update – it does parse tables now. However, it's really not that great at it.

Here is my code using YOLOX :

data = aiohttp.FormData()
data.add_field('files', file_content, filename=file.filename, content_type=file.content_type)
data.add_field('ocr_languages', "fra")
data.add_field('strategy', "ocr_only"
               if file.filename.lower().endswith((".jpeg", ".jpg", ".png"))
               else "hi_res")
data.add_field('include_page_breaks', "true")
data.add_field('pdf_infer_table_structure', "true")
data.add_field('hi_res_model_name', "yolox")
data.add_field('skip_infer_table_types', "[]")

async with session.post(
    "http://unstructured-api:8000/general/v0.0.42/general",
    headers={'accept': 'application/json'},
    data=data
) as response:

And here is the html table I get from the sample we talked about (https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf) :

[Screenshot 2023-09-04 at 12:52:20 — rendering of the extracted HTML table]

Why is an ML model needed? Are there no other ways of parsing text-based tables inside a PDF?

Thanks again ! :)

@yuming-long (Contributor)

Hi there!
An ML model is needed because we run inference on the PDF document when calling the partition bricks. The inference pipeline supports different models for finding text elements on a document page and then extracting the contents of those elements. For table parsing specifically, we use an ML model to identify tables and extract text from them.

Right now I can't think of any other way to improve the parsing, but I will raise this issue with our engineering team and see how we can help :)

@yuming-long (Contributor)

Hi @netapy,

May I ask if you are running the table parsing on an M1/M2 chip?
If not, running the yolox detection model and using PaddleOCR for OCR on x86 architecture might help (i.e. pip install unstructured_paddleocr in the env where you run the yolox model). Here is the readme doc for PaddleOCR in the inference repo: https://github.com/Unstructured-IO/unstructured-inference#paddleocr

netapy commented Sep 6, 2023


Hi Yuming –
No, I'm running the API beta via Docker on a dedicated server.
CPU: Intel Core i7-4790K - 4c/8t - 4 GHz/4.4 GHz

Are there easy steps to automate this using the Docker image, or shall I dig into the container's shell?

yuming-long commented Sep 6, 2023

Thanks for following up!

I actually tried it myself and got the "text_as_html" field, but as you said, the output is not great... But since I ran it on x86, paddleocr might not work as expected, so maybe if you run it on an Intel system the result could be different.

{
    "type": "Table",
    "element_id": "a33876037db1821fbfe882a0b5851af3",
    "metadata": {
      "filename": "table.pdf",
      "filetype": "application/pdf",
      "page_number": 1,
      "text_as_html": "<table><thead><th rowspan=\"2\">Disability Category</th><th rowspan=\"2\">Participants</th><th rowspan=\"2\">Ballots Completed</th><th rowspan=\"2\">Ballots Incomplete/ Terminated</th><th colspan=\"2\"></th></thead><thead><th>Accuracy</th><th>Time to complete</th></thead><tr><td></td><td>Kl</td><td></td><td></td><td></td><td></td></tr><tr><td>Dexterity</td><td></td><td></td><td></td><td>98.3%, n=4</td><td>1672.1 sec, n=4</td></tr><tr><td>Mobility</td><td></td><td></td><td></td><td>3 95.4%, n=</td><td>1416 sec, n=3</td></tr></table>"
    },
    "text": "Blind 5 1 34.5%, n=1 1199 sec, n=1 Low Vision 5 2 98.3% n=2 1716 sec, n=3 (97.7%, n=3) | (1934 sec, n=2) Dexterity 5 4 98.3%, n=4 1672.1 sec, n=4 Mobility 3 3 95.4%, n=3 1416 sec, n=3"
  },
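To eyeball what that text_as_html actually contains, here is a stdlib-only sketch that flattens it into rows of cell text (the flat <thead><th>… / <tr><td>… shape is assumed from the payload above):

```python
# Sketch: flatten a text_as_html payload into rows of cell text using
# only the standard library, to sanity-check the extracted table.
from html.parser import HTMLParser

class TableReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows of cell text
        self._row = None    # row currently being filled
        self._cell = None   # characters of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr" or (tag == "thead" and self._row is None):
            self._row = []  # both <tr> and the flat <thead> act as rows here
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("tr", "thead") and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

reader = TableReader()
reader.feed("<table><thead><th>Accuracy</th><th>Time</th></thead>"
            "<tr><td>95.4%, n=3</td><td>1416 sec, n=3</td></tr></table>")
print(reader.rows)  # [['Accuracy', 'Time'], ['95.4%, n=3', '1416 sec, n=3']]
```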

Here are the steps I used to reproduce:

  • In the Dockerfile: add the line && su -l ${NB_USER} -c 'pip3.10 install unstructured_paddleocr' \ after line 29
  • Run make docker-build (if you are building for the first time, it might take some time)
  • Run make docker-start-api
  • Then post like you did before; I used curl to test:

curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'skip_infer_table_types=[]' \
  -F 'files=@table.pdf' \
  -F 'strategy=hi_res' \
  -F 'hi_res_model_name=yolox' \
  | jq -C . | less -R
