bug: synchronize the default layout model #1604

Closed
christinestraub opened this issue Oct 2, 2023 · 1 comment · Fixed by #1613
Labels: bug (Something isn't working)

christinestraub (Collaborator) commented Oct 2, 2023

Describe the bug
Currently, unstructured and unstructured-inference use different default layout models, so the two libraries extract noticeably different elements from the same document.

To Reproduce
PDF: references.pdf

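  • unstructured
    from unstructured.partition.pdf import partition_pdf
    elements = partition_pdf(filename="references.pdf", strategy="hi_res")
    print(len(elements))
  • unstructured-inference
    from unstructured_inference.inference.layout import process_file_with_model
    # model_name=None selects the library's default layout model
    layout = process_file_with_model(filename="references.pdf", model_name=None)
    print(len(layout.pages[0].elements))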
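
Comparing only the two lengths hides which kinds of elements differ. Below is a minimal sketch, not part of either library, that compares per-type counts between two lists of element type labels. The helper name `count_diff` and the sample labels are hypothetical and purely illustrative. They are not real output from references.pdf.

```python
from collections import Counter


def count_diff(types_a, types_b):
    """Return {type: (count_a, count_b)} for every type whose counts differ."""
    a, b = Counter(types_a), Counter(types_b)
    return {t: (a[t], b[t]) for t in sorted(a.keys() | b.keys()) if a[t] != b[t]}


# Illustrative placeholder labels, NOT real output from either library.
from_unstructured = ["Title", "NarrativeText", "NarrativeText", "ListItem"]
from_inference = ["Title", "Text", "Text", "Text", "List-item"]

for element_type, (n_a, n_b) in count_diff(from_unstructured, from_inference).items():
    print(f"{element_type}: {n_a} vs {n_b}")
```

For the unstructured result, the label list could be built with something like `[type(el).__name__ for el in elements]`. How the labels are read from the unstructured-inference layout elements depends on the installed version, so that step is left as an assumption here.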

Screenshots

  • Elements extracted with the unstructured library
    [image: hi_res-1]

  • Elements extracted with the unstructured-inference library
    [image: references_1_final]

Expected behavior
Both libraries should use the same default layout model, so the elements they extract from the same PDF should be broadly similar.
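
One way to make this expectation checkable is to measure how many bounding boxes from one library have a close counterpart in the other. The sketch below is an assumption about how that could be done and is not something either library provides. Boxes are plain `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and the 0.5 IoU threshold is an arbitrary choice.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matched_fraction(boxes_a, boxes_b, threshold=0.5):
    """Fraction of boxes in boxes_a matched one-to-one (greedily) to a box
    in boxes_b with IoU >= threshold."""
    unused = list(boxes_b)
    matched = 0
    for a in boxes_a:
        best = max(unused, key=lambda b: iou(a, b), default=None)
        if best is not None and iou(a, best) >= threshold:
            unused.remove(best)
            matched += 1
    return matched / len(boxes_a) if boxes_a else 1.0


# Made-up boxes for illustration only.
boxes_a = [(0, 0, 2, 2), (5, 5, 6, 6)]
boxes_b = [(0, 0, 2, 2)]
print(matched_fraction(boxes_a, boxes_b))
```

Greedy matching is order-dependent and can miss the best global assignment. It is used here only because it keeps the sketch short.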

Environment Info

unstructured             0.10.18
unstructured-inference   0.6.6
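
The version table above can be reproduced on another machine with the standard library's `importlib.metadata`. This is a small sketch; the helper name `installed_version` is hypothetical, and the package names are the ones listed in this issue.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or "not installed"."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


for name in ("unstructured", "unstructured-inference"):
    print(f"{name:<24} {installed_version(name)}")
```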

Additional context
This issue is related to issue #1602.

christinestraub (Collaborator, Author) commented

Addressed by PR #1607

cragwolfe pushed a commit that referenced this issue Oct 5, 2023
This PR was initially created to close GitHub Issue #1604 (Synchronizing the default
layout model), but since it was already resolved in PR #1607, this
PR now only adds the visualization script used to investigate the issue.

### Summary
- Add a Python script to annotate elements

PDF:
[references.pdf](https://github.com/Unstructured-IO/unstructured/files/12778270/references.pdf)

### Evaluation
```
PYTHONPATH=. python examples/layout-analysis/visualization.py references.pdf hi_res
```