feat: change default hi_res model to yolox quantized #1607

Merged
merged 8 commits into from
Oct 4, 2023

badGarnet
Collaborator

  • Refactor how the default model for hi_res mode is set for image and PDF partitioning: a callable function now returns either the value of an environment variable or a default model name.
  • This keeps the current pattern for setting the default hi_res model.
  • Change the default model name from `detectron2_onnx` to `yolox_quantized`.
  • The new default model has better recall for tables and richer categories for partitioned elements than detectron2.
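
The pattern described in the list above could look roughly like the following minimal sketch. The environment variable name, the function names, and the helper `resolve_model_name` are assumptions made for illustration; the PR description does not spell them out, so the merged code may differ.

```python
import os
from typing import Optional

# Assumed names for illustration only; not taken from the PR text.
HI_RES_MODEL_ENV_VAR = "UNSTRUCTURED_HI_RES_MODEL_NAME"
DEFAULT_HI_RES_MODEL = "yolox_quantized"


def default_hi_res_model() -> str:
    """Return the model name from the environment, or the built-in default.

    The environment is read on every call rather than once at import time,
    so a change to the variable is picked up by the next call.
    """
    return os.environ.get(HI_RES_MODEL_ENV_VAR, DEFAULT_HI_RES_MODEL)


def resolve_model_name(model_name: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit argument wins over the default."""
    return model_name or default_hi_res_model()
```

With a function like this, image and PDF partitioning can share one source of truth for the default, and a caller can still pin the previous model by setting the environment variable to `detectron2_onnx`.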

badGarnet and others added 2 commits October 2, 2023 15:41
…ixtures update (#1618)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: badGarnet <[email protected]>
@badGarnet
Collaborator Author

Please review the comments in #1618 for the ingest diff evaluation.

@badGarnet badGarnet marked this pull request as ready for review October 3, 2023 02:22
@badGarnet badGarnet requested review from qued and MthwRobinson October 3, 2023 02:23
Contributor

@benjats07 benjats07 left a comment

LGTM. However, I would like to have metrics to back this change (not necessarily a blocker). I was able to manually find a lot of the previous model's text in the new ingest tests.

@badGarnet
Collaborator Author

> LGTM. However, I would like to have metrics to back this change (not necessarily a blocker). I was able to manually find a lot of the previous model's text in the new ingest tests.

Actually, do we have gold labels for those ingest test documents? If not, maybe that is something we should have even before generating metrics.

Contributor

@qued qued left a comment

LGTM other than a documentation nit.

unstructured/partition/pdf.py (review thread outdated and resolved)
Co-authored-by: qued <[email protected]>
@badGarnet badGarnet enabled auto-merge (squash) October 4, 2023 03:03
@badGarnet badGarnet merged commit 19d8bff into main Oct 4, 2023
39 checks passed
@badGarnet badGarnet deleted the yao/set-default-element-to-use-yolox branch October 4, 2023 03:28
cragwolfe pushed a commit that referenced this pull request Oct 5, 2023
This PR was initially created to close GitHub Issue #1604 (Synchronizing the default
layout model), but since it was already resolved in PR
[#1607](#1607), this
PR now only adds the visualization script used to investigate the issue.

### Summary
- add python script to annotate elements

PDF:
[references.pdf](https://github.com/Unstructured-IO/unstructured/files/12778270/references.pdf)

### Evaluation
```
PYTHONPATH=. python examples/layout-analysis/visualization.py references.pdf hi_res
```
5 participants