chore: process chipper hierarchy #1634

Merged
merged 32 commits into from
Oct 13, 2023

@qued commented Oct 3, 2023

This PR supports the schema changes introduced by PR 232 in unstructured-inference.

Specifically, the following needs to be supported:

  • A change to the structure of LayoutElement in unstructured-inference: the class is no longer a subclass of Rectangle. Instead, LayoutElement has a bbox property that captures the location information and a from_coords method that constructs a LayoutElement directly from coordinates.
  • Removal of LocationlessLayoutElement, since chipper now exports bounding boxes. If we need to support elements without bounding boxes, we can make the bbox property optional.
  • Getting hierarchy data directly from the inference elements rather than deriving it in post-processing.
  • No reordering of elements received from chipper v2, since they should already be ordered.
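
The schema change in the first two bullets can be pictured with a minimal sketch. The classes below are simplified, hypothetical stand-ins, not the actual unstructured-inference code: only the names LayoutElement, Rectangle, bbox, and from_coords come from the description above, while the field names, the from_coords signature, and the parent attribute are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Rectangle:
    """Axis-aligned box; the field names are assumed for this sketch."""

    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def width(self) -> float:
        return self.x2 - self.x1


@dataclass
class LayoutElement:
    """Holds a Rectangle (composition) instead of subclassing it."""

    # Optional, so an element without location data could still be represented.
    bbox: Optional[Rectangle]
    text: str = ""
    type: Optional[str] = None
    # Hypothetical hierarchy link carried by the inference element itself.
    parent: Optional[LayoutElement] = None

    @classmethod
    def from_coords(cls, x1, y1, x2, y2, **kwargs) -> LayoutElement:
        """Build an element directly from coordinates."""
        return cls(bbox=Rectangle(x1, y1, x2, y2), **kwargs)


title = LayoutElement.from_coords(0, 0, 100, 20, text="Introduction", type="Title")
body = LayoutElement.from_coords(0, 25, 100, 80, text="Some text.", parent=title)
print(body.bbox.width)
print(body.parent.text)
```

Under this layout, code that previously read coordinates from the element itself would read them from the bbox attribute instead.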

Testing:

The following demonstrates that the new version of chipper is inferring hierarchy.

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

Also verify that running the traditional hi_res gives different results:

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)
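
To compare the two runs, one option is to group child elements under their parents and diff the resulting dictionaries. The helper below is only a sketch and is not part of unstructured: group_children_by_parent and fake_element are hypothetical names, and the code assumes each element exposes text and metadata.parent_id. Stand-in objects replace real partition output so the snippet runs without the models installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_children_by_parent(elements):
    """Map each parent id to the texts of its child elements, in input order."""
    grouped = defaultdict(list)
    for el in elements:
        parent_id = el.metadata.parent_id
        if parent_id is not None:
            grouped[parent_id].append(el.text)
    return dict(grouped)


def fake_element(element_id, text, parent_id=None):
    """Stand-in for a partitioned element (assumed attributes only)."""
    return SimpleNamespace(
        id=element_id,
        text=text,
        metadata=SimpleNamespace(parent_id=parent_id),
    )


sample = [
    fake_element("t1", "1 Introduction"),
    fake_element("p1", "First paragraph.", parent_id="t1"),
    fake_element("p2", "Second paragraph.", parent_id="t1"),
]
hierarchy = group_children_by_parent(sample)
print(hierarchy)
```

With real output, the elements list returned by each partition_pdf call would be passed in place of sample.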

@qued marked this pull request as ready for review October 12, 2023 17:27
@christinestraub left a comment

LGTM!

qued commented Oct 12, 2023

Ingest test diffs are caused by this change, and I think this is the sole source of diffs.

@christinestraub

> Ingest test diffs are caused by this change, and I think this is the sole source of diffs.

Yes, that's right.

@ajjimeno self-requested a review October 12, 2023 22:23
@ajjimeno left a comment

LGTM!

@qued enabled auto-merge October 13, 2023 00:39
@qued added this pull request to the merge queue Oct 13, 2023
Merged via the queue into main with commit 8100f1e Oct 13, 2023
@qued deleted the chore/process-chipper-hierarchy branch October 13, 2023 02:05

5 participants