Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: coordinates bug on pdf parsing #1462

Merged
merged 8 commits into from
Sep 20, 2023
Merged

fix: coordinates bug on pdf parsing #1462

merged 8 commits into from
Sep 20, 2023

Conversation

amanda103
Copy link
Contributor

Addresses: #1460

We were raising an error with invalid coordinates, which prevented us from continuing to return the element and continue parsing the pdf. Now instead of raising the error we'll return early.

to test:

from unstructured.partition.auto import partition

elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast")

@amanda103 amanda103 requested a review from cragwolfe September 19, 2023 21:50
@@ -761,7 +761,7 @@ def check_coords_within_boundary(
a float ranges from [0,1] to scale the horizontal (x-axis) boundary
"""
if not coord_has_valid_points(coordinates) and not coord_has_valid_points(boundary):
raise ValueError("Invalid coordinates.")
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding a trace detail message i think, like:

trace_logger.detail( # type: ignore

.
is there a single page this fails on that could be added to a test? one could extract the problematic page with something like:

qpdf --empty --pages orignal.pdf 89 -- ~/tmp/docs/single-page-p89.pdf

finally, does not need to be addressed in this PR, but what is the why for the invalid coordinate? is it negative values? (which i think are actually legit in PDF's in certain cases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good - I'll add trace detail and a test!

And yes - the values that it was erroring on were negative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

amanda103 and others added 2 commits September 19, 2023 16:42
Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: cragwolfe <[email protected]>
@amanda103 amanda103 merged commit e359afa into main Sep 20, 2023
39 checks passed
@amanda103 amanda103 deleted the amanda/coordinates-bug branch September 20, 2023 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants