fix: coordinates bug on pdf parsing #1462
@@ -761,7 +761,7 @@ def check_coords_within_boundary(
         a float ranges from [0,1] to scale the horizontal (x-axis) boundary
     """
     if not coord_has_valid_points(coordinates) and not coord_has_valid_points(boundary):
-        raise ValueError("Invalid coordinates.")
+        return False
worth adding a trace detail message i think, like:
trace_logger.detail(  # type: ignore
.
is there a single page this fails on that could be added to a test? one could extract the problematic page with something like:
qpdf --empty --pages original.pdf 89 -- ~/tmp/docs/single-page-p89.pdf
finally, this does not need to be addressed in this PR, but what is the reason for the invalid coordinate? is it negative values? (which i think are actually legit in PDFs in certain cases)
Sounds good - I'll add trace detail and a test!
And yes - the values that it was erroring on were negative.
here's where it was failing: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/sorting.py#L30
🎉
Co-authored-by: cragwolfe <[email protected]>
Addresses: #1460
We were raising an error on invalid coordinates, which stopped PDF parsing before the element could be returned. Now, instead of raising, we return early with `False` so parsing can continue.
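As a rough illustration of the behavior change, here is a minimal, self-contained sketch. It is not the library's implementation: `has_valid_points` and `within_boundary` are hypothetical stand-ins, validity is assumed to mean "four points with no negative values" (matching the negative values mentioned in the review), the sketch returns early when either input is invalid, and the scaling parameters of the real function are left out.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def has_valid_points(points: Sequence[Point]) -> bool:
    """Hypothetical validity check: exactly four points, none negative."""
    if len(points) != 4:
        return False
    return all(x >= 0 and y >= 0 for x, y in points)


def within_boundary(coords: Sequence[Point], boundary: Sequence[Point]) -> bool:
    """Return True if every point of coords lies inside boundary's bounding box.

    Invalid input yields False instead of raising, so a caller looping over
    many elements can keep going.
    """
    if not has_valid_points(coords) or not has_valid_points(boundary):
        # Before the change, the equivalent branch raised ValueError.
        return False
    xs = [x for x, _ in boundary]
    ys = [y for _, y in boundary]
    return all(
        min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys) for x, y in coords
    )


page = [(0, 0), (0, 10), (10, 10), (10, 0)]
inside = [(1, 1), (1, 2), (2, 2), (2, 1)]
negative = [(-1, 1), (-1, 2), (2, 2), (2, 1)]

print(within_boundary(inside, page))
print(within_boundary(negative, page))
```

With an early return, an element with negative coordinates is simply reported as not within the boundary, and the caller decides what to do with it.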
to test: