-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: invalid cm/tm in visitor functions #2206
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #2206 +/- ##
==========================================
+ Coverage 94.40% 94.44% +0.04%
==========================================
Files 43 43
Lines 7599 7621 +22
Branches 1501 1502 +1
==========================================
+ Hits 7174 7198 +24
+ Misses 262 261 -1
+ Partials 163 162 -1
☔ View full report in Codecov by Sentry. |
@troethe, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I think this approach is currently the best way to handle #2075. The two comments I added are minutiae and not necessarily important.
m_prev = mult(tm_prev, cm_prev) | ||
m = mult(tm_matrix, cm_matrix) | ||
orientation = orient(m) | ||
delta_x = m[4] - m_prev[4] | ||
delta_y = m[5] - m_prev[5] | ||
k = math.sqrt(abs(m[0] * m[3]) + abs(m[1] * m[2])) | ||
f = font_size * k | ||
cm_prev = m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look quite right. I think this line can however be safely removed, because cm_prev
doesn't get accessed until it is being assigned to again at the end of this function (i.e. this is currently a noop).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree.
Good point to have noticed it . Actually I've forgot to pass the multiplied cm_matrix to the visitor
pypdf/_page.py
Outdated
@@ -1946,7 +1946,10 @@ def _extract_text( | |||
1.0, | |||
0.0, | |||
0.0, | |||
] # will store previous tm_matrix | |||
] # will store cm_matrix * tm_matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the old comment was more correct, right? This change was probably a side effect of the revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct we store inhere only tm_matrix
@troethe @srogmann |
@pubpub-zz I also have no example of cm_matrix without tm_matrix at hand. PDF-reference, "5.3.3 Text Space Details", uses the product of cm_matrix and tm_matrix you mentioned. An argument to stay at cm_matrix instead of the product might be the performance if the product is computed but not used. In that case the publish of a mult()-function might be useful. In text-extraction-test 3 of tests/test_page.py I used visitor_operand_after to verify the Td-operation (see visitor_td). The return of the operator by visit_operand_before() can support patches or studies like the one in #1389 (comment). Can or should the operand-list be modified inside visit_operand_before() by using append(...), clear(...), ...? Otherwise it would be better to return the tuple (operator, operands). |
@MartinThoma |
I don't have an opinion on this topic. To be honest, I haven't read the specs so I need to trust you on those changes. If two of you agree that this PR is good (and CI is green), then I would merge :-) |
@MartinThoma / @srogmann |
Plus stdby |
I take this as "please standby", so I should not merge? |
Correct interpretation |
@MartinThoma |
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
The cm/tm parts of the visitor functions are really hard to get right. Thank you for taking care of it @pubpub-zz 🙏 |
## What's new ### Bug Fixes (BUG) - invalid cm/tm in visitor functions (#2206) by @pubpub-zz - Encrypt / decrypt Stream object dictionaries (#2228) by @pilotandy - Support nested color spaces for the /DeviceN color space (#2241) by @Stefan - images property fails if NullObject in list (#2215) by @pubpub-zz ### Robustness (ROB) - XYZ destination to cope with missing left and top param (#2237) by @pubpub-zz ### Documentation (DOC) - Add pilotandy for #2228 as a contributor by @MartinThoma - Link to CONTRIBUTING.md in README.md (#2232) by @markpfeifle - Changelog by @MartinThoma ### Developer Experience (DEV) - Exclude tests from mypy checks by @MartinThoma - Unify mypy options and warn redundant workarounds (#2223) by @exiledkingcc - Stabilize Pillow test with Pillow missing (#2229) by @Stefan ### Maintenance (MAINT) - Update pinned packages (#2243) by @MartinThoma ### Testing (TST) - Regression test against non-deterministic accidental object reuse (#2244) by @MartinThoma [Full Changelog](3.16.2...3.16.3)
closes #2200
closes #2075
reworks and is still valid to close #2059