BUG: Incorrect text matrix passed to visitor_text in page.extract_text #2059
Labels
is-bug
From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF
workflow-advanced-text-extraction
Getting coordinates, font weight, font type, ...
When supplying page.extract_text with a visitor_text function, the callback is invoked with an incorrect tm_matrix (text matrix) parameter.
I believe this is because here, the visitor is called with the new tm_matrix, even though at that point the text matrix may already have been changed by the operator currently being handled.
Environment
Running the current main branch of pypdf on Debian.
$ python -m platform
Linux-5.10.0-23-amd64-x86_64-with-glibc2.31
$ python -c "import pypdf;print(pypdf.__version__)"
3.14.0
Code + PDF
Here is an example script that converts a testdoc.pdf to SVG and logs both the operators on the page and the calls to the visitor with their x, y coordinates.
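For readers who want to see what such a visitor looks like, here is a minimal sketch. It is not the script attached to this report. It assumes the visitor_text callback signature (text, cm, tm, font_dict, font_size) and reads x, y from the last two entries of tm, ignoring cm for simplicity. The names make_logging_visitor and log_text_positions are hypothetical.

```python
from typing import Any, Callable, List, Sequence, Tuple

Record = Tuple[str, float, float]  # (text, x, y)


def make_logging_visitor(records: List[Record]) -> Callable[..., None]:
    """Return a visitor that stores (text, x, y) for every non-blank chunk."""

    def visitor(text: str, cm: Sequence[float], tm: Sequence[float],
                font_dict: Any, font_size: float) -> None:
        if text.strip():
            # tm is [a, b, c, d, e, f]; e and f are the translation part.
            # Assumption: cm is ignored here, so these are text-space values.
            records.append((text, float(tm[4]), float(tm[5])))

    return visitor


def log_text_positions(pdf_path: str, page_index: int = 0) -> List[Record]:
    """Run extract_text on one page and return what the visitor received."""
    # Imported lazily so the visitor itself can be exercised without pypdf.
    from pypdf import PdfReader

    records: List[Record] = []
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=make_logging_visitor(records))
    return records
```

Calling log_text_positions("testdoc.pdf") would return the list of records; with the behavior described in this issue, the y value stored for a line may be the one belonging to the line after it.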
Output:
As can be seen (also in the resulting SVG), the coordinates of the line "A B C" are affected by the Td that follows it, because that is when crlf_space_check detects a new line. It then, however, passes the already altered tm_matrix to the visitor, instead of the one that was active before the operator was applied.
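The ordering problem described above can be reproduced in a toy model that does not use pypdf at all. This is only a sketch under simplifying assumptions: it tracks a single matrix (a real PDF interpreter keeps a separate text line matrix), it supports only Tm, Td and Tj, and the function names apply_td and run are hypothetical. Buffered text is flushed to the "visitor" when a positioning operator arrives, and the flag selects whether the matrix from before or after that operator is reported.

```python
from typing import List, Sequence, Tuple

Matrix = List[float]  # [a, b, c, d, e, f]
Call = Tuple[str, float, float]  # (text, x, y) as seen by the visitor


def apply_td(tm: Matrix, tx: float, ty: float) -> Matrix:
    """Translate the matrix by (tx, ty) in text space."""
    a, b, c, d, e, f = tm
    return [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]


def run(ops: Sequence[tuple], pass_matrix_before_op: bool) -> List[Call]:
    """Interpret ops and record what a visitor would be told."""
    tm: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
    pending = ""  # text shown since the last flush
    calls: List[Call] = []
    for op in ops:
        name, args = op[0], op[1:]
        if name == "Tj":
            pending += args[0]
            continue
        before = list(tm)  # snapshot taken before the operator is applied
        if name == "Tm":
            tm = [float(v) for v in args]
        elif name == "Td":
            tm = apply_td(tm, *args)
        else:
            raise ValueError(f"unsupported operator: {name}")
        if pending:
            # The reported behavior corresponds to using the updated matrix.
            reported = before if pass_matrix_before_op else tm
            calls.append((pending, reported[4], reported[5]))
            pending = ""
    if pending:
        calls.append((pending, tm[4], tm[5]))
    return calls


ops = [("Tm", 1, 0, 0, 1, 50, 700), ("Tj", "A B C"),
       ("Td", 0, -20), ("Tj", "D E F")]
print(run(ops, pass_matrix_before_op=False))  # both lines reported at y=680
print(run(ops, pass_matrix_before_op=True))   # "A B C" reported at y=700
```

In this model, reporting the snapshot taken before the operator is applied gives "A B C" its own position, which matches the expectation stated in this report.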