Lines of text are sometimes split into two #3653
Comments
This is no bug.
Oh ok, my bad, sorry for the inconvenience (and thank you for the script)!
No problem at all. The question was absolutely justified, and a reproducing PDF was attached.
Hi @JorjMcKie, I have reviewed your recovered-lines script. My question is how to use it: does it edit the PDF with the recovered lines, or do we need to call the recovered-lines function after reading the PDF? I need to recover the lines and make them a single block in the PDF, so that I can later save all the properties in a data frame. Below is my data frame object. Could you please help with saving the block text as recovered lines with HTML tags applied? Step 5: Group by block_id and concatenate HTML-formatted text
Description of the bug
In the same context as #3650, I am parsing the tables of contents of many French PDF documents.
I'm not sure if it's a real "bug", but sometimes, in some documents, blocks of text are separated into two distinct lines, while other similar blocks are returned unified.
How to reproduce the bug
Here is an example with this file (on its second page): 2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf
And here is the corresponding part of the PDF:
As you can see, the first item spans two lines, and get_text() returns it whole with just a line break, whereas the second one is returned as two separate blocks.
But, well, I'm not sure it's a real "bug", sorry if it's a false report...
In the meantime, just in case someone has the same problem, I partially fixed it by iterating over the blocks and, for each one, checking its bbox to see if it's close enough to the bbox of the previous block (comparing last_block[y1] with block[y0]), and merging both if they're separated by less than 2 pt.
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.11
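For anyone wanting to try the workaround described above (merging a block into the previous one when their bboxes are less than 2 pt apart vertically), here is a minimal sketch. It is not the original poster's code. It assumes the blocks are tuples in the layout returned by page.get_text("blocks"), i.e. (x0, y0, x1, y1, text, block_no, block_type), and the names merge_close_blocks and max_gap are hypothetical.

```python
from typing import List, Tuple

# Assumed tuple layout, as returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 means text.
Block = Tuple[float, float, float, float, str, int, int]


def merge_close_blocks(blocks: List[Block], max_gap: float = 2.0) -> List[Block]:
    """Merge each text block into the previous one when the vertical
    distance between them is below max_gap points (hypothetical helper)."""
    merged: List[Block] = []
    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        if merged and block_type == 0 and merged[-1][6] == 0:
            px0, py0, px1, py1, ptext, pno, ptype = merged[-1]
            # Compare the bottom of the previous block (y1)
            # with the top of the current block (y0).
            if abs(y0 - py1) < max_gap:
                merged[-1] = (
                    min(px0, x0),
                    min(py0, y0),
                    max(px1, x1),
                    max(py1, y1),
                    # Join the two texts with a single line break.
                    ptext.rstrip("\n") + "\n" + text,
                    pno,  # keep the block number of the first block
                    ptype,
                )
                continue
        merged.append(block)
    return merged
```

With PyMuPDF the input would typically come from page.get_text("blocks") on the page in question. The 2 pt threshold is the value mentioned in the report and will likely need tuning per document; this sketch also only looks at vertical distance, so it may wrongly merge blocks in multi-column layouts.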