PageLayout to_s() merges TextRuns that overlap #290
Comments
I ran into the same problem. I know this isn't a solution, but since I only needed a one-off for one file, https://github.com/yomurb/yomu handled the rendering of the overlapping text for me.
I've got some logic for handling overlapping characters nearly ready to merge in #299. However, for now it's only throwing away identical characters that overlap. Maybe we could add an option to Alternatively, we could add some logic that throws away invisible characters. The work in #301 would help with this - it should allow recording an alpha value on each TextRun.
I'm very open to this. It'd be a nice "escape hatch" for folks who find |
yob, thank you for the great, clean code for reading PDFs. I have a question that relates to this issue.
instead of
Because the current code gets the height of the font size.
Hi @akiotajima, that's quite possible. I'm a bit hazy on why past-me structured the code that way. I tried changing it and a number of tests fail. It's possible the tests are wrong too, but failing tests mean it's not a straightforward change. Do you have a sample PDF that renders correctly with your change? I'm also interested in how this relates to the current issue. I guess incorrect horizontal displacement could result in some characters overlapping when they shouldn't?
Hi. The example text below (ABC123) is pseudo text, because the original is printed in Japanese characters. For the exact example, the page.content_raw is below.
The text in the narrow font (<0BBE... 7AA>) is scaled by 4.44 according to the first Tm command, and the text in the normal font (<022A00... ) is scaled by 6 according to the second Tm command. I then inserted a debug line in PageTextReader#internal_show_text as
I got below output.
The x positions of the TextRuns are why I got '関(0.2%)' instead of '関連業(0.2%)'. However, I'm not certain why you wrote PageState#font_size as it is; for example, many English PDFs change the font height, so it should handle that.
Thanks for the extra details @akiotajima. To be honest, that code was written so long ago that I've forgotten exactly what I was thinking at the time. I also have very little experience with PDFs that have Japanese text, so it's quite likely there's a bug that is more significant to Japanese text than English text. I'm happy to accept pull requests if you have the time to put one together. Ideally, it'd be great to have a spec in If you want to continue the discussion, can we move it to a new GitHub issue? This issue is primarily about characters where the PDF file intentionally overlaps them; however, it sounds like you've hit a bug where characters are overlapping when they shouldn't be.
Hi yob, it would be my pleasure to create a pull request. However, for this issue, I currently have no tool for creating PDFs and little knowledge of how to create one by hand for the specs.
Absolutely, I'm in no hurry. I'd offer to help create a sample PDF, but it may be difficult without reading Japanese. I understand it may not be possible due to privacy reasons, but could we take the file you have and strip it back to a single page with the characters you've mentioned (by editing the file and deleting everything else on the page)? If we're lucky, that might leave us with a small PDF that exhibits the issue.
I notice that some PDFs have extra, apparently spurious text in them, eg some bank statements (presumably the bank puts it in to make the statements hard to parse).
An example is where you have a transparent text run of '6' in one text run and an amount of say '50.00' in a text run that overlaps the '6'. PDF Reader's Page text() method outputs these two as 650.00, so it incorrectly looks like the amount is $650 instead of $50. The overlap also occurs when the '6' ends in the column immediately before the '50.00'.
If I view the PDF in Evince, the spurious text is rendered transparently, so the document looks fine unless I select the text for copy and paste. In the pasted output, the two strings appear with a space between them, ie '6 50.00'. So it's not ideal, but at least you can recognise that the amount is $50 and not $650.
The PageLayout to_s method is doing the hard work of mapping the TextRun objects and rendering them to a string. It calls local_string_insert to insert each text at its x_pos and y_pos (x_pos and y_pos are converted into columns from the raw x and y coords).
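To illustrate why the two runs fuse, here is a minimal sketch of column-based insertion. It is not pdf-reader's actual code: `Run` and `render_row` are hypothetical names, and the columns (10 and 11) are made-up values standing in for the converted x positions.

```ruby
# Hypothetical stand-in for a text run that has already been mapped to a column.
Run = Struct.new(:col, :text)

# Write each run into a fixed-width row at its start column, overwriting
# whatever is already there. This only mirrors the idea of inserting text
# at x_pos; it is an assumption, not the library's implementation.
def render_row(runs, width)
  row = " " * width
  runs.each do |run|
    next if run.col >= width # runs starting beyond the page width are dropped
    row[run.col, run.text.length] = run.text
  end
  row.rstrip
end

# A spurious '6' ending in the column immediately before '50.00'.
runs = [Run.new(10, "6"), Run.new(11, "50.00")]
puts render_row(runs, 20).strip # prints "650.00"
```

With nothing separating adjacent columns, the reader of the output cannot tell that '6' and '50.00' came from different text runs.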
Brainstorming, there might be a couple of ways around this:
I have tried moving the text runs that overlap prior to calling PageLayout's to_s method (eg at the end of PageLayout's initialize method) to ensure that there is at least one column between them. This fixes the issue - I get 6 50.00 instead of 650.00, so it matches how Evince works. I did it by grouping the text runs into a hash of { y column => [ ary of TextRun ] } and then sorting each ary by its start x column. Then I check for overlap by comparing the endx column of one text run against the following text run's x column. The disadvantage of doing this is that you could potentially lose text off the right hand side of the page, because to_s checks that the text run starts within the expected number of columns on the page before inserting it. Maybe we could remove that check so the text isn't lost.
We could add an alternative method to page.text() that returns the TextRun objects directly, eg as a hash of { y_column => [ ary of TextRun ] } or as an Array of [ ary of TextRun ]. If the TextRun object had methods to return its x_col and endx_col as well as the raw x and endx, the caller could figure out for themself where they are located on the page. (As a side benefit, the caller could also see the TextRun attributes like font_size and width. We could even make the TextRun store its font so the user could see which font is applicable, which might help with Parse font of given text #272.)
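The grouping and overlap check from the first idea can be sketched roughly as follows. This is an illustration under assumptions, not the patch I actually used: `ColRun`, `group_by_row` and `separate_runs` are hypothetical names, and the rows and columns are assumed to be already computed from the raw coordinates. The grouped hash is also the shape the second idea would hand back to the caller.

```ruby
# Hypothetical stand-in: a run whose row and start column are already known.
ColRun = Struct.new(:row, :col, :text) do
  # Last column occupied by this run.
  def end_col
    col + text.length - 1
  end
end

# Build { row => [runs sorted by start column] }.
def group_by_row(runs)
  runs.group_by(&:row).transform_values { |ary| ary.sort_by(&:col) }
end

# Shift any run that overlaps or touches the previous run in its row so
# that at least `gap` blank columns separate them.
def separate_runs(runs, gap: 1)
  group_by_row(runs).transform_values do |row_runs|
    prev_end = nil
    row_runs.map do |run|
      min_col = prev_end ? prev_end + gap + 1 : run.col
      placed = ColRun.new(run.row, [run.col, min_col].max, run.text)
      prev_end = placed.end_col
      placed
    end
  end
end

rows = separate_runs([ColRun.new(3, 11, "50.00"), ColRun.new(3, 10, "6")])
rows[3].each { |r| puts "#{r.col}: #{r.text}" }
```

Here the '6' stays at column 10 and '50.00' moves from column 11 to 12, leaving one blank column between them. Note that shifting runs to the right is exactly what can push text past the page width, as described above.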
This might also be related to #43.
I'm using pdf-reader v2.2.0.