Centered text on different lines don't get grouped into text boxes #382

Closed
jstockwin opened this issue Mar 3, 2020 · 1 comment · Fixed by #384
Labels
component: converter Related to any PDFLayoutAnalyzer

Comments

jstockwin (Member) commented Mar 3, 2020

Is your feature request related to a problem? Please describe.

Example PDF here

When extracting text from the example PDF (example code below), "Text 1" and "Text 2" are grouped into one text box and "Text 3" and "Text 4" are grouped into another, but "Long Text 1" is not grouped with either pair.

If the text were left-justified or right-justified, all of these lines would be merged into one text box.

Why does this happen?
This happens because of the find_neighbors function.

Lines that fall within the line margin (the plane.find call) are considered for grouping. A candidate is grouped with the current line only if the two lines have a similar height (abs(obj.height-self.height) < d) and either their left edges line up, i.e. the text is left-justified (abs(obj.x0-self.x0) < d), or their right edges line up, i.e. the text is right-justified (abs(obj.x1-self.x1) < d). Centered lines of different lengths satisfy neither edge condition, so they are not grouped.
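To make the conditions above concrete, here is a minimal sketch of the check as described in this issue. It is not the pdfminer.six source: `Line` and `is_neighbor` are hypothetical names, the coordinates are made up, and the vertical proximity step (the plane.find call) is left out. The tolerance uses d = ratio*self.height, as noted under "Additional notes".

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a horizontal text line (not a pdfminer class)."""
    x0: float      # left edge
    x1: float      # right edge
    height: float


def is_neighbor(self_line: Line, obj: Line, ratio: float) -> bool:
    """Apply the height and left/right alignment checks described in the issue."""
    d = ratio * self_line.height
    same_height = abs(obj.height - self_line.height) < d
    left_aligned = abs(obj.x0 - self_line.x0) < d
    right_aligned = abs(obj.x1 - self_line.x1) < d
    return same_height and (left_aligned or right_aligned)


# Two centered lines of different lengths (made-up coordinates, same center x=50).
short_line = Line(x0=40.0, x1=60.0, height=10.0)
long_line = Line(x0=20.0, x1=80.0, height=10.0)

# Neither the left nor the right edges are within d = 5.0, so no grouping.
print(is_neighbor(short_line, long_line, ratio=0.5))  # False
```

With these numbers both edges differ by 20.0, well above the tolerance of 5.0, which is why a long centered line is left out even though it sits directly above or below the shorter ones.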

Describe the solution you'd like

It might make sense for centered text to be grouped too.

This could be achieved by adding a check for whether the centers of the elements are aligned, something like abs((obj.x0 + obj.x1)/2 - (self.x0 + self.x1)/2) < d.
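The proposed check can be sketched in isolation as follows. This is only an illustration of the formula above, not the change that was eventually merged: `Line` and `is_centrally_aligned` are hypothetical names and the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a horizontal text line (not a pdfminer class)."""
    x0: float
    x1: float
    height: float


def is_centrally_aligned(self_line: Line, obj: Line, d: float) -> bool:
    """Return True if the horizontal midpoints of the two lines are within d."""
    self_center = (self_line.x0 + self_line.x1) / 2
    obj_center = (obj.x0 + obj.x1) / 2
    return abs(obj_center - self_center) < d


short_line = Line(x0=40.0, x1=60.0, height=10.0)   # center at x=50
long_line = Line(x0=20.0, x1=80.0, height=10.0)    # center at x=50
far_line = Line(x0=100.0, x1=160.0, height=10.0)   # center at x=130

print(is_centrally_aligned(short_line, long_line, d=5.0))  # True
print(is_centrally_aligned(short_line, far_line, d=5.0))   # False
```

In a real fix this condition would be OR-ed with the existing left and right edge checks, so left- and right-justified text keeps its current behaviour.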

I'd be very happy to PR a fix for this, but thought I'd check it was something you'd like first.

Describe alternatives you've considered
None.

Additional notes

  • It feels a bit strange to me that d = ratio*self.height where ratio=laparams.line_margin is being used as a measure of how close things on the x axis need to be. Could this be explained? My current guess is that it's simply to avoid needing a new parameter?

  • The above is all posed in relation to horizontal lines, but the equivalent change would be made to the vertical case.
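For the vertical case mentioned in the second note, a sketch of the equivalent check might look like the following. It assumes the vertical logic mirrors the horizontal one with x swapped for y and height swapped for width; that mirroring, the `VerticalLine` class, and the function name are all assumptions made for illustration, not taken from the pdfminer.six source.

```python
from dataclasses import dataclass


@dataclass
class VerticalLine:
    """Hypothetical stand-in for a vertical text line (not a pdfminer class)."""
    y0: float      # lower edge
    y1: float      # upper edge
    width: float


def is_vertical_neighbor(self_line: VerticalLine, obj: VerticalLine, ratio: float) -> bool:
    """Mirror of the horizontal check, including the proposed centered case."""
    # Assumption: the tolerance scales with line width, mirroring ratio * height.
    d = ratio * self_line.width
    same_width = abs(obj.width - self_line.width) < d
    lower_aligned = abs(obj.y0 - self_line.y0) < d
    upper_aligned = abs(obj.y1 - self_line.y1) < d
    centrally_aligned = (
        abs((obj.y0 + obj.y1) / 2 - (self_line.y0 + self_line.y1) / 2) < d
    )
    return same_width and (lower_aligned or upper_aligned or centrally_aligned)


short_col = VerticalLine(y0=40.0, y1=60.0, width=10.0)   # center at y=50
long_col = VerticalLine(y0=20.0, y1=80.0, width=10.0)    # center at y=50
far_col = VerticalLine(y0=100.0, y1=160.0, width=10.0)   # center at y=130

print(is_vertical_neighbor(short_col, long_col, ratio=0.5))  # True, via the center check
print(is_vertical_neighbor(short_col, far_col, ratio=0.5))   # False
```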

Example code

from pdfminer import converter, pdfdocument, pdfinterp, pdfpage, pdfparser
from pdfminer.layout import LAParams, LTTextContainer

path_to_file = "centered_text_example.pdf"

# Text boxes collected from every page of the document.
elements = []

with open(path_to_file, "rb") as pdf_file:
    parser = pdfparser.PDFParser(pdf_file)
    document = pdfdocument.PDFDocument(parser)
    resource_manager = pdfinterp.PDFResourceManager()
    # Layout analysis runs with the default LAParams.
    device = converter.PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = pdfinterp.PDFPageInterpreter(resource_manager, device)
    for page in pdfpage.PDFPage.create_pages(document):
        interpreter.process_page(page)
        results = device.get_result()
        # Keep only the text containers; extend so earlier pages are not overwritten.
        elements.extend(
            element for element in results if isinstance(element, LTTextContainer)
        )
    device.close()

print(elements)

pietermarsman (Member) commented:

I'd be very happy to PR a fix for this, but thought I'd check it was something you'd like first.

This is a valuable addition to pdfminer.six :)

It feels a bit strange to me that d = ratio*self.height where ratio=laparams.line_margin is being used as a measure of how close things on the x axis need to be. Could this be explained? My current guess is that it's simply to avoid needing a new parameter?

I agree. In #383 I suggested using d = self.height instead.
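As a rough illustration of how the two tolerance definitions differ, the sketch below compares d = ratio*self.height against d = self.height for one made-up line. The function names and the numbers are hypothetical; the value 0.5 for line_margin is the LAParams default.

```python
def tolerance_scaled(height: float, line_margin: float) -> float:
    """Tolerance as described in the issue: d = ratio * self.height."""
    return line_margin * height


def tolerance_height_only(height: float) -> float:
    """Tolerance as suggested in the reply: d = self.height."""
    return height


height = 10.0          # made-up line height
center_offset = 7.0    # made-up distance between two line centers

# With line_margin = 0.5 the tolerance is 5.0, so an offset of 7.0 is rejected.
print(center_offset < tolerance_scaled(height, line_margin=0.5))  # False

# With d = self.height the tolerance is 10.0, so the same offset is accepted.
print(center_offset < tolerance_height_only(height))  # True
```

The second definition decouples the horizontal tolerance from line_margin, so tuning the vertical line spacing no longer changes how strictly edges or centers must line up.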

jstockwin added a commit to jstockwin/pdfminer.six that referenced this issue Mar 16, 2020
pietermarsman pushed a commit that referenced this issue Mar 23, 2020
…right-aligned text lines (#382) (#384)

* Group text lines if they are centered (#382)

Closes #382

* Add comparison private methods to LTTextLines

* Add missing docstrings

* Add tests for find_neighbors

* Update changelog

* Cosmetic changes from code review