Centered text on different lines don't get grouped into text boxes #382

jstockwin · 2020-03-03T10:37:34Z

Is your feature request related to a problem? Please describe.

When extracting text from the example PDF (example code below), you can see that "Text 1" and "Text 2" are grouped into one text box and "Text 3" and "Text 4" into another, but "Long Text 1" is not grouped with either.

If the text were left-justified or right-justified, all the lines would be merged into one text box.

Why does this happen?
This happens because of how the find_neighbors function decides which lines to group.

In particular, lines which are within the line margin (the plane.find call) are considered for grouping. A candidate is grouped only if the two lines have a similar height (abs(obj.height-self.height) < d) and are either left-justified, i.e. their left ends are in a similar place (abs(obj.x0-self.x0) < d), or right-justified (abs(obj.x1-self.x1) < d).
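The condition described above can be modelled without pdfminer itself. The following is a minimal sketch, not the library's actual code: `Line` and `is_neighbor` are hypothetical names, the vertical proximity test (the plane.find step) is left out, and the coordinates are made up to mimic left-justified and centered lines.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a horizontal text line (not a pdfminer class)."""
    x0: float  # left edge
    x1: float  # right edge
    height: float


def is_neighbor(a: Line, b: Line, ratio: float) -> bool:
    """Similar height AND (left-justified OR right-justified), as described above."""
    d = ratio * a.height
    similar_height = abs(b.height - a.height) < d
    left_aligned = abs(b.x0 - a.x0) < d
    right_aligned = abs(b.x1 - a.x1) < d
    return similar_height and (left_aligned or right_aligned)


# Made-up coordinates: a left-justified pair, and a short/long centered pair.
left_a = Line(x0=100, x1=140, height=12)
left_b = Line(x0=100, x1=200, height=12)
centered_short = Line(x0=180, x1=220, height=12)
centered_long = Line(x0=150, x1=250, height=12)

left_pair_grouped = is_neighbor(left_a, left_b, ratio=0.5)
centered_pair_grouped = is_neighbor(centered_short, centered_long, ratio=0.5)
print(left_pair_grouped, centered_pair_grouped)  # True False
```

The centered pair shares a midpoint at x = 200, but neither its left edges nor its right edges are within d of each other, so the check rejects it.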

Describe the solution you'd like

It might make sense for centered text to become grouped too.

This could be achieved by adding an additional check to see if the centers of the elements are aligned, something like abs((obj.x0 + obj.x1)/2 - (self.x0 + self.x1)/2) < d.
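As a rough sketch of that check in isolation (the helper name `is_centered_with` is hypothetical and the coordinates are made up), comparing midpoints accepts lines of different widths that share a center and rejects lines whose centers drift apart by d or more:

```python
def is_centered_with(x0_a, x1_a, x0_b, x1_b, d):
    """Return True if the horizontal midpoints of two lines are closer than d."""
    return abs((x0_b + x1_b) / 2 - (x0_a + x1_a) / 2) < d


d = 0.5 * 12  # tolerance for a 12-unit-high line with a ratio of 0.5

# A short and a long line, both centered on x = 200.
same_center = is_centered_with(180, 220, 150, 250, d)
# The long line is shifted right, so the centers are at 200 and 210.
shifted_center = is_centered_with(180, 220, 160, 260, d)

print(same_center, shifted_center)  # True False
```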

I'd be very happy to PR a fix for this, but thought I'd check it was something you'd like first.

Describe alternatives you've considered
None.

Additional notes

It feels a bit strange to me that d = ratio*self.height where ratio=laparams.line_margin is being used as a measure of how close things on the x axis need to be. Could this be explained? My current guess is that it's simply to avoid needing a new parameter?
The above is all posed in relation to horizontal lines, but the equivalent change would be made to the vertical case.
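To make the coupling concrete, here is a small sketch of how the horizontal tolerance follows from the vertical setting. The function name `alignment_tolerance` is hypothetical; the default of 0.5 matches the default line_margin in LAParams, and the line heights are made up.

```python
def alignment_tolerance(line_height, line_margin=0.5):
    """d = ratio * self.height, with ratio taken from laparams.line_margin."""
    return line_margin * line_height


# The x-axis tolerance grows with the line height...
tolerances = {h: alignment_tolerance(h) for h in (8.0, 12.0, 24.0)}
print(tolerances)  # {8.0: 4.0, 12.0: 6.0, 24.0: 12.0}

# ...and collapses to zero when line_margin is zero, so a strict
# "abs(...) < d" comparison can never succeed, even for identical edges.
zero_margin = alignment_tolerance(12.0, line_margin=0.0)
print(abs(100 - 100) < zero_margin)  # False
```

Under this reading, changing line_margin to tune vertical grouping also changes how strictly horizontal alignment is judged, which is the side effect questioned above.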

Example code

from pdfminer import converter, pdfdocument, pdfinterp, pdfpage, pdfparser
from pdfminer.layout import LTTextContainer, LAParams

path_to_file = "centered_text_example.pdf"

# Collect text boxes from every page, not just the last one processed.
elements = []

with open(path_to_file, "rb") as pdf_file:
    parser = pdfparser.PDFParser(pdf_file)
    document = pdfdocument.PDFDocument(parser)
    resource_manager = pdfinterp.PDFResourceManager()
    device = converter.PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = pdfinterp.PDFPageInterpreter(resource_manager, device)
    for page in pdfpage.PDFPage.create_pages(document):
        interpreter.process_page(page)
        # get_result() returns the laid-out page; keep only its text containers.
        results = device.get_result()
        elements.extend(
            element for element in results if isinstance(element, LTTextContainer)
        )
    device.close()

print(elements)


pietermarsman · 2020-03-14T13:48:30Z

> I'd be very happy to PR a fix for this, but thought I'd check it was something you'd like first.

This is a valuable addition to pdfminer.six :)

> It feels a bit strange to me that d = ratio*self.height where ratio=laparams.line_margin is being used as a measure of how close things on the x axis need to be. Could this be explained? My current guess is that it's simply to avoid needing a new parameter?

I agree. In #383 I suggested using d = self.height instead.

Closes pdfminer#382

…right-aligned text lines (#382) (#384)

* Group text lines if they are centered (#382) Closes #382
* Add comparison private methods to LTTextLines
* Add missing docstrings
* Add tests for find_neighbors
* Update changelog
* Cosmetic changes from code review

jstockwin mentioned this issue Mar 3, 2020

Allow negative line margin #383

Closed

jstockwin mentioned this issue Mar 4, 2020

Group text lines if they are centered (#382) #384

Merged

5 tasks

pietermarsman added the component: converter Related to any PDFLayoutAnalyzer label Mar 10, 2020

jstockwin added a commit to jstockwin/pdfminer.six that referenced this issue Mar 16, 2020

Group text lines if they are centered (pdfminer#382)

c0b1fe6

Closes pdfminer#382

pietermarsman closed this as completed in #384 Mar 23, 2020
