When extracting text from the example PDF (example code below), "Text 1" and "Text 2" are grouped into one text box and "Text 3" and "Text 4" are grouped into another, but "Long Text 1" is not.
If the text were left-justified or right-justified, all of these lines would be merged into one text box.
Why does this happen?
This happens because of the find_neighbors function.
In particular, lines that fall within the line margin (the plane.find bit) are considered for grouping. They are grouped if the lines are of similar height (abs(obj.height-self.height) < d) and either their left edges are close, i.e. the text is left-justified (abs(obj.x0-self.x0) < d), or their right edges are close, i.e. the text is right-justified (abs(obj.x1-self.x1) < d).
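The rule described above can be sketched as a standalone function. This is a simplified illustration, not pdfminer.six's actual code: the `Line` class and `is_neighbor` function are hypothetical stand-ins, the vertical proximity search (the plane.find step) is left out, and the coordinates are made up. The default of 0.5 matches the default value of `LAParams().line_margin`.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a horizontal text line's bounding box."""
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # bottom edge
    y1: float  # top edge

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def is_neighbor(self_line: Line, obj: Line, line_margin: float = 0.5) -> bool:
    """Apply the height check and the left/right alignment checks."""
    d = line_margin * self_line.height
    similar_height = abs(obj.height - self_line.height) < d
    left_aligned = abs(obj.x0 - self_line.x0) < d
    right_aligned = abs(obj.x1 - self_line.x1) < d
    return similar_height and (left_aligned or right_aligned)


# Made-up coordinates: all lines are 12 units high, so d = 6.
short = Line(x0=100, x1=200, y0=700, y1=712)
left_justified = Line(x0=100, x1=260, y0=686, y1=698)
centered = Line(x0=60, x1=240, y0=686, y1=698)

print(is_neighbor(short, left_justified))  # True: left edges coincide
print(is_neighbor(short, centered))        # False: both edges are 40 units apart
```

The second call shows the gap: the two lines share a midpoint at x = 150, but neither edge is within d, so they are not treated as neighbors.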
Describe the solution you'd like
It might make sense for centered text to become grouped too.
This could be achieved by adding an additional check to see if the centers of the elements are aligned, something like abs((obj.x0 + obj.x1)/2 - (self.x0 + self.x1)/2) < d.
I'd be very happy to PR a fix for this, but thought I'd check it was something you'd like first.
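As a rough sketch of the proposed check in isolation (the function name `centers_aligned` and the coordinates are hypothetical, and this is not necessarily how a fix would be structured inside pdfminer.six):

```python
def centers_aligned(self_x0: float, self_x1: float,
                    obj_x0: float, obj_x1: float, d: float) -> bool:
    """Return True if the horizontal midpoints are within tolerance d."""
    self_center = (self_x0 + self_x1) / 2
    obj_center = (obj_x0 + obj_x1) / 2
    return abs(obj_center - self_center) < d


# Same tolerance as the existing checks: line_margin * line height.
d = 0.5 * 12

# A short line and a longer line, both centered on x = 150.
print(centers_aligned(100, 200, 60, 240, d))   # True: midpoints coincide

# A left-justified pair: midpoints are 150 and 180.
print(centers_aligned(100, 200, 100, 260, d))  # False: midpoints are 30 units apart
```

This check would be combined with the existing left-edge and right-edge checks using `or`, so text that is already grouped today would stay grouped.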
Describe alternatives you've considered
None.
Additional notes
It feels a bit strange to me that d = ratio*self.height where ratio=laparams.line_margin is being used as a measure of how close things on the x axis need to be. Could this be explained? My current guess is that it's simply to avoid needing a new parameter?
The above is all posed in relation to horizontal lines, but the equivalent change would be made to the vertical case.
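To make the first note above concrete, here is a small sketch of how the horizontal tolerance behaves when it is derived from line_margin. The helper `x_tolerance` is hypothetical and only reproduces the arithmetic d = ratio*self.height; the line heights are made up.

```python
def x_tolerance(line_height: float, line_margin: float = 0.5) -> float:
    """d = ratio * self.height, with ratio taken from line_margin."""
    return line_margin * line_height


# Tolerance grows with font size at the default line_margin of 0.5.
for height in (8, 12, 24):
    print(f"height={height}: d={x_tolerance(height)}")

# Raising line_margin to merge lines that are further apart vertically
# also loosens the horizontal alignment tolerance.
print(x_tolerance(12, line_margin=1.0))  # 12.0
```

So a single parameter controls both how far apart lines may be vertically and how closely their edges must line up horizontally.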
Example code
from pdfminer import converter, pdfdocument, pdfinterp, pdfpage, pdfparser
from pdfminer.layout import LTTextContainer, LAParams

path_to_file = "centered_text_example.pdf"

with open(path_to_file, "rb") as pdf_file:
    parser = pdfparser.PDFParser(pdf_file)
    document = pdfdocument.PDFDocument(parser)
    resource_manager = pdfinterp.PDFResourceManager()
    device = converter.PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = pdfinterp.PDFPageInterpreter(resource_manager, device)

    for page in pdfpage.PDFPage.create_pages(document):
        interpreter.process_page(page)
        results = device.get_result()
        elements = [
            element for element in results if isinstance(element, LTTextContainer)
        ]

    device.close()

print(elements)
> I'd be very happy to PR a fix for this, but thought I'd check it was something you'd like first.

This is a valuable addition to pdfminer.six :)

> It feels a bit strange to me that d = ratio*self.height where ratio=laparams.line_margin is being used as a measure of how close things on the x axis need to be. Could this be explained? My current guess is that it's simply to avoid needing a new parameter?

I agree. In #383 I suggested using d = self.height instead.
jstockwin added a commit to jstockwin/pdfminer.six that referenced this issue on Mar 16, 2020:
…right-aligned text lines (#382) (#384)
* Group text lines if they are centered (#382)
Closes #382
* Add comparison private methods to LTTextLines
* Add missing docstrings
* Add tests for find_neighbors
* Update changelog
* Cosmetic changes from code review