That's actually an issue on the original repo (as seen here) that I stumbled upon. A PR with a fix also exists here.
Should I recreate that PR in here?
EDIT - original issue message by @jeresch for the sake of better organization:
When parsing a document, I am getting an exception with the following traceback:
Traceback (most recent call last):
  File "./main.py", line 748, in parseEachProfile
    self.people.append(parser.parse(None))
  File "./main.py", line 322, in parse
    poslist = parser.parsepdf2(filename)
  File "/Users/____/Dropbox/___/Parser/pdfparsing.py", line 167, in parsepdf2
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 852, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 877, in execute
    func(*args)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 760, in do_TJ
    self.device.render_string(self.textstate, seq)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfdevice.py", line 82, in render_string
    scaling, charspace, wordspace, rise, dxscale)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfdevice.py", line 98, in render_string_horizontal
    font, fontsize, scaling, rise, cid)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/converter.py", line 112, in render_char
    textwidth = font.char_width(cid)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 517, in char_width
    return self.widths[cid] * self.hscale
TypeError: unsupported operand type(s) for *: 'PDFObjRef' and 'float'
When the exception occurred, cid was 7 and self.hscale was 0.001. As the traceback shows, self.widths[7] is not a float but a PDFObjRef. Probing the object in pdb gave:
(Pdb) p self.widths[7].__dict__
{'doc': <pdfminer.pdfdocument.PDFDocument object at 0x10d23d310>, 'objid': 11}
I tried stepping through with pdb, but couldn't figure out where the PDFObjRef was inserted. In pdffont.py, two functions seem like likely candidates, get_widths(seq) and get_widths2(seq), though which one runs depends on the input.
I can provide the problematic pdf if anyone would find it useful. I'll have to go through and scrub confidential information though, so I'll hold off for now. -Jeremy
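To illustrate the failure mode described above, here is a minimal, self-contained sketch. It does not use pdfminer itself: FakeObjRef is a hypothetical stand-in for PDFObjRef, and resolve_value, build_widths, and char_width are illustrative names, not the library's API or the fix from the linked PR. The idea is simply that entries in a font's width list may be indirect references, so they need to be resolved to numbers before being multiplied by hscale. (pdfminer ships a helper with a similar purpose, resolve1, in pdftypes.)

```python
from typing import Any, Dict, List


class FakeObjRef:
    """Hypothetical stand-in for pdfminer's PDFObjRef (an indirect reference)."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def resolve(self) -> Any:
        # The real class looks the object up in the document by its objid;
        # this stand-in just returns the value it was constructed with.
        return self._target


def resolve_value(value: Any) -> Any:
    """Follow indirect references until a direct value is reached."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


def build_widths(first_char: int, raw_widths: List[Any]) -> Dict[int, float]:
    """Map character IDs to numeric widths, resolving references up front."""
    return {
        first_char + i: float(resolve_value(w))
        for i, w in enumerate(raw_widths)
    }


def char_width(widths: Dict[int, float], cid: int, hscale: float = 0.001) -> float:
    """Scale a stored width; safe because every entry is already a float."""
    return widths[cid] * hscale


# One entry is an indirect reference, as with self.widths[7] in the report.
raw = [500, 600, FakeObjRef(722)]
widths = build_widths(5, raw)
scaled = char_width(widths, 7)
```

Without the resolution step, multiplying the reference object by a float raises the same kind of TypeError shown in the traceback.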
Can you edit the description of this issue to include the information from the original issue? This helps to document everything in one place. Copying and pasting it from the original issue is fine by me, if there are no specific pdfminer2 or pdfminer.six details.